---
title: "Core Concepts"
page-category: "searchable"
---

## Introduction
This guide covers the most important high-level concepts in Airflow, the core abstractions that Conductor builds on top of them, and how the two groups of concepts relate.

At its core, Conductor is a **see-through wrapper** around **Airflow** that supplies Airflow operators with sensible **default configuration** and **auto-provisioned AWS resources**.
Let's break that down:
- **Airflow** is the underlying orchestration framework.
It is a time-tested Apache project for scheduling and monitoring workflows, composed of tasks organized in a directed acyclic graph (DAG).
- Conductor is a **see-through wrapper** for Airflow in the sense that the final DAGs produced by Conductor are simply Airflow DAGs that do not import from Conductor code, meaning that any Airflow cluster can run the DAG.
- The functionality that Conductor offers, then, is strong **default configuration** for Airflow's very flexible suite of operators, handling as much boilerplate as possible so that you can focus on business logic.
- Finally, part of the aforementioned Conductor defaults is the ability to **auto-provision AWS resources** with commonly used infrastructure patterns, allowing you to, in most cases, avoid thinking about how and where your business logic will run.

The following sections will explain the core concepts in Airflow and in Conductor in more detail.

## Airflow Concepts
[Airflow's Core Concepts](https://airflow.apache.org/docs/apache-airflow/stable/concepts/index.html) details the core abstractions provided by Airflow.
Among these, the most important are the **Operator**, **Task**, and **DAG**, which define how your workflows are built.

#### Airflow Operators and Tasks
Operators define *how* a task is run.
Every operator implements an `__init__` constructor and an `execute` method.

```python
from airflow.models import BaseOperator


class MyOperator(BaseOperator):
    def __init__(self, task_id, *args, **kwargs):
        ...

    def execute(self, context):
        ...
```

The `execute` method defines *what* to run, and contains the logic that is executed by the Airflow workers.
The constructor is the user's way to configure the operator and tell the `execute` method what to do.

When you instantiate an operator, the resulting object is called a **task**.
In other words, an operator is a predefined template for creating a task.

```python
train = MyOperator(task_id="train")  # train is a task
```
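
The operator/task distinction can be illustrated without an Airflow installation. The sketch below uses plain Python stand-ins: `ToyOperator` and `EchoOperator` are hypothetical names, not Airflow classes, and the `context` dict is a simplified placeholder for what the scheduler would pass to `execute`.

```python
class ToyOperator:
    """Hypothetical stand-in for an Airflow operator (not BaseOperator)."""

    def __init__(self, task_id):
        # The constructor only stores configuration; nothing runs here.
        self.task_id = task_id

    def execute(self, context):
        raise NotImplementedError


class EchoOperator(ToyOperator):
    """Toy operator whose only configuration is a message."""

    def __init__(self, task_id, message):
        super().__init__(task_id)
        self.message = message

    def execute(self, context):
        # The logic a worker would run, driven by the constructor arguments.
        return f"{self.task_id}: {self.message} (run {context['run_id']})"


# One operator class (the template), two tasks (its instances).
extract = EchoOperator(task_id="extract", message="pulling rows")
train = EchoOperator(task_id="train", message="fitting model")

# A worker would call execute() with a context supplied by the scheduler.
result = train.execute({"run_id": "manual_1"})
```

Here `EchoOperator` is the template, while `extract` and `train` are two distinct tasks created from it.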

The Task is the atomic unit of *execution* within Airflow, i.e. the smallest self-contained unit of logic that an Airflow worker can run.
Examples of tasks are:
- Run a SQL query: `select_data = PostgresOperator(task_id="select_data", sql="SELECT * FROM ...")`
- Train a model: `train = SageMakerTrainingOperator(...)`
- Deploy an endpoint: `deploy = SageMakerEndpointOperator(...)`

#### Airflow DAG
The DAG is the atomic unit of *scheduling* within Airflow, and corresponds to a complete end-to-end workflow.
DAGs consist of a parent Airflow `DAG` object, with child tasks composed with the `>>` and `<<` operators, for example:

```python
import datetime as dt
from airflow.models import DAG
from airflow.operators.dummy import DummyOperator

dag = DAG(
    dag_id='example_dag',
    schedule_interval='@once',
    start_date=dt.datetime(2021, 5, 21)
)

run_this_first = DummyOperator(task_id='run_this_first', dag=dag)

branch_a = DummyOperator(task_id='branch_a', dag=dag)
branch_b = DummyOperator(task_id='branch_b', dag=dag)

run_this_last = DummyOperator(task_id='run_this_last', dag=dag)

run_this_first >> branch_a
run_this_first >> branch_b
branch_a >> run_this_last << branch_b
```
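
To show what the `>>` and `<<` operators record, here is a small sketch in plain Python. `ToyTask` is a hypothetical stand-in, not Airflow's implementation; it only tracks upstream dependencies, and it uses the standard library's `graphlib` (Python 3.9+) to derive one valid execution order for the same graph shape as the example above.

```python
from graphlib import TopologicalSorter  # Python 3.9+


class ToyTask:
    """Hypothetical stand-in for an Airflow task; tracks upstream tasks."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()

    def __rshift__(self, other):
        # `a >> b` means "b runs after a"; return b so chains keep going.
        other.upstream.add(self)
        return other

    def __lshift__(self, other):
        # `a << b` means "a runs after b"; return b so chains keep going.
        self.upstream.add(other)
        return other


first, a, b, last = (ToyTask(name) for name in ("first", "a", "b", "last"))
first >> a
first >> b
a >> last << b  # evaluated as (a >> last) << b

# Map each task id to the ids it depends on, then derive a valid run order.
graph = {t.task_id: {u.task_id for u in t.upstream} for t in (first, a, b, last)}
order = list(TopologicalSorter(graph).static_order())
```

In this sketch `first` always comes first and `last` always comes last, while the relative order of the two branches is left open because neither depends on the other.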

Airflow detects your DAGs by scanning for `.py` files in a configured location, typically `~/airflow/dags/` for a local installation or an S3 path such as `s3://my-airflow-env/dags/` for AWS managed Airflow.
Once detected, the DAGs appear in the Airflow UI:

![Example DAG in the Airflow UI]({{"/assets/images/example_dag.png" | relative_url}})


## Conductor Concepts
The core abstractions in Conductor are the **Project Configuration**, the **Conductor** class, and **Conductor Operators**.
The following sections will explain these in more detail, and relate them to the Airflow core concepts they are built on top of.

#### Conductor Project Config
The Conductor Project configuration class determines the AWS resources available to DAGs within your project.
Resources are organized into environments using `EnvironmentConfig`, and passed to the `Project` class as a dict that maps each environment name to its `EnvironmentConfig` object:

```py
# project_config.py
from conductor.config import (
    EnvironmentConfig,
    Project,
    AirflowConfig,
)

staging_airflow_config = AirflowConfig(mwaa_environment_name="my-cluster-name")
staging_config = EnvironmentConfig(
    account_id="123456789",
    account_name="my-account-name",
    default_region="us-west-2",
    airflow=staging_airflow_config,
)

project = Project(name="follow-bot-detection", environments={"staging": staging_config})
```

You can select one of these environments when running any CLI command, for example `conductor --env staging deploy`.

Conductor uses the `Project` configuration to automatically provision resources, such as S3 Buckets and IAM roles, necessary to execute DAGs, and passes URIs for these resources to the Conductor operators.
The names of these resources are designed to be collision-resistant within the same environment, so you can, for example, use the same environment for `staging` and `dev` without worrying about duplicate-name errors.
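
Conductor's actual naming scheme is not described here. As an illustration only, one common way to make resource names collision-resistant is to append a short hash of the identifying fields to a readable prefix. The function `resource_name`, its parameters, and the `artifacts` resource kind below are assumptions made for this sketch.

```python
import hashlib


def resource_name(project, env_name, kind, max_length=63):
    """Hypothetical scheme: readable prefix plus a short, stable hash suffix.

    The default max_length of 63 matches the S3 bucket name limit.
    """
    key = f"{project}/{env_name}/{kind}"
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:8]
    prefix = f"{project}-{env_name}-{kind}"
    # Truncate the readable part, never the hash, to respect the limit.
    return f"{prefix[: max_length - len(digest) - 1]}-{digest}"


staging_bucket = resource_name("follow-bot-detection", "staging", "artifacts")
dev_bucket = resource_name("follow-bot-detection", "dev", "artifacts")
```

Because the hash is derived only from the inputs, the same project, environment name, and resource kind always produce the same name, while any difference in those fields produces a different one.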

#### Conductor Class
The Conductor Class is a wrapper around Airflow DAGs and Operators that configures them with default behavior.

```python
import airflow
from project_config import project
from conductor.core import Conductor

default_args = {
    "start_date": airflow.utils.dates.days_ago(0)
}

c = Conductor(
    "my-dag",
    project=project,
    schedule_interval="@once",
    default_args=default_args,
)
dag = c.dag
```

This class automatically creates an Airflow DAG when instantiated, exposed as `c.dag` in the last line of the snippet above, meaning that it is 1:1 with Airflow DAGs.
It takes the `Project` object that is created in `project_config.py` as an argument, and uses it to point Conductor operators to AWS resources.

#### Conductor Operators
Conductor operators are not Airflow operators! They do not inherit from `BaseOperator` or implement the `execute` method, and are never seen by the Airflow schedulers or workers.
Rather, they generate default configuration from the `Project` configuration, and supply it to existing Airflow operators as opinionated defaults that reduce the amount of code necessary to build your workflows.

The resulting tasks are returned in a `TaskWrapper` class, which has the `task` attribute, a pointer to the underlying Airflow task, as well as an `outputs` attribute that contains the operator's output values.

```python
from airflow.operators.dummy import DummyOperator
from airflow.providers.amazon.aws.operators.sagemaker_processing import (
    SageMakerProcessingOperator,
)
from conductor.core import Conductor
from project_config import project

c = Conductor(...)
dag = c.dag
process_data = c.operators.SageMakerProcessingOperator(...)

process_data.task  # Underlying Airflow task.
assert isinstance(process_data.task, SageMakerProcessingOperator)
process_data.outputs  # Output values.

dummy_airflow_task = DummyOperator(...)

process_data >> dummy_airflow_task
```

Notice that in the above example we can use the `process_data` variable, which is a `TaskWrapper` instance, directly to compose a DAG, including with tasks generated by non-Conductor operators.
This is because the `TaskWrapper` contains syntactic sugar that forwards the underlying task to the Airflow DAG directly, to reduce the verbosity of the DAG definition.
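
The forwarding behavior can be sketched in plain Python. This is an assumption about how such sugar might work, not Conductor's actual `TaskWrapper` code; `ToyTask` and `ToyTaskWrapper` are hypothetical stand-ins that need no Airflow installation.

```python
class ToyTask:
    """Hypothetical stand-in for an Airflow task."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream_ids = []

    def __rshift__(self, other):
        # `a >> b`: record b as downstream of a and return b for chaining.
        self.downstream_ids.append(other.task_id)
        return other


class ToyTaskWrapper:
    """Hypothetical wrapper that forwards `>>` to the task it holds."""

    def __init__(self, task, outputs=None):
        self.task = task
        self.outputs = outputs or {}

    def __rshift__(self, other):
        # Unwrap the right-hand side if needed, then delegate to the real task.
        target = other.task if isinstance(other, ToyTaskWrapper) else other
        self.task >> target
        return other


process_data = ToyTaskWrapper(ToyTask("process_data"))
plain_task = ToyTask("plain_task")
train = ToyTaskWrapper(ToyTask("train"))

process_data >> plain_task  # wrapper >> plain task
process_data >> train       # wrapper >> wrapper
```

In both cases the dependency is recorded on the underlying task, so the wrapper itself never needs to be part of the graph.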

Conductor operators offer opinionated default functionality, but leave the door open for you to customize their behavior.
For example, the Conductor `SageMakerTrainingOperator` writes your model file to a default location, but this location can be overridden:
```python
train = c.operators.SageMakerTraining(task_id="train", model_cld=MyModel)
train.outputs.s3_url  # Default URL.

train_custom = c.operators.SageMakerTraining(
    task_id="train_custom",
    model_cld=MyModel,
    config={
        "OutputDataConfig": {"S3OutputPath": "s3://my/custom/path/"},  # Custom URL.
    }
)
# Or override the task config post instantiation.
train_custom.task.config["OutputDataConfig"] = {"S3OutputPath": "s3://my/custom/path/"}
```
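
How Conductor combines its defaults with a user-supplied `config` is not specified here. One plausible approach is a recursive dictionary merge in which user values win; the sketch below assumes that approach, and `merge_config` and the default values are hypothetical.

```python
import copy


def merge_config(defaults, overrides):
    """Return a new dict where `overrides` win; nested dicts merge key by key."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults; key names follow the SageMaker training job request.
defaults = {
    "OutputDataConfig": {"S3OutputPath": "s3://default-bucket/models/"},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.m5.large"},
}
overrides = {"OutputDataConfig": {"S3OutputPath": "s3://my/custom/path/"}}

config = merge_config(defaults, overrides)
```

With a merge like this, overriding one nested key leaves the remaining defaults, such as `ResourceConfig`, untouched, and the original `defaults` dict is not mutated.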

If you find yourself overriding a large portion of the default values of a Conductor operator, such that the defaults are not saving you any time, then it may be easier to build that particular task using an Airflow operator directly.
