Apache Airflow has become a cornerstone tool in the data engineering world, offering robust workflow orchestration capabilities. As someone who has used Airflow extensively, I can attest to its flexibility and power in automating ETL processes. In this guide, I’ll walk you through setting up Airflow using Docker and creating and scheduling Directed Acyclic Graphs (DAGs), and share some best practices, including integration with dbt.

Setting Up Apache Airflow with Docker

Using Docker to set up Airflow simplifies the installation process and ensures a consistent environment across different setups.

  1. Prerequisites:
    Ensure you have Docker and Docker Compose installed on your machine.
  2. Docker Compose Setup:
    Create a directory for your Airflow project and navigate into it.
  3. Inside this directory, create a docker-compose.yml file. I’ve always started from the one provided by Airflow and customised it as required (see the command below).
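If you want the same starting point, the baseline docker-compose.yaml can be downloaded from the Airflow documentation site. The URL is pinned to an Airflow release, so check the “Running Airflow in Docker” guide for the exact command for the version you want to run; it looks something like this (the version number here is just an example):

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.9.2/docker-compose.yaml'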

Starting Airflow:

Run the following command to start the services:

docker-compose up -d
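If you’re starting from the official docker-compose.yaml, the very first run usually needs a one-off initialisation step, which migrates the metadata database and creates a default airflow / airflow login:

docker-compose up airflow-init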

Access the Airflow web interface by navigating to http://localhost:8080 in your browser.

[Image: Airflow web view]

If you want to see the running services, list the containers; the output should look something like this:

docker ps

[Image: Airflow Docker containers listed by docker ps]

Creating and Scheduling DAGs

Creating a DAG:

DAGs (Directed Acyclic Graphs) are the core concept in Airflow, defining the workflow of tasks.

Create a Python file in the dags directory. For example, example_dag.py:

from airflow import DAG
from airflow.operators.dummy import DummyOperator
from airflow.utils.dates import days_ago

default_args = {
    'owner': 'airflow',
    'start_date': days_ago(1),
    'retries': 1,
}

dag = DAG(
    'example_dag',
    default_args=default_args,
    description='A simple example DAG',
    schedule_interval='@daily',
)

start = DummyOperator(task_id='start', dag=dag)
end = DummyOperator(task_id='end', dag=dag)

start >> end
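Note that on newer Airflow releases (2.3+), DummyOperator is deprecated in favour of EmptyOperator, and an explicit pendulum start date is generally preferred over days_ago. A roughly equivalent sketch of the same DAG in that style:

import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    'example_dag',
    start_date=pendulum.datetime(2024, 1, 1, tz='UTC'),  # fixed start date instead of days_ago
    schedule_interval='@daily',
    catchup=False,  # don't backfill runs between start_date and today
) as dag:
    start = EmptyOperator(task_id='start')
    end = EmptyOperator(task_id='end')

    start >> end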

Scheduling the DAG:

The schedule_interval parameter in the DAG definition controls how frequently the DAG runs. In the example above, the DAG is scheduled to run daily.
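Besides presets like @daily, schedule_interval also accepts standard cron expressions and timedelta objects (and in Airflow 2.4+ the parameter is gradually being replaced by schedule, so check the version you’re running). For instance, a hypothetical DAG that should run at 06:00 on weekdays only:

dag = DAG(
    'weekday_morning_dag',  # hypothetical dag_id
    default_args=default_args,
    schedule_interval='0 6 * * 1-5',  # cron expression: 06:00, Monday to Friday
)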

Best Practices for Using Apache Airflow

  1. Modularize Your Code:
    Break down your DAGs into smaller, reusable components to keep the code clean and manageable.

  2. Use Version Control:
    Store your DAGs and configurations in a version control system like Git to keep track of changes and collaborate effectively.

  3. Monitor and Alert:
    Set up monitoring and alerts for your DAGs to get notified of failures or performance issues. Airflow provides built-in support for email notifications.

  4. Use Variables and Connections:
    Store environment-specific information such as API keys and database connections in Airflow’s Variables and Connections to keep your code environment-agnostic (see the sketch after this list).
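As a rough sketch of the Variables and Connections point, assuming a Variable and a Connection with these (hypothetical) names have already been created in the Airflow UI or via the CLI:

from airflow.hooks.base import BaseHook
from airflow.models import Variable
from airflow.operators.python import PythonOperator

def extract_from_api():
    # Read at task runtime rather than at DAG parse time, so the scheduler
    # doesn't hit the metadata database every time it parses this file.
    api_key = Variable.get('my_service_api_key')          # hypothetical Variable key
    conn = BaseHook.get_connection('my_warehouse_conn')   # hypothetical conn_id
    print(f"Calling API with key ending ...{api_key[-4:]}, loading into {conn.host}")

extract = PythonOperator(
    task_id='extract_from_api',
    python_callable=extract_from_api,
    dag=dag,
)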

Integrating Apache Airflow with dbt

dbt Setup:
Ensure dbt is installed and configured in your environment. Define your dbt models and tests as usual.
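If dbt isn’t installed yet, it typically comes in via pip, together with an adapter that matches your warehouse (dbt-postgres here is just an example):

pip install dbt-core dbt-postgres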

Creating an Airflow Task for dbt:

Create a task in your DAG to run dbt commands. For example:

from airflow.operators.bash import BashOperator

dbt_run = BashOperator(
    task_id='dbt_run',
    bash_command='dbt run',
    dag=dag,
)

start >> dbt_run >> end

Scheduling dbt Tasks:

Schedule your dbt tasks within Airflow DAGs to automate the transformation processes seamlessly. This ensures that your data transformations are part of a managed and monitored workflow.
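For example, a dbt test step can be chained after the run so the tests are part of the same scheduled workflow (a sketch, assuming dbt is on the worker’s PATH and, as with the dbt_run task above, that the command is executed from the dbt project directory):

dbt_test = BashOperator(
    task_id='dbt_test',
    bash_command='dbt test',
    dag=dag,
)

start >> dbt_run >> dbt_test >> end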

Conclusion

Apache Airflow is a powerful tool for automating ETL workflows, offering flexibility and robustness that can significantly enhance your data engineering processes. By setting up Airflow with Docker, creating and scheduling DAGs, and following best practices, you can streamline your workflows efficiently. Additionally, integrating dbt with Airflow can further automate and manage your data transformations, ensuring high-quality data pipelines.

I’m Paul

Welcome to RedJamJar Software Solutions. I’m a technology consultant focused on data engineering.

Let’s connect