What is Apache Airflow?

What is it?

I work on a system called Apache Airflow.

Airflow is an orchestration system, which means it schedules work and runs it for you. Its primary use case is ETL, i.e. extracting data from someplace, transforming it in some way, then loading (moving) it to another place.
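To make the three ETL steps concrete, here is a minimal sketch in plain Python. It does not use Airflow's API; the function names, the sample records, and the list standing in for a destination table are all made up for illustration.

```python
# A toy ETL pipeline in plain Python (NOT Airflow's API).
# Every name and sample value below is hypothetical.

def extract():
    """Pretend to pull raw rows from a source system."""
    return [
        {"rider": "a", "fare_cents": "1250"},
        {"rider": "b", "fare_cents": "830"},
    ]

def transform(rows):
    """Convert the fare from a string of cents to a number of dollars."""
    return [
        {"rider": row["rider"], "fare_usd": int(row["fare_cents"]) / 100}
        for row in rows
    ]

def load(rows, destination):
    """Append the cleaned rows to the destination and report how many were written."""
    destination.extend(rows)
    return len(rows)

warehouse = []  # a list standing in for the destination table
loaded = load(transform(extract()), warehouse)
print(loaded, warehouse)
```

In an orchestrator, each of these functions would typically become its own task so that a failure in one step can be retried without redoing the others.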

High-level architecture

There are 4 components to the system.

  1. The web server – the UI for tracking the status of jobs, which users write as DAGs (directed acyclic graphs)
  2. The scheduler – a service that triggers your DAGs on the schedule you specify and hands their tasks off to be executed elsewhere (usually on worker nodes)
  3. The executor – the component responsible for actually getting tasks run, either locally or by handing them to workers
  4. Metadata database – the data store that holds everything Airflow needs to run: which tasks are running, their dependencies, and DAG status, among many other things
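How the scheduler, executor, and metadata database fit together can be sketched with a toy model. This is not how Airflow is implemented; the dictionaries and function names below are hypothetical stand-ins, with a dict playing the metadata database and a plain function playing the executor.

```python
# Toy model of scheduler + executor + metadata store (NOT Airflow's real classes).

# The DAG: task name -> set of upstream tasks that must succeed first.
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# Stand-in for the metadata database: task name -> current state.
metadata = {task: "pending" for task in dag}
run_order = []

def execute(task):
    """Stand-in for the executor: run the task and report its final state."""
    run_order.append(task)
    return "success"

def schedule_once():
    """One scheduler pass: run every pending task whose upstreams all succeeded."""
    ready = [
        task
        for task, upstream in dag.items()
        if metadata[task] == "pending"
        and all(metadata[u] == "success" for u in upstream)
    ]
    for task in ready:
        metadata[task] = execute(task)
    return ready

# Keep scheduling until nothing is pending; bail out if no progress is possible.
while any(state == "pending" for state in metadata.values()):
    if not schedule_once():
        raise RuntimeError("no runnable tasks; the graph must contain a cycle")

print(run_order, metadata)
```

The point of the sketch is the division of labor: the scheduler only decides what is ready, the executor does the running, and both communicate through the shared state store, which is also what a UI would read to show status.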

In theory you could run all of this on a single machine, but big companies usually run a distributed architecture: the web server as its own service, the scheduler as its own, the metadata database on another node, and a pool of worker nodes that scales horizontally as load grows.

Forgive my bad drawing, but at a high level, it looks something like this.

Of course, between the scheduler and the workers you usually have a queue: the scheduler pushes work onto it and the workers consume from it.
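The queue pattern can be sketched with Python's standard library. This is only an illustration under simplifying assumptions: threads stand in for worker nodes, an in-process queue.Queue stands in for the real queue, and the task names are made up. It also assumes the scheduler has already resolved dependencies before enqueueing, so the workers may pick tasks up in any order.

```python
import queue
import threading

task_queue = queue.Queue()        # stand-in for the queue between scheduler and workers
results = []                      # (worker name, task) pairs, in completion order
results_lock = threading.Lock()

def worker(name):
    """Consume tasks from the queue until a None sentinel arrives."""
    while True:
        task = task_queue.get()
        if task is None:          # sentinel: no more work for this worker
            task_queue.task_done()
            break
        with results_lock:
            results.append((name, task))
        task_queue.task_done()

# Start a small pool of workers; adding more threads is the toy version of scaling out.
workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for w in workers:
    w.start()

# The "scheduler" side: push work, then one sentinel per worker.
for task in ["extract", "transform", "load", "report"]:
    task_queue.put(task)
for _ in workers:
    task_queue.put(None)

task_queue.join()                 # block until every queued item has been processed
for w in workers:
    w.join()

done = sorted(task for _, task in results)
print(done)
```

Because the scheduler only talks to the queue, it does not need to know how many workers exist, which is what lets the worker pool grow or shrink independently.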

Published by Paul Young-Suk Lee

SWE @lyft. Currently working on data infrastructure
