Apache Airflow or Apache Beam for data processing and job scheduling

LouisB · May 9, 2018 · Viewed 20k times

I'm trying to give useful information but I am far from being a data engineer.

I am currently using the Python library pandas to execute a long series of transformations on my data, which has a lot of inputs (currently CSV and Excel files). The outputs are several Excel files. I would like to be able to execute scheduled, monitored batch jobs with parallel computation (I mean, not as sequential as what I'm doing with pandas), once a month.

I don't really know Beam or Airflow; I quickly read through the docs and it seems that both can achieve that. Which one should I use?

Answer

cryanbhu · Jul 25, 2018

The other answers are quite technical and hard to understand. I was in your position before so I'll explain in simple terms.

Airflow can do anything. It has a BashOperator and a PythonOperator, which means it can run any bash script or any Python script.
It is a way to organize (set up complicated data pipeline DAGs), schedule, monitor, and trigger re-runs of data pipelines, in an easy-to-view and easy-to-use UI.
Also, it is easy to set up, and everything is in familiar Python code.
Doing pipelines in an organized manner (i.e. using Airflow) means you don't waste time debugging a mess of data processing (cron) scripts scattered all over the place.
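To give a concrete feel for this, here's a minimal sketch of a monthly-scheduled DAG with a single PythonOperator task. The DAG id, file names, and the transform body are hypothetical placeholders (import paths also vary between Airflow versions; this uses the 1.x path):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def transform():
    # Hypothetical pandas step standing in for your transformations.
    import pandas as pd
    df = pd.read_csv("input.csv")
    df.to_excel("output.xlsx", index=False)


with DAG(
    dag_id="monthly_transform",    # hypothetical name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@monthly",  # the once-a-month schedule from the question
    catchup=False,
) as dag:
    PythonOperator(task_id="transform", python_callable=transform)
```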

Apache Beam is a wrapper around the many data processing frameworks (Spark, Flink, etc.) out there.
The intent is that you learn just Beam once and can then run your pipelines on multiple backends (Beam runners).
If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.
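As a rough sketch of what that looks like (file names here are hypothetical placeholders), a Beam pipeline is written once and the backend is chosen via pipeline options; the DirectRunner below executes locally:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner runs the pipeline on the local machine, useful for testing.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("input.csv")    # hypothetical input
        | "Parse" >> beam.Map(lambda line: line.split(","))
        | "NonEmpty" >> beam.Filter(lambda fields: any(f.strip() for f in fields))
        | "Format" >> beam.Map(",".join)
        | "Write" >> beam.io.WriteToText("output")       # hypothetical output prefix
    )
```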

Google Cloud Platform's Cloud Dataflow is one backend you can run Beam on; GCP calls it the Dataflow runner.
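Switching the sketch above to Dataflow is, to a first approximation, just a change of pipeline options; the project id, region, and bucket below are hypothetical placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Same pipeline code as before; only the options change.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # hypothetical project id
    region="us-central1",
    temp_location="gs://my-bucket/tmp",  # hypothetical staging bucket
)
```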

GCP's offering, Cloud Composer, is a managed Airflow implementation as a service, running in a Kubernetes cluster in Google Kubernetes Engine (GKE).

So your options are:
- a self-managed Airflow deployment, doing data processing on the instance itself (if your data is small, or your instance is powerful enough, you can process data on the machine running Airflow; this is why many people are confused about whether Airflow can process data or not)
- a self-managed Airflow deployment calling Beam jobs
- Cloud Composer (managed Airflow as a service) calling jobs in Cloud Dataflow
- Cloud Composer running data processing containers in Composer's own Kubernetes cluster, using Airflow's KubernetesPodOperator (KPO)
- Cloud Composer running data processing containers with Airflow's KPO as above, but in a better-isolated fashion, by creating a new node pool and specifying that the KPO pods run in that node pool (see the sketch after this list)
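For the last two options, a hedged sketch of a KPO task (it would sit inside a DAG like the earlier one; the image and node-pool names are hypothetical, and the import path and node_selectors parameter are the Airflow 1.x contrib versions, which newer provider releases rename):

```python
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

process = KubernetesPodOperator(
    task_id="process_data",
    name="process-data",
    namespace="default",
    image="gcr.io/my-project/processing:latest",  # hypothetical container image
    # Pin the pod to a dedicated node pool for the better-isolated variant.
    node_selectors={"cloud.google.com/gke-nodepool": "data-processing-pool"},
)
```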

My personal experience:
Airflow is lightweight and not difficult to learn (and easy to set up), so you should use it for your data pipelines whenever possible.
Also, since many companies are looking for Airflow experience, if you want to be a data engineer you should probably learn it.
Also, managed Airflow (I've only used GCP's Composer so far) is much more convenient than running Airflow yourself and managing the Airflow webserver and scheduler processes.