I'm using airflow to orchestrate some python scripts. I have a "main" dag from which several subdags are run. My main dag is supposed to run according to the following overview:
I've managed to get to this structure in my main dag by using the following lines:
etl_internal_sub_dag1 >> etl_internal_sub_dag2 >> etl_internal_sub_dag3
etl_internal_sub_dag3 >> etl_adzuna_sub_dag
etl_internal_sub_dag3 >> etl_adwords_sub_dag
etl_internal_sub_dag3 >> etl_facebook_sub_dag
etl_internal_sub_dag3 >> etl_pagespeed_sub_dag
etl_adzuna_sub_dag >> etl_combine_sub_dag
etl_adwords_sub_dag >> etl_combine_sub_dag
etl_facebook_sub_dag >> etl_combine_sub_dag
etl_pagespeed_sub_dag >> etl_combine_sub_dag
What I want airflow to do is to first run the etl_internal_sub_dag1
then the etl_internal_sub_dag2
and then the etl_internal_sub_dag3
. When the etl_internal_sub_dag3
is finished I want etl_adzuna_sub_dag
, etl_adwords_sub_dag
, etl_facebook_sub_dag
, and etl_pagespeed_sub_dag
to run in parallel. Finally, when these last four scripts are finished, I want the etl_combine_sub_dag
to run.
However, when I run the main dag, etl_adzuna_sub_dag
, etl_adwords_sub_dag
, etl_facebook_sub_dag
, and etl_pagespeed_sub_dag
are run one by one and not in parallel.
Question: How do I make sure that the scripts etl_adzuna_sub_dag
, etl_adwords_sub_dag
, etl_facebook_sub_dag
, and etl_pagespeed_sub_dag
are run in parallel?
Edit: My default_args
and DAG
look like this:
default_args = {
'owner': 'airflow',
'depends_on_past': False,
'start_date': start_date,
'end_date': end_date,
'email': ['[email protected]'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 0,
'retry_delay': timedelta(minutes=5),
}
DAG_NAME = 'main_dag'
dag = DAG(DAG_NAME, default_args=default_args, catchup = False)
You will need to use LocalExecutor
.
Check your configs (airflow.cfg
), you might be using SequentialExectuor
which executes tasks serially.
Airflow uses a Backend database to store metadata. Check your airflow.cfg
file and look for executor
keyword. By default, Airflow uses SequentialExecutor
which would execute task sequentially no matter what. So to allow Airflow to run tasks in Parallel you will need to create a database in Postges or MySQL and configure it in airflow.cfg
(sql_alchemy_conn
param) and then change your executor to LocalExecutor
in airflow.cfg
and then run airflow initdb
.
Note that for using LocalExecutor
you would need to use Postgres or MySQL instead of SQLite as a backend database.
More info: https://airflow.incubator.apache.org/howto/initialize-database.html
If you want to take a real test drive of Airflow, you should consider setting up a real database backend and switching to the LocalExecutor. As Airflow was built to interact with its metadata using the great SqlAlchemy library, you should be able to use any database backend supported as a SqlAlchemy backend. We recommend using MySQL or Postgres.