Recently I have tested airflow so much that have one problem with execution_date
when running airflow trigger_dag <my-dag>
.
I have learned that execution_date
is not what we think at first time from here:
Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if I want to summarize data for 2016-02-19, I would do it at 2016-02-20 midnight GMT, which would be right after all data for 2016-02-19 becomes available.
start_date = datetime.combine(datetime.today(),
datetime.min.time())
args = {
"owner": "xigua",
"start_date": start_date
}
dag = DAG(dag_id="hadoopprojects", default_args=args,
schedule_interval=timedelta(days=1))
wait_5m = ops.TimeDeltaSensor(task_id="wait_5m",
dag=dag,
delta=timedelta(minutes=5))
Above codes is the start part of my daily workflow, the first task is a TimeDeltaSensor that waits another 5 minutes before actual work, so this means my dag will be triggered at 2016-09-09T00:05:00
, 2016-09-10T00:05:00
... etc.
In Web UI, I can see something like scheduled__2016-09-20T00:00:00
, and task is run at 2016-09-21T00:00:00
, which seems reasonable according to ETL
model.
However someday my dag is not triggered for unknown reason, so I trigger it manually, if I trigger it at 2016-09-20T00:10:00
, then the TimeDeltaSensor will wait until 2016-09-21T00:15:00
before run.
This is not what I want, I want it to run at 2016-09-20T00:15:00
not the next day, I have tried passing execution_date
through --conf '{"execution_date": "2016-09-20"}'
, but it doesn't work.
How should I deal with this issue ?
$ airflow version
[2016-09-21 17:26:33,654] {__init__.py:36} INFO - Using executor LocalExecutor
____________ _____________
____ |__( )_________ __/__ /________ __
____ /| |_ /__ ___/_ /_ __ /_ __ \_ | /| / /
___ ___ | / _ / _ __/ _ / / /_/ /_ |/ |/ /
_/_/ |_/_/ /_/ /_/ /_/ \____/____/|__/
v1.7.1.3
First, I recommend you use constants for start_date
, because dynamic ones would act unpredictably based on with your airflow pipeline is evaluated by the scheduler.
More information about start_date
here in an FAQ entry that I wrote and sort all this out:
https://airflow.apache.org/faq.html#what-s-the-deal-with-start-date
Now, about execution_date
and when it is triggered, this is a common gotcha for people onboarding on Airflow. Airflow sets execution_date
based on the left bound of the schedule period it is covering, not based on when it fires (which would be the right bound of the period). When running an schedule='@hourly'
task for instance, a task will fire every hour. The task that fires at 2pm will have an execution_date
of 1pm because it assumes that you are processing the 1pm to 2pm time window at 2pm. Similarly, if you run a daily job, the run an with execution_date
of 2016-01-01
would trigger soon after midnight on 2016-01-02
.
This left-bound labelling makes a lot of sense when thinking in terms of ETL and differential loads, but gets confusing when thinking in terms of a simple, cron-like scheduler.