Airflow Basic Concepts
Airflow is a workflow management system. A workflow here means a set of tasks and the dependencies between them, visualized as a directed acyclic graph (DAG), where each node represents a task and each edge represents a dependency between tasks.
In addition, a DAG has attributes of its own, such as dag_id, start_date, schedule_interval, default_args, and so on. These can be defined when the DAG is initialized, and they are inherited by every task in the DAG.
from datetime import datetime, timedelta
from airflow import DAG

# each Workflow/DAG must have a unique text identifier
WORKFLOW_DAG_ID = 'example_workflow_dag'
# start/end times are datetime objects
# here we start execution on Jan 1st, 2017
WORKFLOW_START_DATE = datetime(2017, 1, 1)
# schedule/retry intervals are timedelta objects
# here we execute the DAG's tasks every day
WORKFLOW_SCHEDULE_INTERVAL = timedelta(days=1)
# default arguments are applied by default to all tasks
# in the DAG
WORKFLOW_DEFAULT_ARGS = {
    'owner': 'example',
    'depends_on_past': False,
    'start_date': WORKFLOW_START_DATE,
    'email': ['example@example_company.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5),
}
# initialize the DAG
dag = DAG(
    dag_id=WORKFLOW_DAG_ID,
    start_date=WORKFLOW_START_DATE,
    schedule_interval=WORKFLOW_SCHEDULE_INTERVAL,
    default_args=WORKFLOW_DEFAULT_ARGS,
)
Tasks fall into two types (a minimal sketch contrasting the two follows this list):
- Operators: perform a concrete action
- Sensors: poll the state of a process or a piece of data and wait until a condition is met
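As a rough illustration of the difference (not from the original article; the FileSensor import path assumes Airflow 2.x, and the /tmp/ready.flag path is hypothetical), a Sensor waits for a condition while an Operator does the actual work:

from airflow.operators.bash_operator import BashOperator
from airflow.sensors.filesystem import FileSensor  # assumed Airflow 2.x import path

# Sensor: polls every 30 seconds until the flag file appears (or times out)
wait_for_file = FileSensor(
    task_id='wait_for_flag_file',
    filepath='/tmp/ready.flag',   # hypothetical file produced by an upstream process
    poke_interval=30,
    timeout=600,
    dag=dag,
)
# Operator: performs the actual work once the sensor has succeeded
process_file = BashOperator(
    task_id='process_flag_file',
    bash_command='echo "file arrived, processing..."',
    dag=dag,
)
wait_for_file >> process_file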
Airflow Architecture
The main components (a minimal configuration sketch follows this list):
- metadata database: stores the state of every task and DAG
- scheduler: decides which tasks should run, based on the DAG definitions and the current task states
- executor: a message-queue-backed component (e.g. Celery) that decides which worker runs which task
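A hypothetical airflow.cfg sketch showing how these three pieces are wired together (the connection strings are placeholders, not from the original article; for Airflow 2.0-2.2 these keys live in the sections shown):

# airflow.cfg (excerpt, placeholder values)
[core]
# which executor hands tasks to workers
executor = CeleryExecutor
# the metadata database that stores task/DAG state
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow

[celery]
# message broker the Celery executor uses to dispatch tasks to workers
broker_url = redis://localhost:6379/0
result_backend = db+postgresql://airflow:airflow@localhost/airflow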
Usage
Basic structure
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def greet():
    t = datetime.now().strftime("%Y-%m-%d %H:%M")
    print("Greeting @ " + str(t))
    return "end"

default_args = {
    "owner": "dongshaoyang",
    "start_date": datetime(2021, 9, 20),
    "retries": 1,
}

dag = DAG(
    dag_id="test_article",
    default_args=default_args,
    schedule_interval='0 8 * * *',
    tags=['once'],
    catchup=False)

t1 = BashOperator(task_id="bash_op", bash_command="echo 'Hello World, today is {{ ds }}'", dag=dag)
python_greet = PythonOperator(task_id='run_python_task', python_callable=greet, dag=dag)
sleep_op = BashOperator(task_id='sleep_me', bash_command='sleep 5', dag=dag)
run_this_last = DummyOperator(task_id='run_this_last', dag=dag)

t1 >> python_greet >> sleep_op >> run_this_last
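Once this file sits in the dags folder, a single task can be smoke-tested without waiting for the schedule; with the Airflow 2.x CLI this would be, for example, airflow tasks test test_article bash_op 2021-09-20, which runs the task for that execution date without recording state in the metadata database.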
Using sensors
Typical scenarios: a dependency between two different DAGs, or a dependency between tasks that belong to two different DAGs.
For example: the DAG trade may only start after task2 in the crm_test_dag DAG has finished successfully. This is exactly what ExternalTaskSensor is for. When using it, pay attention to execution_delta: its value is the difference between the schedule times of the two DAGs.
# DAG file 1: crm_test_dag (upstream), scheduled daily at 21:50
dag = DAG('crm_test_dag', default_args=default_args, schedule_interval='50 21 * * *', concurrency=1, catchup=True)
t2 = BashOperator(
    task_id='task2',
    bash_command='sleep 5',
    retries=0,
    dag=dag,
)

# DAG file 2: trade (downstream), scheduled daily at 08:50, i.e. 11 hours after crm_test_dag
from airflow.sensors.external_task_sensor import ExternalTaskSensor

with DAG('trade', default_args=airflow_utli.default_args, schedule_interval='50 8 * * *', concurrency=12, catchup=False, sla_miss_callback=airflow_utli.default_sla_callback) as dag:
    check_dependency = ExternalTaskSensor(
        task_id='bba_sensor',
        external_dag_id='crm_test_dag',
        external_task_id='task2',
        execution_delta=timedelta(hours=11),
        dag=dag,
        timeout=60)
Note: timeout must always be set. A sensor runs as an ordinary task, and Airflow only allows a limited number of tasks to run on a single worker instance; without a timeout the sensor would keep running and occupy execution slots that other tasks need. timeout is in seconds.
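Related to this slot-occupation problem, and not covered in the original text: recent Airflow versions let a sensor run in reschedule mode, which releases the worker slot between checks instead of holding it. A hedged sketch, reusing the sensor above with the standard BaseSensorOperator arguments:

check_dependency = ExternalTaskSensor(
    task_id='bba_sensor',
    external_dag_id='crm_test_dag',
    external_task_id='task2',
    execution_delta=timedelta(hours=11),
    mode='reschedule',   # free the slot between checks instead of blocking on it
    poke_interval=300,   # re-check every 5 minutes
    timeout=3600,        # still give up eventually (seconds)
    dag=dag)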
Tricks
1. When several tasks share a dependency, use a list
# Using Lists (being a PRO :-D )
task_one >> task_two >> [task_two_1, task_two_2, task_two_3] >> end
2. Define the tasks of a DAG with the with context-manager pattern, which makes the structure clearer
# DAG with Context Manager
args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}
with DAG(dag_id='example_dag', default_args=args, schedule_interval='0 0 * * *') as dag:
    run_this_last = DummyOperator(task_id='run_this_last')
    run_this_first = BashOperator(task_id='run_this_first', bash_command='echo 1')
    run_this_first >> run_this_last
3. Some commonly used macros
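The original list of macros ends here; as a hedged illustration, these are a few macros that Airflow's Jinja templating layer provides out of the box (the example values in the comments assume an execution date of 2021-09-20):

# templated fields such as bash_command are rendered with Jinja at runtime
print_dates = BashOperator(
    task_id='print_dates',
    bash_command=(
        'echo "{{ ds }}"          # execution date, e.g. 2021-09-20\n'
        'echo "{{ ds_nodash }}"   # same date without dashes: 20210920\n'
        'echo "{{ prev_ds }}"     # previous schedule date\n'
        'echo "{{ macros.ds_add(ds, 7) }}"  # date arithmetic: ds + 7 days\n'
    ),
    dag=dag)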