Getting Started with Airflow


Airflow Basic Concepts

Airflow is a workflow management system. A workflow here is a series of tasks together with the dependencies between them, visualized as a directed acyclic graph (DAG), in which each box represents a task and each edge represents a dependency between tasks.

In addition, a DAG has attributes of its own, such as dag_id, start_date, schedule_interval, and default_arguments. These attributes can be defined when the DAG is initialized, and they are inherited by every task in the DAG:

# imports used by this snippet
from datetime import datetime, timedelta
from airflow import DAG

# each Workflow/DAG must have a unique text identifier
WORKFLOW_DAG_ID = 'example_workflow_dag'

# start/end times are datetime objects
# here we start execution on Jan 1st, 2017
WORKFLOW_START_DATE = datetime(2017, 1, 1)

# schedule/retry intervals are timedelta objects
# here we execute the DAG's tasks every day
WORKFLOW_SCHEDULE_INTERVAL = timedelta(1)

# default arguments are applied by default to all tasks
# in the DAG
WORKFLOW_DEFAULT_ARGS = {
    'owner': 'example',
    'depends_on_past': False,
    'start_date': WORKFLOW_START_DATE,
    'email': ['example@example_company.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 5,
    'retry_delay': timedelta(minutes=5)
}

# initialize the DAG
dag = DAG(
    dag_id=WORKFLOW_DAG_ID,
    start_date=WORKFLOW_START_DATE,
    schedule_interval=WORKFLOW_SCHEDULE_INTERVAL,
    default_args=WORKFLOW_DEFAULT_ARGS,
)

 

Tasks fall into two types (a short example follows the list below):

  • Operators: perform a concrete action
  • Sensors: check the state of a process or a piece of data
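A rough sketch of the difference, reusing the dag object initialized above. The file path, the poke_interval value, and the FileSensor import location are assumptions on my part and may differ between Airflow versions:

# assumes the Airflow 1.10-style module layout used elsewhere in this article
from airflow.operators.bash_operator import BashOperator
from airflow.contrib.sensors.file_sensor import FileSensor

# Sensor: poke every 60 seconds until the (made-up) file /tmp/data_ready.csv exists
wait_for_file = FileSensor(
    task_id='wait_for_file',
    filepath='/tmp/data_ready.csv',
    poke_interval=60,
    dag=dag,
)

# Operator: perform a concrete action once the sensor has succeeded
process_file = BashOperator(
    task_id='process_file',
    bash_command='wc -l /tmp/data_ready.csv',
    dag=dag,
)

wait_for_file >> process_file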

 

Airflow Architecture


The metadata database stores the state of tasks and DAGs.

The scheduler decides, based on the DAG definitions and the task states, which tasks should be executed.

The executor (typically backed by a message queue such as Celery) decides which worker will execute which task.

Usage

Basic Structure

​
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def greet():
    t = datetime.now().strftime("%Y-%m-%d %H:%M")
    print("Greeting @ " + str(t))
    return "end"

default_args = {
    "owner": "dongshaoyang",
    "start_date": datetime(2021, 9, 20),
    "retries": 1,
}

dag = DAG(
        dag_id="test_article",
        default_args=default_args,
        schedule_interval='0 8 * * *',
        tags=['once'],
        catchup=False)

t1 = BashOperator(task_id="bash_op", bash_command="echo 'Hello World, today is {{ ds }}'", dag=dag)

python_greet = PythonOperator(task_id='run_python_task', python_callable=greet, dag=dag)

sleep_op = BashOperator(task_id='sleep_me', bash_command='sleep 5', dag=dag)

run_this_last = DummyOperator(task_id='run_this_last', dag=dag)

t1 >> python_greet >> sleep_op >> run_this_last
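Two things worth noting in this example: the {{ ds }} in bash_command is a Jinja template that Airflow renders to the execution date, and the >> in the last line is shorthand for set_downstream, so the dependency chain could equivalently be written as:

# equivalent to: t1 >> python_greet >> sleep_op >> run_this_last
t1.set_downstream(python_greet)
python_greet.set_downstream(sleep_op)
sleep_op.set_downstream(run_this_last)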

 

Using Sensors

Typical scenarios: a dependency between two different DAGs, or a dependency between tasks belonging to two different DAGs.

For example: the DAG trade may only start executing after task2 in the DAG crm_test_dag has finished successfully. This is where ExternalTaskSensor comes in. When using it, pay attention to the execution_delta setting: its value is the difference between the scheduled times of the two DAGs.

# ExternalTaskSensor import (Airflow 1.10-style path)
from airflow.sensors.external_task_sensor import ExternalTaskSensor

# the upstream DAG: crm_test_dag, scheduled daily at 21:50
dag = DAG('crm_test_dag', default_args=default_args, schedule_interval='50 21 * * *', concurrency=1, catchup=True)

t2 = BashOperator(
    task_id='task2',
    bash_command='sleep 5',
    retries=0,
    dag=dag,
)

# the downstream DAG: trade, scheduled daily at 08:50, i.e. 11 hours later
with DAG('trade',
         default_args=airflow_utli.default_args,
         schedule_interval='50 8 * * *',
         concurrency=12,
         catchup=False,
         sla_miss_callback=airflow_utli.default_sla_callback) as dag:
    check_dependency = ExternalTaskSensor(
        task_id='bba_sensor',
        external_dag_id='crm_test_dag',
        external_task_id='task2',
        execution_delta=timedelta(hours=11),
        dag=dag,
        timeout=60)

Note: always set timeout. A sensor is executed as a task, and Airflow only allows a limited number of tasks to run on a single instance; without a timeout the sensor will keep running and occupy execution resources that other tasks need. timeout is specified in seconds.
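Besides timeout, sufficiently recent Airflow versions (1.10.2 and later, as far as I know) also support mode='reschedule' on sensors, which releases the worker slot between pokes instead of holding it for the whole wait. A sketch of the sensor above with that option; the poke_interval and timeout values here are only illustrative:

# same ExternalTaskSensor as above, but re-queued between checks instead of
# holding a worker slot (requires an Airflow version with sensor "modes")
check_dependency = ExternalTaskSensor(
    task_id='bba_sensor',
    external_dag_id='crm_test_dag',
    external_task_id='task2',
    execution_delta=timedelta(hours=11),
    mode='reschedule',    # free the worker slot between pokes
    poke_interval=300,    # check the upstream task every 5 minutes
    timeout=60 * 60,      # still fail after an hour so it cannot hang forever
    dag=dag,
)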

 

Tricks

1. For multi-task dependencies, use a list:


# Using Lists (being a PRO :-D ) 
task_one >> task_two >> [task_two_1, task_two_2, task_two_3] >> end
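Note that a list can only sit on one side of a single >>; putting lists on both sides is not supported. For a full fan-out/fan-in between two groups of tasks, cross_downstream can be used instead. In the sketch below the import path is the Airflow 1.10 location (it moved in later versions), and task_a through task_d are hypothetical task names:

from airflow.utils.helpers import cross_downstream

# every task in the first list becomes an upstream of every task in the second list
cross_downstream([task_a, task_b], [task_c, task_d])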

 

2. Define the tasks inside a DAG using the context manager (with) pattern, which makes the structure clearer:

# DAG with Context Manager
args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(2),
}

with DAG(dag_id='example_dag', default_args=args, schedule_interval='0 0 * * *') as dag:
    run_this_last = DummyOperator(task_id='run_this_last')
    run_this_first = BashOperator(task_id='run_this_first', bash_command='echo 1')

    run_this_first >> run_this_last

 

3. Some commonly used macros (see the official reference):

airflow.apache.org/docs/apache…
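For example, ds, ds_nodash, and the macros.ds_add helper come up constantly in templated fields. A small sketch reusing a dag object defined as in the earlier examples; the task itself is purely illustrative:

# a templated bash_command demonstrating a few commonly used macros
print_dates = BashOperator(
    task_id='print_dates',
    bash_command=(
        'echo "ds = {{ ds }}"; '                           # execution date, YYYY-MM-DD
        'echo "ds_nodash = {{ ds_nodash }}"; '             # execution date, YYYYMMDD
        'echo "a week ago = {{ macros.ds_add(ds, -7) }}"'  # date arithmetic helper
    ),
    dag=dag,
)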