The basics of Apache Airflow
Core Components
| name | description |
|---|---|
| Web server | Flask server with Gunicorn serving the UI |
| Scheduler | Daemon in charge of scheduling workflows |
| Metastore | Database where Airflow's metadata is stored |
| Executor | Class defining how your tasks should be executed |
| Worker | Process/sub process executing your task |
DAG
What is a DAG? DAG stands for Directed Acyclic Graph.
It is a graph with nodes and directed edges, but no loops.
Basically, in Airflow, a DAG is a data pipeline.
Your data pipeline is represented as a graph.
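To make this concrete, here is a minimal sketch of a DAG file; the DAG id, schedule and task names (extract/transform/load) are made up purely for illustration:
from datetime import datetime
from airflow import DAG
from airflow.operators.dummy import DummyOperator

# Three nodes (tasks) connected by directed edges, with no loops
with DAG("my_data_pipeline", start_date=datetime(2021, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    extract = DummyOperator(task_id="extract")
    transform = DummyOperator(task_id="transform")
    load = DummyOperator(task_id="load")
    extract >> transform >> load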
Operator
The second important core concept of Airflow is the operator.
# Pseudocode: the kind of job an operator encapsulates (connect to a database and insert data)
db = connect(host, credentials)
db.insert(sql_request)
What is an operator?
You can think of an operator as an object encapsulating the task, the job that you want to execute.
For example, you want to connect to your database and insert data in it.
You can use a special operator to do that.
So basically, an operator is a task in your DAG, in your data pipeline.
There are three different types of operators, and the first one is the action operator.
Action operators are operators in charge of executing something.
Next, you have the transfer operators, allowing you to transfer data from a source to a destination, like the Presto to MySQL transfer operator. There are tons of transfer operators available in Airflow, but keep in mind that a transfer operator is an operator transferring data from a source to a destination.
Last but not least, the sensors. A sensor allows you to wait for something to happen before moving forward, before getting completed. For example, if you want to wait for a file to land at a specific location in your file system, you can use the FileSensor.
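Here is a minimal sketch combining an action operator and a sensor; the DAG id, task ids and file path are assumptions made for this example:
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.filesystem import FileSensor

def _insert_data():
    # Placeholder for the job you want to execute (e.g. connect to a database and insert data)
    print("inserting data")

with DAG("operator_examples", start_date=datetime(2021, 1, 1), schedule_interval="@daily", catchup=False) as dag:
    # Sensor: wait for a file to land at a specific location before moving forward
    waiting_for_file = FileSensor(task_id="waiting_for_file", filepath="/tmp/data.csv", poke_interval=30)
    # Action operator: execute something, here a Python function
    inserting_data = PythonOperator(task_id="inserting_data", python_callable=_insert_data)
    waiting_for_file >> inserting_data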
What Airflow is not
Airflow is neither a data streaming solution nor a data processing framework.
How Airflow works
Component Interactions
First, the web server fetches some metadata from the meta database of Airflow in order to display the information corresponding to your DAGs, your task instances or your users on the user interface.
Next, the scheduler interacts with the meta database and the executor in order to trigger your DAGs and your tasks.
Finally, the executor also interacts with the meta database in order to update the status of the tasks that have just been completed.
The executor has an internal queue, and that is how your tasks are executed in order.
If you want to start scaling Airflow, in order to execute as many tasks as you want, you will need to move to another architecture: the multi-node architecture.
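As a rough sketch (the exact values depend on your installation), the executor used is defined in airflow.cfg, and moving to the multi-node architecture typically means changing it, for example to the Celery executor started later in this document with airflow celery worker:
[core]
# Single-node default in this setup: tasks run one after the other on the same machine
executor = SequentialExecutor
# Multi-node architecture (shown only for illustration): distribute tasks to workers on other machines
# executor = CeleryExecutor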
What happens when a DAG is triggered?
Once your DAG is in the folder of DAGs (the dags folder), both the scheduler and the web server will parse your DAG.
Once it is parsed, the scheduler verifies if the DAG is ready to be triggered, and if so, a DagRun object is created.
A DagRun object is nothing more than an instance of your DAG running at a given time.
That DagRun object is stored in the meta database of Airflow with the status running.
Next, if there is a task ready to be triggered in your DAG, the scheduler creates a TaskInstance object corresponding to your task with the status scheduled in the meta database of Airflow.
Next, the scheduler sends the TaskInstance object to the executor with the status queued. Once the executor is ready to run the task, the TaskInstance object gets the status running, and the executor updates the status of the task in the meta database of Airflow.
As soon as the task is completed, the executor updates the status of the task in the meta database again.
Finally, the scheduler verifies if the work is done. If there are no more tasks to execute, the DagRun object gets the status completed as well.
Then, last but not least, the web server updates the user interface.
Configure the parsing process
With the Scheduler:
min_file_process_interval
Number of seconds after which a DAG file is parsed. The DAG file is parsed every min_file_process_interval seconds, so updates to existing DAGs are reflected after this interval. Defaults to 30 seconds.
dag_dir_list_interval
How often (in seconds) to scan the DAGs directory for new files. Defaults to 5 minutes (300 seconds).
These 2 settings tell you that you may have to wait up to 5 minutes before a new DAG gets detected by the scheduler, and that each DAG file is then parsed every 30 seconds by default.
With the Webserver:
worker_refresh_interval
Number of seconds to wait before refreshing a batch of workers. 30 seconds by default.
This setting tells you that every 30 seconds, the web server checks for new DAGs in your dags folder.
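As a sketch, these settings live in the airflow.cfg file generated in AIRFLOW_HOME (the values below are the defaults described above); they can also be overridden with environment variables such as AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL:
[scheduler]
# How often (in seconds) each DAG file is re-parsed
min_file_process_interval = 30
# How often (in seconds) the dags folder is scanned for new files
dag_dir_list_interval = 300

[webserver]
# How often (in seconds) a batch of web server workers is refreshed
worker_refresh_interval = 30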
Installing Airflow 2.0
Step by step
CLI Commands
Installing Airflow
docker run -it --rm -p 8080:8080 python:3.8-slim /bin/bash
* Create and start a docker container from the Docker image python:3.8-slim and execute the command /bin/bash in order to have a shell session
python -V
* Print the Python version
export AIRFLOW_HOME=/usr/local/airflow
* Export the environment variable AIRFLOW_HOME used by Airflow to store the dags folder, logs folder and configuration file
env | grep airflow
* Check that the environment variable has been correctly exported
apt-get update -y && apt-get install -y wget libczmq-dev curl libssl-dev git inetutils-telnet bind9utils freetds-dev libkrb5-dev libsasl2-dev libffi-dev libpq-dev freetds-bin build-essential default-libmysqlclient-dev apt-utils rsync zip unzip gcc && apt-get clean
* Install all tools and dependencies that can be required by Airflow
useradd -ms /bin/bash -d ${AIRFLOW_HOME} airflow
* Create the user airflow with /bin/bash as its shell and set its home directory to the value of AIRFLOW_HOME
cat /etc/passwd | grep airflow
* Show the file /etc/passwd to check that the airflow user has been created
pip install --upgrade pip
* Upgrade pip (pip is already installed since we use the Docker image python:3.8-slim)
su - airflow
* Log in as the user airflow
python -m venv .sandbox
* Create the virtual env named sandbox
source .sandbox/bin/activate
* Activate the virtual environment sandbox
wget https://raw.githubusercontent.com/apache/airflow/constraints-2.0.2/constraints-3.8.txt
* Download the constraints file used to install the right versions of Airflow’s dependencies
pip install "apache-airflow[crypto,celery,postgres,cncf.kubernetes,docker]"==2.0.2 --constraint ./constraints-3.8.txt
* Install the version 2.0.2 of apache-airflow with all the subpackages defined between square brackets. (Notice that you can still add subpackages afterwards: you will use the same command with different subpackages, even if Airflow is already installed)
airflow db init
* Initialise the metadatabase
airflow scheduler &
* Start Airflow’s scheduler in background
airflow users create -u admin -p admin -r Admin -e admin@admin.com -f admin -l admin
* Create user
airflow webserver &
* Start Airflow’s webserver in background
Use Dockerfile
docker build -t airflow-basic .
* Build a docker image from the Dockerfile in the current directory (airflow-materials/airflow-basic) and name it airflow-basic
docker run --rm -d -p 8080:8080 airflow-basic
* Create and start a container from the image airflow-basic in detached mode (-d), mapping port 8080 and removing the container once it stops (--rm)
Dockerfile
# Base Image
FROM python:3.8-slim
LABEL maintainer="MarcLamberti"
# Arguments that can be set with docker build
ARG AIRFLOW_VERSION=2.0.2
ARG AIRFLOW_HOME=/opt/airflow
# Export the environment variable AIRFLOW_HOME where airflow will be installed
ENV AIRFLOW_HOME=${AIRFLOW_HOME}
# Install dependencies and tools
RUN apt-get update -yqq && \
apt-get upgrade -yqq && \
apt-get install -yqq --no-install-recommends \
wget \
libczmq-dev \
curl \
libssl-dev \
git \
inetutils-telnet \
bind9utils freetds-dev \
libkrb5-dev \
libsasl2-dev \
libffi-dev libpq-dev \
freetds-bin build-essential \
default-libmysqlclient-dev \
apt-utils \
rsync \
zip \
unzip \
gcc \
vim \
locales \
&& apt-get clean
COPY ./constraints-3.8.txt /constraints-3.8.txt
# Upgrade pip
# Create airflow user
# Install apache airflow with subpackages
RUN pip install --upgrade pip && \
useradd -ms /bin/bash -d ${AIRFLOW_HOME} airflow && \
pip install apache-airflow[postgres]==${AIRFLOW_VERSION} --constraint /constraints-3.8.txt
# Copy the entrypoint.sh from host to container (at path AIRFLOW_HOME)
COPY ./entrypoint.sh ./entrypoint.sh
# Set the entrypoint.sh file to be executable
RUN chmod +x ./entrypoint.sh
# Set the owner of the files in AIRFLOW_HOME to the user airflow
RUN chown -R airflow: ${AIRFLOW_HOME}
# Set the username to use
USER airflow
# Set workdir (it's like a cd inside the container)
WORKDIR ${AIRFLOW_HOME}
# Create the dags folder which will contain the DAGs
RUN mkdir dags
# Expose ports (just to indicate that this container needs to map port)
EXPOSE 8080
# Execute the entrypoint.sh
ENTRYPOINT [ "/entrypoint.sh" ]
entrypoint.sh
#!/usr/bin/env bash
# Initialise the metastore
airflow db init
# Run the scheduler in background
airflow scheduler &> /dev/null &
# Create user
airflow users create -u admin -p admin -r Admin -e admin@admin.com -f admin -l admin
# Run the web server in foreground (for docker logs)
exec airflow webserver
Quick Tour of Airflow CLI
docker ps
* Show running docker containers
docker exec -it container_id /bin/bash
* Execute the command /bin/bash in the container with ID container_id to get a shell session
pwd
* Print the current working directory
airflow db init
* Initialise the metadatabase
airflow db reset
* Reinitialize the metadatabase (Drop everything)
airflow db upgrade
* Upgrade the metadatabase (Latest schemas, values, ...)
airflow webserver
* Start Airflow’s webserver
airflow scheduler
* Start Airflow’s scheduler
airflow celery worker
* Start a Celery worker (Useful in distributed mode to spread tasks among nodes - machines)
airflow dags list
* Give the list of known DAGs (either the example DAGs or those in your dags folder)
ls
* Display the files/folders of the current directory
airflow dags trigger example_python_operator
* Trigger the dag example_python_operator with the current date as execution date
airflow dags trigger example_python_operator -e 2021-01-01
* Trigger the dag example_python_operator with a date in the past as execution date (This won’t trigger the tasks of that dag unless you set the option catchup=True in the DAG definition)
airflow dags trigger example_python_operator -e '2021-01-01 19:04:00+00:00'
* Trigger the dag example_python_operator with a date in the future (replace the date here with one about 2 minutes later than the current date displayed in the Airflow UI). The dag will be scheduled at that date.
airflow dags list-runs -d example_python_operator
* Display the history of example_python_operator’s dag runs
airflow tasks list example_python_operator
* List the tasks contained into the example_python_operator dag
airflow tasks test example_python_operator print_the_context 2021-01-01
* Allows you to test a task (print_the_context) from a given dag (example_python_operator here) without taking care of dependencies and past runs. Useful for debugging.