Best Practices Apache Airflow
A CLAIRVOYANT Story
www.clairvoyantsoft.com
Quick Poll
Robert Sanders
One of the early employees of Clairvoyant, Robert primarily works with clients to enable them along their big data journey. Robert has a deep background in web and enterprise systems, working on full-stack implementations and then focusing on data management platforms.
Shekhar Vemuri
Shekhar works with clients across various industries and helps define data strategy and lead the implementation of data engineering and data science efforts. He was co-founder and CTO of Blue Canary, a predictive analytics solution to help with student retention; Blue Canary was acquired by Blackboard in 2015.
About
Currently working on building a data security solution to help enterprises
discover, secure and monitor sensitive data in their environment.
Agenda
Why?
Scheduler Landscape
What is Apache Airflow?
● “Apache Airflow is an Open Source platform to programmatically Author, Schedule and Monitor workflows”
○ Workflows as Python Code (this is huge!!!!!)
○ Provides monitoring tools like alerts and a web interface
● Written in Python
● Apache Incubator Project
○ Joined the Apache Incubator in early 2016
○ https://github.com/apache/incubator-airflow/
Why use Apache Airflow?
Building Blocks
Executors
Single Node Deployment
Multi-Node Deployment
Defining a Workflow
# Library Imports
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime, timedelta
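Only the imports appear on this slide; a minimal sketch of how the rest of a workflow definition might look (the dag_id, schedule, and default_args values below are illustrative assumptions, not the presenters' actual code):

# Default arguments applied to every task in the DAG (illustrative values)
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# The DAG object defines the schedule and ties the tasks together
dag = DAG(
    'example_workflow',
    default_args=default_args,
    schedule_interval='@daily')

# A single task that runs a shell command
print_date = BashOperator(
    task_id='print_date',
    bash_command='date',
    dag=dag)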
Defining a Dynamic Workflow
dag = DAG('dag_id', …)

last_task = None
for i in range(1, 3):
    task = BashOperator(
        task_id='task' + str(i),
        bash_command="echo 'Task" + str(i) + "'",
        dag=dag)

    if last_task is None:
        last_task = task
    else:
        last_task.set_downstream(task)
        last_task = task
Operators
● Action Operators
○ BashOperator(bash_command)
○ SSHOperator(ssh_hook, ssh_conn_id, remote_host, command)
○ PythonOperator(python_callable=python_function)
● Transfer Operators
○ HiveToMySqlTransfer(sql, mysql_table, hiveserver2_conn_id, mysql_conn_id, mysql_preoperator, mysql_postoperator, bulk_load)
○ MySqlToHiveTransfer(sql, hive_table, create, recreate, partition, delimiter, mysql_conn_id, hive_cli_conn_id, tblproperties)
● Sensor Operators
○ HdfsSensor(filepath, hdfs_conn_id, ignored_ext, ignore_copying, file_size, hook)
○ HttpSensor(endpoint, http_conn_id, method, request_params, headers, response_check, extra_options)
● … and many more
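As a hedged illustration of how two of these operators can be combined into a small pipeline (the dag_id, task ids, command, and callable below are made-up placeholders, not the presenters' actual code):

from datetime import datetime
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def python_function():
    # Placeholder business logic for the Python step
    print("transforming data")

dag = DAG('operator_example',
          start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

# Action Operator: run a shell command
extract = BashOperator(
    task_id='extract',
    bash_command='echo "extracting data"',
    dag=dag)

# Action Operator: call a Python function
transform = PythonOperator(
    task_id='transform',
    python_callable=python_function,
    dag=dag)

# extract runs first; transform runs after it succeeds
extract.set_downstream(transform)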
First Use Case (Description)
Second Use Case (Description)
Second Use Case (Architecture)
Second Use Case (DAG)
(DAG screenshots: 100,000 ft view and 1,000 ft view)
Lessons Learned
● Support
● Documentation
● Bugs and Odd Behavior
● Monitor Performance with Charts
● Tune Retries
● Tune Parallelism
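A hedged sketch of where the retry and parallelism knobs mentioned above live; the values shown are illustrative, not recommendations:

from datetime import datetime, timedelta
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator

# Parallelism is tuned in airflow.cfg under [core] (illustrative values):
#   parallelism = 32              # task instances running across the whole installation
#   dag_concurrency = 16          # task instances running per DAG
#   max_active_runs_per_dag = 1   # concurrent DAG Runs per DAG

dag = DAG('retry_example',
          start_date=datetime(2018, 1, 1),
          schedule_interval='@daily')

# Retries are tuned per task (or once for all tasks via the DAG's default_args)
flaky_step = BashOperator(
    task_id='flaky_step',
    bash_command='exit 1',             # always fails here, purely to show the retry settings
    retries=3,                         # re-run up to 3 times before marking the task failed
    retry_delay=timedelta(minutes=5),  # wait 5 minutes between attempts
    dag=dag)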
Best Practices
Scaling & High Availability
High Availability for the Scheduler
Scheduler Failover Controller: https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
Deployment & Management
Security
Testing
● Airflow Unit Test Mode - Loads configurations from the unittests.cfg file
[core]
unit_test_mode = True
● Always, at the very least, ensure that the DAG is valid (this can be done as part of CI)
● Take it a step further with mock pipeline testing (with inputs and outputs), especially if your DAGs are broad
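A minimal sketch of the "DAG is valid" check mentioned above, runnable with pytest in CI (the dags/ folder path and test framework are assumptions):

from airflow.models import DagBag

def test_dag_integrity():
    # Load every DAG file and fail the build if any of them cannot be imported
    dag_bag = DagBag(dag_folder='dags/', include_examples=False)
    assert len(dag_bag.import_errors) == 0, \
        "DAG import failures: %s" % dag_bag.import_errors
    assert len(dag_bag.dags) > 0, "No DAGs were loaded from dags/"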