Airflow
Workflow management system
Ilias OKACHA
Index
- Workflows Management Systems
- Architecture
- Building blocks
- More features
- User Interface
- Security
- CLI
- Demo
WTH is a Workflow Management System ?
A workflow management system is a data-centric software framework for:
- Setting up
- Performing
- Monitoring
a defined sequence of processes and tasks.
Popular Workflow Management Systems
Airflow Architecture
Airflow architecture
SequentialExecutor / LocalExecutor
Airflow architecture
CeleryExecutor
Airflow architecture
HA + CeleryExecutor
Airflow architecture
● MesosExecutor : already available in contrib package
● KubernetesExecutor ??
Building blocks
DAGs :
- Directed Acyclic Graph
- A collection of all the tasks you want to run
- DAGs describe how to run a workflow
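A minimal sketch of a DAG file, assuming Airflow 1.x; the dag_id, schedule and default_args are illustrative:

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# The DAG object only describes the workflow; tasks (operators) are attached to it.
dag = DAG(
    dag_id='example_dag',
    default_args=default_args,
    schedule_interval='@daily',
)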
Building blocks
DAGs :
Building blocks
Operators :
- Describe a single task in a workflow
- Determine what actually gets done
- Operators generally run independently (atomic)
- The DAG makes sure that operators run in the correct order
- They may run on completely different machines
Building blocks
Operators : There are 3 main types of operators:
● Operators that perform an action, or tell another system to perform an action
● Transfer operators move data from one system to another
● Sensors are a type of operator that keeps running until a certain criterion is met.
○ Examples include a specific file landing in HDFS or S3,
○ a partition appearing in Hive,
○ or a specific time of the day.
Operators : - Action operators :
- BashOperator
- PythonOperator
- EmailOperator
- HTTPOperator
- MySqlOperator
- SqliteOperator
- PostgresOperator
- MsSqlOperator
- OracleOperator
- JdbcOperator
- DockerOperator
- HiveOperator
- SlackOperator
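A small sketch of two action operators attached to the dag object from the earlier DAG example (the command and callable are illustrative):

from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

print_date = BashOperator(
    task_id='print_date',
    bash_command='date',          # runs a shell command on the worker
    dag=dag,
)

def greet():
    print('hello from Airflow')

greet_task = PythonOperator(
    task_id='greet',
    python_callable=greet,        # runs arbitrary Python code
    dag=dag,
)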
Building blocks
Operators : - Transfers :
- S3FileTransferOperator
- PrestoToMysqlOperator
- MySqlToHiveTransfer
- S3ToHiveTransfer
- BigQueryToCloudStorageOperator
- GenericTransfer
- HiveToDruidTransfer
- HiveToMySqlTransfer
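A hedged sketch of a transfer using GenericTransfer from Airflow 1.x; the connection ids, SQL and table name are placeholders:

from airflow.operators.generic_transfer import GenericTransfer

# Pulls rows from the source connection and inserts them into a table
# reachable through the destination connection.
copy_orders = GenericTransfer(
    task_id='mysql_to_postgres',
    sql='SELECT * FROM orders WHERE ds = {{ ds }}',
    source_conn_id='mysql_default',
    destination_conn_id='postgres_default',
    destination_table='orders_staging',
    dag=dag,
)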
Building blocks
Operators : - Sensors :
- ExternalTaskSensor
- HdfsSensor
- HttpSensor
- MetastorePartitionSensor
- HivePartitionSensor
- S3KeySensor
- S3PrefixSensor
- SqlSensor
- TimeDeltaSensor
- TimeSensor
- WebHdfsSensor
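A sketch of a sensor, assuming the Airflow 1.10 S3KeySensor; the bucket, key and connection id are illustrative:

from airflow.sensors.s3_key_sensor import S3KeySensor

wait_for_flag = S3KeySensor(
    task_id='wait_for_done_flag',
    bucket_name='my-bucket',
    bucket_key='incoming/DONE.FLAG',
    aws_conn_id='aws_default',
    poke_interval=60,            # seconds between checks
    timeout=6 * 60 * 60,         # give up after 6 hours
    dag=dag,
)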
Building blocks
Operators :
Tasks : a parameterized instance of an operator
Building blocks
Task Instance : DAG + Task + point in time
- A specific run of a Task
- A task assigned to a DAG
- Has a State associated with a specific run of the DAG
- States : it could be
- running
- success
- failed
- skipped
- up for retry
- …
Building blocks
Workflows :
● DAG: a description of the order in which work should take place
● Operator: a class that acts as a template for carrying out some work
● Task: a parameterized instance of an operator
● Task Instance: a task that
○ Has been assigned to a DAG
○ Has a state associated with a specific run of the DAG
● By combining DAGs and Operators to create TaskInstances, you can build complex workflows.
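Putting it together, a sketch of wiring tasks into a workflow on the dag from the first example (task ids and commands are illustrative):

from airflow.operators.bash_operator import BashOperator

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
transform = BashOperator(task_id='transform', bash_command='echo transform', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

# The DAG defines the order; >> is shorthand for set_downstream().
extract >> transform >> load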
Building blocks
More features
- Features :
- Hooks
- Connections
- Variables
- XComs
- SLA
- Pools
- Queues
- Trigger Rules
- Branching
- SubDAGs
More features
Hooks :
- Interfaces to external platforms and databases :
- Hive
- S3
- MySQL
- PostgreSQL
- HDFS
- Pig
- …
- Act as building block for Operators
- Use Connections to retrieve authentication information
- Keep authentication details out of pipelines
More features
Connections :
Connection information for external systems is stored in the Airflow metadata database and managed in the UI
More features
Example : Hook + Connection
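A hedged sketch of a hook reading its credentials from a Connection (S3Hook from Airflow 1.x; the bucket, prefix and connection id are illustrative):

from airflow.hooks.S3_hook import S3Hook
from airflow.operators.python_operator import PythonOperator

def list_bucket(**context):
    # The hook looks up the 'aws_default' Connection in the metadata database,
    # so no credentials appear in the pipeline code.
    hook = S3Hook(aws_conn_id='aws_default')
    keys = hook.list_keys(bucket_name='my-bucket', prefix='incoming/')
    print(keys)

list_files = PythonOperator(
    task_id='list_s3_files',
    python_callable=list_bucket,
    provide_context=True,
    dag=dag,
)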
More features
Variables :
- A generic way to store and retrieve arbitrary content or settings as a simple key value store within Airflow.
- Variables can be listed, created, updated and deleted from the UI (Admin -> Variables), code or CLI.
- While your pipeline code definition and most of your constants and variables should be defined in code and stored in source control, it can be useful to have some variables or configuration items accessible and modifiable through the UI.
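A minimal sketch of reading and writing Variables (the keys and values are illustrative):

from airflow.models import Variable

Variable.set('ads_account_id', '1234567890')
account_id = Variable.get('ads_account_id')
# JSON variables can be deserialized on the fly; default_var covers a missing key.
config = Variable.get('pipeline_config', deserialize_json=True, default_var={})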
More features
XCom (cross-communication) :
● Lets tasks exchange messages, allowing shared state.
● Defined by a key, a value, and a timestamp.
● Also tracks attributes like the task/DAG that created the XCom and when it should become visible.
● Any object that can be pickled can be used as an XCom value.
XComs can be :
● Pushed (sent) :
○ By calling xcom_push()
○ When a task returns a value (from its operator's execute() method or from a PythonOperator's python_callable)
● Pulled (received) : by calling xcom_pull()
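A sketch of an explicit push and pull between two PythonOperator tasks (task ids and the key are illustrative; assumes the dag object from the first example):

from airflow.operators.python_operator import PythonOperator

def push_value(**context):
    # Explicit push; returning a value from the callable would also be
    # pushed automatically under the key 'return_value'.
    context['ti'].xcom_push(key='row_count', value=42)

def pull_value(**context):
    count = context['ti'].xcom_pull(task_ids='push_task', key='row_count')
    print('upstream produced %s rows' % count)

push = PythonOperator(task_id='push_task', python_callable=push_value,
                      provide_context=True, dag=dag)
pull = PythonOperator(task_id='pull_task', python_callable=pull_value,
                      provide_context=True, dag=dag)
push >> pull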
More features
SLA :
- Service Level Agreement: the time by which a task or DAG should have succeeded.
- Can be set at the task level as a timedelta.
- An alert email is sent detailing the list of tasks that missed their SLA.
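A sketch of setting an SLA on a task (the duration and email address are illustrative):

from datetime import timedelta
from airflow.operators.bash_operator import BashOperator

# Measured from the scheduled start of the DAG run; a miss is recorded
# and the alert email is sent to the task's email list.
load_warehouse = BashOperator(
    task_id='load_warehouse',
    bash_command='echo load',
    sla=timedelta(hours=2),
    email=['data-team@example.com'],
    dag=dag,
)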
More features
Pools :
- Some systems can get overwhelmed when too many processes hit them at the same time.
- Limit the execution parallelism on arbitrary sets of tasks.
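A sketch of assigning a task to a pool (assumes a pool named 'mysql_pool' was created in Admin -> Pools with a fixed number of slots):

from airflow.operators.bash_operator import BashOperator

heavy_query = BashOperator(
    task_id='heavy_query',
    bash_command='echo query',
    pool='mysql_pool',       # at most <pool slots> such tasks run concurrently
    dag=dag,
)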
More features
Pools :
Queues (CeleryExecutor only) :
- Every task can be assigned a specific queue name
- By default, both workers and tasks use the queue defined by default_queue
- Workers can be assigned multiple queues
- A very useful feature when specialized workers are needed (GPU, Spark…)
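A sketch of routing a task to a dedicated queue (the queue name and command are illustrative; a matching worker would be started with: airflow worker -q spark):

from airflow.operators.bash_operator import BashOperator

submit_spark_job = BashOperator(
    task_id='submit_spark_job',
    bash_command='spark-submit job.py',
    queue='spark',           # only workers listening on 'spark' pick this up
    dag=dag,
)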
More features
Trigger Rules:
Though the normal workflow behavior is to trigger tasks when all their directly upstream tasks have succeeded, Airflow allows for more complex
dependency settings.
All operators have a trigger_rule argument which defines the rule by which the generated task gets triggered. The default value for trigger_rule is all_success, which can be defined as "trigger this task when all directly upstream tasks have succeeded". All other rules described here are based on direct parent tasks and are values that can be passed to any operator while creating tasks:
● all_success: (default) all parents have succeeded
● all_failed: all parents are in a failed or upstream_failed state
● all_done: all parents are done with their execution
● one_failed: fires as soon as at least one parent has failed; it does not wait for all parents to be done
● one_success: fires as soon as at least one parent succeeds; it does not wait for all parents to be done
● dummy: dependencies are just for show, trigger at will
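A sketch of overriding the default rule so a cleanup task runs whatever happened upstream (the task id and command are illustrative):

from airflow.operators.bash_operator import BashOperator
from airflow.utils.trigger_rule import TriggerRule

cleanup = BashOperator(
    task_id='cleanup',
    bash_command='echo cleanup',
    trigger_rule=TriggerRule.ALL_DONE,   # equivalent to the string 'all_done'
    dag=dag,
)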
User Interface
DAGs view :
User Interface
Tree view :
User Interface
Graph view :
User Interface
Gantt view :
User Interface
Task duration :
User Interface
Data Profiling : SQL Queries
User Interface
Data Profiling : Charts
User Interface
Data Profiling : Charts
CLI
https://airflow.apache.org/cli.html
airflow variables [-h] [-s KEY VAL] [-g KEY] [-j] [-d VAL] [-i FILEPATH] [-e FILEPATH] [-x KEY]
airflow connections [-h] [-l] [-a] [-d] [--conn_id CONN_ID]
[--conn_uri CONN_URI] [--conn_extra CONN_EXTRA]
[--conn_type CONN_TYPE] [--conn_host CONN_HOST]
[--conn_login CONN_LOGIN] [--conn_password CONN_PASSWORD]
[--conn_schema CONN_SCHEMA] [--conn_port CONN_PORT]
airflow pause [-h] [-sd SUBDIR] dag_id
airflow test [-h] [-sd SUBDIR] [-dr] [-tp TASK_PARAMS] dag_id task_id execution_date
airflow backfill [-h] [-t TASK_REGEX] [-s START_DATE] [-e END_DATE] dag_id
airflow clear DAG_ID
airflow resetdb [-h] [-y]
...
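For example, testing a single task for one execution date, then backfilling a date range (dag and task ids are illustrative):
airflow test example_dag print_date 2018-01-01
airflow backfill example_dag -s 2018-01-01 -e 2018-01-07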
Security
By default, all access is open.
Supports :
● Web authentication with :
○ Password
○ LDAP
○ Custom auth
○ Kerberos
○ OAuth
■ GitHub Enterprise Authentication
■ Google Authentication
● Impersonation (run as another $USER)
● Secure access via SSL
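A sketch of creating a user for the built-in password backend, following the Airflow 1.x docs (username, email and password are illustrative; assumes authenticate = True and the password auth_backend are configured in airflow.cfg):

from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser

user = PasswordUser(models.User())
user.username = 'analyst'
user.email = 'analyst@example.com'
user.password = 'change_me'          # hashed by the PasswordUser model

session = settings.Session()
session.add(user)
session.commit()
session.close()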
Demo
1. Facebook Ads insights data pipeline.
2. Run a PySpark script on an ephemeral Dataproc cluster only when the S3 input data is available.
3. Useless workflow : Hook + Connection + Operators + Sensors + XCom + (SLA) :
○ List S3 files (hooks)
○ Share state with the next task (XCom)
○ Write content to S3 (hooks)
○ Resume the workflow when an S3 DONE.FLAG file is ready (sensor)
Resources
https://airflow.apache.org
http://www.clairvoyantsoft.com/assets/whitepapers/GuideToApacheAirflow.pdf
https://speakerdeck.com/artwr/apache-airflow-at-airbnb-introduction-and-lessons-learned
https://www.slideshare.net/sumitmaheshwari007/apache-airflow
Thanks
