Airflow
The Complete Hands-On Course
강석우 pko89403@gmail.com
Why
AIRFLOW
What is Airflow?
• A platform to programmatically author, schedule, and monitor data pipelines
• Components: Web Server, Scheduler, Executor, Worker, Metadata Database
• Key concepts: DAG, Operator, Task, TaskInstance, Workflow
Airflow Architecture
• Airflow Webserver: serves the UI dashboard over HTTP
• Airflow Scheduler: a daemon that monitors DAGs and triggers the tasks whose dependencies are met
• Airflow Worker: the process that wraps and runs a task
• Metadata Database: stores information about the state of tasks
• Executor: a message-queuing process that is bound to the scheduler and determines which worker processes execute the scheduled tasks
[Diagram: Airflow Webserver, Airflow Scheduler, three Workers, Meta DB, Logs, DAGs]
How Airflow works
• 1. The scheduler reads the DAG folder.
• 2. Your DAG is parsed by a process that creates a DagRun based on the DAG's scheduling parameters.
• 3. A TaskInstance is instantiated for each Task that needs to be executed and is flagged “Scheduled” in the metadata database.
• 4. The scheduler gets all TaskInstances flagged “Scheduled” from the metadata database, changes their state to “Queued”, and sends them to the executor to be executed.
• 5. The executor pulls tasks from the queue (depending on your execution setup) and changes their state from “Queued” to “Running”; workers then start executing the TaskInstances.
• 6. When a task finishes, the executor changes its state to the final state (success, failed, etc.) in the database, and the scheduler updates the DagRun with the state “Success” or “Failed”. Meanwhile, the web server periodically fetches data from the metadata database to update the UI.
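The six steps above can be sketched as a small state machine. This is a toy model in plain Python, not Airflow's scheduler or executor code: the `TaskInstance` dataclass, the `run_once` helper, and the in-memory `db` list are hypothetical names used only for illustration, while the state names follow the steps above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TaskInstance:
    """Hypothetical stand-in for one task run tracked in the metadata DB."""
    task_id: str
    action: Callable[[], None]
    state: str = "scheduled"  # step 3: flagged "Scheduled"


def run_once(metadata_db: List[TaskInstance]) -> str:
    """Walk every scheduled task instance through the states in steps 4-6."""
    # Step 4: the scheduler picks up "scheduled" instances and queues them.
    queued = [ti for ti in metadata_db if ti.state == "scheduled"]
    for ti in queued:
        ti.state = "queued"

    # Step 5: the executor takes queued instances and a worker runs them.
    for ti in queued:
        ti.state = "running"
        try:
            ti.action()
            ti.state = "success"  # step 6: final state on success
        except Exception:
            ti.state = "failed"   # step 6: final state on error

    # Step 6: the DagRun state is derived from its task instances.
    all_ok = all(ti.state == "success" for ti in metadata_db)
    return "success" if all_ok else "failed"


def boom() -> None:
    raise RuntimeError("simulated task failure")


db = [TaskInstance("extract", lambda: None), TaskInstance("load", boom)]
dag_run_state = run_once(db)
print([(ti.task_id, ti.state) for ti in db], dag_run_state)
```

In this sketch one failed task instance is enough to mark the whole run as failed; the real scheduler applies more detailed rules.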
Start Airflow (from install to UI)
• AIRFLOW_GPL_UNICODE=yes pip install "apache-airflow[celery,crypto,postgres,hive,rabbitmq,redis]"
• airflow initdb
• airflow upgradedb
• ls
• cd airflow
• grep dags_folder airflow.cfg
• mkdir -p /home/airflow/airflow/dags
• ls
• vim airflow.cfg (configuration file)
• load_examples = False
• airflow resetdb
• airflow scheduler
• airflow webserver
Quick Tour of Airflow
• airflow list_dags
• airflow list_tasks {dag name} --tree
• airflow test {dag name} python_task {execution date}
• airflow -h
What is a DAG?
• A finite directed graph with no directed cycles
• A DAG represents a collection of tasks to run, organized in a way that represents their dependencies and relationships
• Each node is a Task
• Each edge is a Dependency
• How should the workflow be executed?
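The "no directed cycles" property is what makes a run order possible. A minimal sketch of that idea, using Kahn's algorithm in plain Python: the function name `topological_order` and the example task names are assumptions for illustration, and this is not how Airflow itself validates a DAG.

```python
from collections import deque
from typing import Dict, List


def topological_order(edges: Dict[str, List[str]]) -> List[str]:
    """Return tasks in a valid run order; raise ValueError if there is a cycle.

    `edges` maps each task to the tasks that depend on it (its downstream).
    """
    indegree = {node: 0 for node in edges}
    for downstream in edges.values():
        for node in downstream:
            indegree[node] = indegree.get(node, 0) + 1

    # Start with tasks that have no upstream dependency.
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in edges.get(node, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)

    # Any task left unvisited sits on a cycle.
    if len(order) != len(indegree):
        raise ValueError("graph has a cycle, so it is not a DAG")
    return order


workflow = {"extract": ["transform"], "transform": ["load"], "load": []}
print(topological_order(workflow))
```

If an edge from `load` back to `extract` were added, no task would ever become ready and the function would raise instead of returning an order.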
A DAG's important properties
• Defined in Python files placed in Airflow's DAG_FOLDER (usually ~/airflow/dags)
• dag_id
• description
• start_date
• schedule_interval
• depends_on_past: run a task only if its instance in the previous DagRun completed successfully
• default_args: keyword parameters passed to the constructor when initializing operators
What is an Operator?
• Determines what actually gets done.
• Operators are usually (but not always) atomic, meaning they can stand on their own and don't need to share resources with any other operators.
• The definition of a single task
• Should be idempotent (always produces the same result)
• A task is created by instantiating an Operator class
• An operator defines the nature of the task and how it should be executed
• Once an operator is instantiated, the task becomes a node in your DAG.
Many Operators
• BashOperator
• PythonOperator
• EmailOperator (sends an email)
• SqlOperator (executes a SQL command)
• All operators inherit from BaseOperator
• 3 types of operators
    • Action operators, which perform an action (BashOperator, PythonOperator, EmailOperator, …)
    • Transfer operators, which move data from one system to another (SqlOperator, SftpOperator)
    • Sensor operators, which wait for data to arrive at a defined location
Operator ++
• Transfer Operators
    • Move data from one system to another
    • Data is pulled out of the source, staged on the machine where the executor is running, and then transferred to the target system
    • Don't use them if you are dealing with a large amount of data
• Sensor Operators
    • Inherit from BaseSensorOperator
    • Useful for monitoring external processes, such as waiting for files to be uploaded to HDFS or for a partition to appear in Hive
    • Basically a long-running task
    • A sensor operator has a poke method that is called repeatedly until it returns True (the method used for monitoring the external process)
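The poke loop described above can be sketched in a few lines. This is a plain-Python illustration, not Airflow's BaseSensorOperator: the `wait_for` helper and the `file_has_arrived` check are hypothetical, and the `poke_interval` and `timeout` arguments only mirror the idea of waiting between checks and giving up after a limit.

```python
import time
from typing import Callable


def wait_for(poke: Callable[[], bool], poke_interval: float, timeout: float) -> bool:
    """Call `poke` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if poke():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Hypothetical external condition: becomes true on the third check.
attempts = {"count": 0}


def file_has_arrived() -> bool:
    attempts["count"] += 1
    return attempts["count"] >= 3


arrived = wait_for(file_has_arrived, poke_interval=0.01, timeout=1.0)
print(arrived, attempts["count"])
```

Because the loop occupies its caller for the whole wait, this also shows why a sensor is "basically a long-running task".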
Make Dependencies in Python
• set_upstream()
• set_downstream()
• << ( = set_upstream )
• >> ( = set_downstream )
[Diagram: A → B, A → C, B → D, C → D]
• B depends on A
• C depends on A
• D depends on B and C
( Example )
A.set_downstream(B)
A >> B
A >> [B, C] >> D
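To see why the bitshift form is just shorthand for set_upstream() and set_downstream(), here is a minimal sketch with operator overloading. The `Task` class below is a hypothetical stand-in written for this example, not Airflow's BaseOperator; it only records upstream and downstream task ids.

```python
from typing import List, Set, Union


class Task:
    """Hypothetical minimal task node; not Airflow's BaseOperator."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.upstream: Set[str] = set()
        self.downstream: Set[str] = set()

    @staticmethod
    def _as_list(other: Union["Task", List["Task"]]) -> List["Task"]:
        return other if isinstance(other, list) else [other]

    def set_downstream(self, other: Union["Task", List["Task"]]) -> None:
        for task in self._as_list(other):
            self.downstream.add(task.task_id)
            task.upstream.add(self.task_id)

    def set_upstream(self, other: Union["Task", List["Task"]]) -> None:
        for task in self._as_list(other):
            task.set_downstream(self)

    def __rshift__(self, other):   # self >> other
        self.set_downstream(other)
        return other

    def __lshift__(self, other):   # self << other
        self.set_upstream(other)
        return other

    def __rrshift__(self, other):  # [t1, t2] >> self
        self.set_upstream(other)
        return self


a, b, c, d = (Task(name) for name in "ABCD")
a >> [b, c] >> d

print(sorted(b.upstream), sorted(c.upstream), sorted(d.upstream))
```

The chain works because `a >> [b, c]` returns the list, and a plain list has no `>>` of its own, so Python falls back to `d.__rrshift__`.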
How the Scheduler Works
• DagRun
    • A DAG consists of tasks and needs those tasks to run
    • When the scheduler parses a DAG, it automatically creates a DagRun, which is an instantiation of the DAG in time according to its start_date and schedule
• Backfill and Catchup
• Schedule interval
    • None
    • @once
    • @hourly
    • @daily
    • @weekly
    • @monthly
    • @yearly
    • A cron time string can also be used: * * * * * (minute 0-59, hour 0-23, day of the month 1-31, month 1-12, day of the week 0-7)
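How start_date, the schedule interval, and catchup interact can be sketched with plain datetime arithmetic. This is a simplified model under two assumptions: a run for an interval becomes due only once that interval has ended, and with catchup disabled only the most recent elapsed interval is run. The function name `due_execution_dates` is hypothetical and the sketch ignores cron expressions and time zones.

```python
from datetime import datetime, timedelta
from typing import List


def due_execution_dates(start_date: datetime, interval: timedelta,
                        now: datetime, catchup: bool = True) -> List[datetime]:
    """List execution dates whose schedule interval has fully elapsed by `now`."""
    dates = []
    execution_date = start_date
    # A run covering [execution_date, execution_date + interval) becomes due
    # only once that interval has ended.
    while execution_date + interval <= now:
        dates.append(execution_date)
        execution_date += interval
    # Without catchup, only the most recent elapsed interval is run.
    return dates if catchup else dates[-1:]


start = datetime(2020, 1, 1)
now = datetime(2020, 1, 4, 12)
daily = timedelta(days=1)

print(due_execution_dates(start, daily, now))
print(due_execution_dates(start, daily, now, catchup=False))
```

With a daily interval and a clock reading of noon on January 4, the runs for January 1, 2, and 3 are due, while the run for January 4 is not due until that day is over.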
Concurrency vs Parallelism
• Concurrent: the system can support two or more actions in progress at the same time
• Parallel: the system can support two or more actions executing simultaneously
• In concurrent systems, multiple actions can be in progress at the same time, though they may not all be executing at that moment
• In parallel systems, multiple actions are executed simultaneously
Database and Executor
• Sequential Executor (default executor, SQLite)
    • The default executor you get when you run Apache Airflow
    • Runs only one task at a time (sequential); useful for debugging
    • It is the only executor that can be used with SQLite, since SQLite doesn't support multiple writers
• Local Executor (PostgreSQL)
    • Can run multiple tasks at a time
    • Uses the Python multiprocessing library and queues to parallelize the execution of tasks
    • Runs tasks by spawning processes in a controlled fashion, in different modes, on the same machine
    • The number of processes to spawn can be tuned with the parallelism parameter
Database and Executor
• Celery Executor
    • Celery is a Python task-queue system
    • A task-queue system handles the distribution of tasks to workers across threads or network nodes
    • Tasks are pushed into a broker (RabbitMQ)
    • Celery workers pop them and schedule task executions
    • Recommended for production use of Airflow
    • Allows distributing the execution of task instances to multiple worker nodes (computers)
• Others: Dask, Mesos, Kubernetes, etc.
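The push-to-broker, pop-by-worker pattern can be illustrated with the standard library alone. This sketch is an analogy, not Celery or RabbitMQ: a `queue.Queue` plays the broker and three threads play the workers, whereas real Celery workers are separate processes that may run on other machines. All names here (`broker`, `worker`, `results`) are hypothetical.

```python
import queue
import threading
from typing import Callable, Dict, Tuple

broker: "queue.Queue[Tuple[str, Callable[[], int]]]" = queue.Queue()
results: Dict[str, int] = {}
results_lock = threading.Lock()


def worker() -> None:
    """Pop tasks from the broker until it is empty, then exit."""
    while True:
        try:
            task_id, func = broker.get_nowait()
        except queue.Empty:
            return
        value = func()
        with results_lock:
            results[task_id] = value
        broker.task_done()


# The "scheduler" side pushes tasks into the broker.
for n in range(5):
    broker.put((f"square_{n}", lambda n=n: n * n))

# Three workers drain the queue concurrently.
workers = [threading.Thread(target=worker) for _ in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()

print(sorted(results.items()))
```

Which worker handles which task is not fixed in advance, and that is the point of the pattern: adding workers increases throughput without changing the code that submits the tasks.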
Celery Executor, PostgreSQL and RabbitMQ Structure
Executor Architecture
[Diagram, Local Executor (single machine): Meta DB, Web Server, Scheduler + Worker]
[Diagram, Celery Executor: Meta DB, Web Server, Scheduler + three Workers, Celery]
Advanced Concepts
• SubDAG
    • Minimises repetitive patterns
    • The main DAG manages all the SubDAGs as normal tasks
    • SubDAGs must be scheduled the same as their parent DAG
• Hooks
    • Interfaces for interacting with external sources such as PostgreSQL, Spark, SFTP, …
XCom
• Lets tasks communicate (cross-communication: allows multiple tasks to exchange messages)
• Principally defined by a key, a value, and a timestamp
• XCom data can be “pushed” or “pulled”
• xcom_push()
    • If a task returns a value, an XCom containing that value is automatically pushed
• xcom_pull()
    • A task gets the message based on parameters such as “key”, “task_ids”, and “dag_id”
• Keys that are automatically given to XComs when they are pushed by being returned from
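The push and pull behaviour described above can be modelled with an in-memory dictionary. This is a sketch, not Airflow's XCom implementation, which stores entries in the metadata database: the module-level `xcom_push` and `xcom_pull` functions, the `run_task` helper, and the `_store` dictionary are hypothetical. The default key name `return_value` mirrors the key Airflow assigns to values returned from a task.

```python
from datetime import datetime, timezone
from typing import Any, Callable, Dict, Optional, Tuple

# Key used when a value is pushed implicitly by returning it from a task.
RETURN_KEY = "return_value"

# (dag_id, task_id, key) -> (value, timestamp)
_store: Dict[Tuple[str, str, str], Tuple[Any, datetime]] = {}


def xcom_push(dag_id: str, task_id: str, value: Any, key: str = RETURN_KEY) -> None:
    """Store a value together with the time it was pushed."""
    _store[(dag_id, task_id, key)] = (value, datetime.now(timezone.utc))


def xcom_pull(dag_id: str, task_ids: str, key: str = RETURN_KEY) -> Optional[Any]:
    """Fetch the value pushed by `task_ids`, or None if nothing was pushed."""
    entry = _store.get((dag_id, task_ids, key))
    return entry[0] if entry else None


def run_task(dag_id: str, task_id: str, func: Callable[[], Any]) -> None:
    """Run a task; a non-None return value is pushed automatically."""
    result = func()
    if result is not None:
        xcom_push(dag_id, task_id, result)


run_task("demo_dag", "count_rows", lambda: 42)
xcom_push("demo_dag", "count_rows", "/tmp/rows.csv", key="location")

print(xcom_pull("demo_dag", "count_rows"))
print(xcom_pull("demo_dag", "count_rows", key="location"))
```

The two pulls read different entries from the same task: the first uses the default key for the returned value, the second an explicitly named key.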
Branching
• Allows a DAG to choose between different paths according to the result of a specific task
• Use BranchPythonOperator
• When using branching, do not use the depends_on_past property
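Branching comes down to a callable that returns the id of the task to follow, with the paths not chosen being skipped. The sketch below models only that decision in plain Python; `run_branch`, `choose_path`, and the task ids are hypothetical names, and it does not use BranchPythonOperator itself.

```python
from typing import Callable, Dict, List


def run_branch(choose: Callable[[], str], branches: List[str]) -> Dict[str, str]:
    """Run only the branch whose task id `choose` returns; skip the others."""
    chosen = choose()
    if chosen not in branches:
        raise ValueError(f"unknown branch: {chosen}")
    return {task_id: ("success" if task_id == chosen else "skipped")
            for task_id in branches}


row_count = 0  # hypothetical result of an earlier task


def choose_path() -> str:
    """Pick the next task id based on the earlier task's result."""
    return "process_rows" if row_count > 0 else "send_empty_alert"


states = run_branch(choose_path, ["process_rows", "send_empty_alert"])
print(states)
```

The skipped state is also why depends_on_past fits poorly with branching: a task skipped in one run has no successful previous instance for the next run to depend on.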
Service Level Agreements (SLAs)
• An SLA is a contract between a service provider and the end user that defines the level of service expected from the service provider
• Defines what the end user will receive (must be received)
• A time relative to the execution_date of the task, not its start time (e.g. more than 30 minutes after the execution_date)
• Different from the ‘execution_timeout’ parameter, which stops the task and marks it as failed
Airflow tutorials hands_on

  • 2. Why
  • 4. What is Airflow?
      • Programmatically author, schedule, and monitor data pipelines
      • Components: Web Server, Scheduler, Executor, Worker, Metadata Database
      • Key concepts: DAG, Operator, Task, TaskInstance, Workflow
  • 5. Airflow Architecture
      • Airflow Webserver: serves the UI dashboard over HTTP
      • Airflow Scheduler: a daemon
      • Airflow Worker: a wrapper process that does the actual work of a task
      • Metadata Database: stores information about the state of tasks
      • Executor: a message queuing process that is bound to the scheduler and determines the worker processes that execute scheduled tasks
      • [Diagram: Airflow Webserver, Airflow Scheduler, three Workers, Meta DB, Logs, DAGs]
  • 6. How Airflow works
      1. The scheduler reads the DAG folder.
      2. Your DAG is parsed by a process that creates a DagRun based on the scheduling parameters of your DAG.
      3. A TaskInstance is instantiated for each Task that needs to be executed and is flagged “Scheduled” in the metadata database.
      4. The scheduler gets all TaskInstances flagged “Scheduled” from the metadata database, changes their state to “Queued”, and sends them to the executors to be executed.
      5. Executors pull Tasks from the queue (depending on your execution setup) and change their state from “Queued” to “Running”, and Workers start executing the TaskInstances.
      6. When a Task finishes, the Executor changes the state of that task to its final state (success, failed, etc.) in the database, and the scheduler updates the DagRun with the state “Success” or “Failed”. The web server periodically fetches data from the metadata database to update the UI.
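The task state transitions described in this slide can be sketched as a small state machine. This is a simplified illustration in plain Python, not Airflow code: the `ALLOWED` table and the `advance` helper are hypothetical names, and real Airflow tracks more states than the ones the slide mentions.

```python
# Hypothetical transition table covering only the states named in the slide.
ALLOWED = {
    None: {"scheduled"},               # TaskInstance created by the scheduler
    "scheduled": {"queued"},           # scheduler hands it to the executor
    "queued": {"running"},             # a worker picks it up
    "running": {"success", "failed"},  # final state written to the metadata DB
}


def advance(state, new_state):
    """Return new_state if the transition is allowed, else raise ValueError."""
    if new_state not in ALLOWED.get(state, set()):
        raise ValueError(f"illegal transition: {state!r} -> {new_state!r}")
    return new_state


# Walk one task instance through its lifecycle.
state = None
for next_state in ("scheduled", "queued", "running", "success"):
    state = advance(state, next_state)
```

Trying to skip a step, for example `advance("scheduled", "running")`, raises `ValueError` in this sketch.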
  • 7. Start Airflow (from install to UI)
      • AIRFLOW_GPL_UNIDECODE=yes pip install "apache-airflow[celery, crypto, postgres, hive, rabbitmq, redis]"
      • airflow initdb
      • airflow upgradedb
      • ls
      • cd airflow
      • grep dags_folder airflow.cfg
      • mkdir -p /home/airflow/airflow/dags
      • ls
      • vim airflow.cfg (configuration file)
      • load_examples = False
      • airflow resetdb
      • airflow scheduler
      • airflow webserver
  • 8. Quick Tour of Airflow
      • airflow list_dags
      • airflow list_tasks {dag name} --tree
      • airflow test {dag name} python_task {execution date}
      • airflow -h
  • 9. What is a DAG?
      • A finite directed graph with no directed cycles
      • A DAG represents a collection of tasks to run, organized in a way that represents their dependencies and relationships
      • Each node is a Task
      • Each edge is a Dependency
      • How should the workflow be executed?
  • 10. DAG’s important properties
      • Defined in Python files placed in Airflow’s DAG_FOLDER (usually ~/airflow/dags)
      • dag_id
      • description
      • start_date
      • schedule_interval
      • depends_on_past: a task runs only if its instance in the previous DagRun completed successfully
      • default_args: keyword parameters passed to the constructor when initializing operators
  • 11. What is an Operator?
      • Determines what actually gets done
      • Operators are usually (but not always) atomic, meaning they can stand on their own and don’t need to share resources with any other operators
      • Definition of a single task
      • Should be idempotent (always produces the same result)
      • A Task is created by instantiating an Operator class
      • An operator defines the nature of the task and how it should be executed
      • Once an operator is instantiated, the task becomes a node in your DAG
  • 12. Many Operators
      • BashOperator
      • PythonOperator
      • EmailOperator (sends an email)
      • SqlOperator (executes a SQL command)
      • All operators inherit from BaseOperator
      • 3 types of operators
          • Action operators that perform an action (BashOperator, PythonOperator, EmailOperator …)
          • Transfer operators that move data from one system to another (sqlOperator, sftpOperator)
          • Sensor operators that wait for data to arrive at a defined location
  • 13. Operator ++
      • Transfer Operators
          • Move data from one system to another
          • Data is pulled from the source, staged on the machine where the executor is running, and then transferred to the target system
          • Don’t use them if you are dealing with a large amount of data
      • Sensor Operators
          • Inherit from BaseSensorOperator
          • Useful for monitoring external processes, such as waiting for files to be uploaded to HDFS or for a partition to appear in Hive
          • Basically long-running tasks
          • A sensor operator has a poke method that is called repeatedly until it returns True (the method used for monitoring the external process)
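The poke loop of a sensor can be sketched in plain Python as follows. This is a toy model, not Airflow's `BaseSensorOperator`: `ToySensor` and `CountdownSensor` are hypothetical classes, and the `poke_interval` and `timeout` arguments are only modeled loosely on the similarly named sensor settings.

```python
import time


class ToySensor:
    """Toy stand-in for a sensor: call poke() until it returns True."""

    def __init__(self, poke_interval=0.0, timeout=1.0):
        self.poke_interval = poke_interval  # seconds to wait between pokes
        self.timeout = timeout              # give up after this many seconds

    def poke(self):
        raise NotImplementedError

    def execute(self):
        started = time.monotonic()
        while not self.poke():
            if time.monotonic() - started > self.timeout:
                raise TimeoutError("condition was not met in time")
            time.sleep(self.poke_interval)


class CountdownSensor(ToySensor):
    """Pretends the external condition becomes true on the n-th poke."""

    def __init__(self, ready_after, **kwargs):
        super().__init__(**kwargs)
        self.ready_after = ready_after
        self.calls = 0

    def poke(self):
        self.calls += 1
        return self.calls >= self.ready_after


sensor = CountdownSensor(ready_after=3)
sensor.execute()  # returns once poke() has been called three times
```

A real sensor would check something external in `poke`, such as whether a file exists, instead of counting calls.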
  • 14. Make Dependencies in Python
      • set_upstream()
      • set_downstream()
      • << ( = set_upstream )
      • >> ( = set_downstream )
      • Example graph with tasks A, B, C, D
          • B depends on A
          • C depends on A
          • D depends on B and C
      • Example code
          • A.set_downstream(B)
          • A >> B
          • A >> [B, C] >> D
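To show how the `>>` and `<<` shortcuts can map onto `set_downstream` and `set_upstream`, here is a minimal sketch using a toy `Task` class. It does not import Airflow; the class and the `_as_list` helper are hypothetical, and only the method and operator names follow the slide. The ordering step uses `graphlib` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter  # Python 3.9+


def _as_list(item):
    # Accept a single task or a list/tuple of tasks.
    return list(item) if isinstance(item, (list, tuple)) else [item]


class Task:
    """Toy task that only records its upstream and downstream task ids."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = set()
        self.downstream = set()

    def set_downstream(self, others):
        for other in _as_list(others):
            self.downstream.add(other.task_id)
            other.upstream.add(self.task_id)

    def set_upstream(self, others):
        for other in _as_list(others):
            other.set_downstream(self)

    def __rshift__(self, others):   # self >> others
        self.set_downstream(others)
        return others

    def __lshift__(self, others):   # self << others
        self.set_upstream(others)
        return others

    def __rrshift__(self, others):  # [t1, t2] >> self
        self.set_upstream(others)
        return self


a, b, c, d = (Task(name) for name in "ABCD")
a >> [b, c] >> d  # B and C depend on A; D depends on B and C

# Map each task to its predecessors and compute one valid execution order.
graph = {t.task_id: t.upstream for t in (a, b, c, d)}
order = list(TopologicalSorter(graph).static_order())
```

In any valid order A comes first and D comes last; B and C may appear in either order between them.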
  • 15. How the Scheduler Works
      • DagRun
          • A DAG consists of Tasks and needs those tasks to run
          • When the scheduler parses a DAG, it automatically creates a DagRun, which is an instantiation of the DAG in time according to start_date and schedule
      • Backfill and Catchup
      • Schedule interval
          • None, @once, @hourly, @daily, @weekly, @monthly, @yearly
          • A cron time string can also be used: * * * * * = Minute (0-59), Hour (0-23), Day of the month (1-31), Month (1-12), Day of the week (0-7)
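As a rough illustration of how DagRuns follow from start_date and the schedule interval, the sketch below lists the execution dates that would be due at a given moment. It assumes the convention that a run for an interval is created only after that interval has ended, and that catchup creates one run per missed interval. `due_execution_dates` is a hypothetical helper written for this example, not an Airflow function.

```python
from datetime import datetime, timedelta


def due_execution_dates(start_date, interval, now):
    """Return the start of every interval that has fully elapsed by `now`."""
    dates = []
    current = start_date
    while current + interval <= now:
        dates.append(current)  # the run is stamped with the interval's start
        current += interval
    return dates


# A daily schedule starting 2019-01-01, inspected at midday on 2019-01-04.
runs = due_execution_dates(
    start_date=datetime(2019, 1, 1),
    interval=timedelta(days=1),
    now=datetime(2019, 1, 4, 12, 0),
)
```

Under these assumptions three runs are due, for January 1, 2 and 3; the January 4 interval has not finished yet, so it has no run.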
  • 16. Concurrency vs Parallelism
      • Concurrent: the system can support two or more actions in progress at the same time
      • Parallel: the system can support two or more actions executing simultaneously
      • In concurrent systems, multiple actions can be in progress (but not necessarily executing) at the same time
      • In parallel systems, multiple actions are executed simultaneously
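The difference can be made concrete with a small single-threaded sketch: two actions are both in progress, but only one step executes at any moment, which is concurrency without parallelism. The `action` and `run_concurrently` names are hypothetical and exist only for this illustration.

```python
def action(name, steps):
    """An action made of several steps; it pauses after each one."""
    for i in range(1, steps + 1):
        yield f"{name}{i}"


def run_concurrently(*actions):
    """Round-robin over the actions, executing one step at a time."""
    pending = list(actions)
    trace = []
    while pending:
        for act in list(pending):  # iterate over a copy while removing
            try:
                trace.append(next(act))
            except StopIteration:
                pending.remove(act)
    return trace


trace = run_concurrently(action("a", 2), action("b", 3))
# trace == ["a1", "b1", "a2", "b2", "b3"]
```

The interleaved trace shows both actions making progress over the same period, even though no two steps ever run simultaneously. A parallel system would instead execute steps of `a` and `b` at the same instant on different processors.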
  • 17. Database and Executor
      • Sequential Executor (default executor, SQLite)
          • The default executor you get when you run Apache Airflow
          • Runs only one task at a time (sequential); useful for debugging
          • It is the only executor that can be used with SQLite, since SQLite doesn’t support multiple writers
      • Local Executor (PostgreSQL)
          • Can run multiple tasks at a time
          • Uses the Python multiprocessing library and queues to parallelize the execution of tasks
          • Runs tasks by spawning processes in a controlled fashion, in different modes, on the same machine
          • The number of processes to spawn can be tuned with the parallelism parameter
  • 18. Database and Executor
      • Celery Executor
          • Celery is a Python task-queue system
          • A task-queue system handles the distribution of tasks to workers across threads or network nodes
          • Tasks need to be pushed into a broker (RabbitMQ)
          • Celery workers pop them and schedule task executions
          • Recommended for production use of Airflow
          • Allows distributing the execution of task instances across multiple worker nodes (computers)
      • ++ Dask, Mesos, Kubernetes … etc.
  • 19. Celery Executor, PostgreSQL and RabbitMQ Structure
  • 20. Executor Architecture
      • [Diagram, Local Executor (single machine): Meta DB, Web Server, Scheduler + Worker]
      • [Diagram, Celery Executor: Meta DB, Web Server, Scheduler + Worker, two further Workers, Celery]
  • 21. Advanced Concepts
      • SubDAG
          • Minimises repetitive patterns
          • The main DAG manages all the SubDAGs as normal tasks
          • SubDAGs must be scheduled the same as their parent DAG
      • Hooks
          • Interfaces for interacting with external sources such as PostgreSQL, Spark, SFTP …
  • 22. XCom
      • Lets tasks communicate (cross-communication; allows multiple tasks to exchange messages)
      • Principally defined by a key, a value and a timestamp
      • XCom data can be “pushed” or “pulled”
          • xcom_push()
              • If a task returns a value, an XCom containing that value is automatically pushed
          • xcom_pull()
              • The task gets the message based on parameters such as “key”, “task_ids” and “dag_id”
      • Keys that are automatically given to XComs when they are pushed by being returned from
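The push and pull idea can be sketched with a small in-memory store. This is not Airflow's XCom implementation: `ToyXCom` is a hypothetical class, a dictionary stands in for the metadata database, and the default key `"return_value"` is used here on the assumption that it matches the key Airflow assigns to returned values.

```python
from datetime import datetime, timezone


class ToyXCom:
    """In-memory stand-in for XCom rows: (dag_id, task_id, key) -> value."""

    def __init__(self):
        self._rows = {}

    def push(self, dag_id, task_id, value, key="return_value"):
        # Each message is stored with a key, a value and a timestamp.
        self._rows[(dag_id, task_id, key)] = (value, datetime.now(timezone.utc))

    def pull(self, dag_id, task_ids, key="return_value"):
        # Look up the message pushed by another task; None if nothing was pushed.
        row = self._rows.get((dag_id, task_ids, key))
        return None if row is None else row[0]


xcom = ToyXCom()

# Task "extract" returns a value, which is pushed automatically in this model.
xcom.push("example_dag", "extract", value=42)

# Task "extract" also pushes a message under an explicit key.
xcom.push("example_dag", "extract", value="s3://bucket/path", key="location")

# Task "load" pulls both messages.
row_count = xcom.pull("example_dag", task_ids="extract")
location = xcom.pull("example_dag", task_ids="extract", key="location")
```

The DAG id, task ids and values above are made up for the example.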
  • 23. Branching
      • Allows a DAG to choose between different paths according to the result of a specific task
      • Use BranchPythonOperator
      • When using branching, do not use the depends_on_past property
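The branching idea can be sketched without Airflow: a callable returns the id of the task to follow, and the other candidate tasks are skipped. `run_branch` and `choose_path` are hypothetical helpers and the task ids are made up; with BranchPythonOperator the callable plays the role of `choose_path`.

```python
def run_branch(choose, candidates):
    """Mark the chosen task id(s) as 'run' and all other candidates as 'skipped'."""
    chosen = choose()
    chosen = {chosen} if isinstance(chosen, str) else set(chosen)
    unknown = chosen - set(candidates)
    if unknown:
        raise ValueError(f"unknown task ids: {sorted(unknown)}")
    return {
        task_id: ("run" if task_id in chosen else "skipped")
        for task_id in candidates
    }


def choose_path(row_count):
    # Decide the path from the result of an earlier task (here: a row count).
    return "process_rows" if row_count > 0 else "notify_empty"


states = run_branch(
    lambda: choose_path(row_count=5),
    candidates=["process_rows", "notify_empty"],
)
# states == {"process_rows": "run", "notify_empty": "skipped"}
```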
  • 24. Service Level Agreements (SLAs)
      • An SLA is a contract between a service provider and the end user that defines the level of service expected from the service provider
      • Defines what the end user will receive (must receive)
      • The time is relative to the execution_date of the task, not its start time (e.g. more than 30 min from the execution date)
      • Different from the ‘execution_timeout’ parameter, which stops the task and marks it as failed