SlideShare a Scribd company logo
AIRFLOW - DATA FLOW
ENGINE FROM AIRBNB
By Walter Liu 2016/01/28
Who am I?
 Archer Architect of TrendMicro Coretech
backend
 A new user in Airflow
Why Data Flow Engine?
Cron Job
Direct Acyclic Graph (DAG)
Airflow - a data flow engine
Airflow - a data flow engine
Data relationships
 Data availability
 if the data is not there, trigger the process to
generate the data.
 Data dependency
 Some data relies on other data to generate.
Operability
 Job failed and resume
 Job monitor
 Backfill
Airflow
DAG structure as code
Airflow - a data flow engine
Airflow - a data flow engine
Airflow - a data flow engine
Airflow - a data flow engine
Demo
 python tutorial.py
 airflow list_dags
 airflow list_tasks tutorial
 airflow list_tasks tutorial --tree
 airflow test tutorial print_date 2015-06-01
 airflow test tutorial sleep 2015-06-01
 airflow run tutorial templated 2015-06-01
 Backfill: airflow backfill tutorial -s 2015-06-07 -e 2015-06-10
 Run again
 Run another date range
Site: http://localhost:8080/admin/
UI – DAG view
UI – Tree View
UI – Graph View
UI - Gantt
UI – Task duration
UI – Code
Airflow - Pros
 Dynamic generating path
 Have both Time scheduler and Command line trigger
 Has Master/Worker model (automatically distribute tasks)
 Scale if you have many tasks in a chain.
 But not be so useful to most of our tasks. Maybe useful for full dump.
 Fancy UI
 Dependencies of tasks
 Task success/failure
 Scheduled tasks status
 Has utility lib to wait_data for S3, Hadoop …
Airflow - Cons
 Additional DB/Redis or Rabbitmq for Celery
 HA design: Use RDBMS/redis-cache in AWS
 Require python 2.7 and many other libraries.
 Not dependent on data. Just task dependency.
(Not big cons)
 Write check data file code.
A snippet of the task of Luigi
Recap
 Solving DAG job management problem
 Features to improve daily operation
 Monitor/Visualization
 Job failure and resume
 Backfill
Backup slides
Other Selections?
 Oozie (on Hadoop)
 Luigi (by Spotify)
 Mario (Luigi in Scala)
 Ruigi (Luigi in R)
 Airflow (by Airbnb)
 Mistral (by OpenStack)
 Pinball (by Pinterest)
 …
Github statistics
Start Date Star Commits in
recent 2
months
Airflow 2014/10/05 1516 362
Luigi 2011/11/13 3777 105
Pinball 2015/05/01 476 0
Oozie 2011/08/28 167 14
Oozie
 Pros
 Mature
 Native support of Hadoop ECO system
 Python is not required.
 Cons
 Big and complex XML to define the flow and config.
 Control flow is somehow restrictive
 Hadoop ECO system is required.
 Unix box, Java, Hadoop, Pig, ExtJS library
Oozie XML Example
Luigi - Pros
 Makefile like, dependencies on data (Upward instead of downward)
 Command line trigger
 Not only Hadoop
 Support target: HDFS/S3/RDBMS/…. (not limited)
 Write your own code to support Casandra, etc.
 Dependencies are decentralized, no big XML.
 UI
 Dynamic dependency/task is supported. (Don’t know it is better than
airflow or not)
 [util] Date algebra
Luigi - Cons
 No built-in triggering
 No crontab like things (could borrow Chronos)
 Use cron to run tasks on specific nodes.
(normally)
Luigi - Notes
 Local execution
 Pros: Easy to debug
 Cons:
 need access to other system. (Well, other system didn’t achieve this either.)
 No scalability if you have many tasks in a chain.
 Centralized scheduler servers (optional, recommended for
production)
 Make sure two instances of the same task are not running
simultaneously
 Provide visualization of everything that’s going on.
 HA option
 Run 2+ tasks on 2+ machines at the same time
References
 http://erikbern.com/2015/07/02/more-luigi-
alternatives/
Ad

More Related Content

What's hot (20)

Airflow for Beginners
Airflow for BeginnersAirflow for Beginners
Airflow for Beginners
Varya Karpenko
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
Yohei Onishi
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
Ilias Okacha
 
Apache Airflow overview
Apache Airflow overviewApache Airflow overview
Apache Airflow overview
NikolayGrishchenkov
 
Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
mutt_data
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
Gerard Toonstra
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Yohei Onishi
 
Airflow 101
Airflow 101Airflow 101
Airflow 101
SaarBergerbest
 
Apache airflow
Apache airflowApache airflow
Apache airflow
Pavel Alexeev
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
BagustTriCahyo1
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
PyData
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
Kaxil Naik
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
Liangjun Jiang
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
Robert Sanders
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
Sid Anand
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Kaxil Naik
 
Observability: Beyond the Three Pillars with Spring
Observability: Beyond the Three Pillars with SpringObservability: Beyond the Three Pillars with Spring
Observability: Beyond the Three Pillars with Spring
VMware Tanzu
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basics
Juraj Hantak
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_on
pko89403
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
Brian Brazil
 
Building an analytics workflow using Apache Airflow
Building an analytics workflow using Apache AirflowBuilding an analytics workflow using Apache Airflow
Building an analytics workflow using Apache Airflow
Yohei Onishi
 
Airflow presentation
Airflow presentationAirflow presentation
Airflow presentation
Ilias Okacha
 
Introduction to Apache Airflow
Introduction to Apache AirflowIntroduction to Apache Airflow
Introduction to Apache Airflow
mutt_data
 
Apache Airflow Architecture
Apache Airflow ArchitectureApache Airflow Architecture
Apache Airflow Architecture
Gerard Toonstra
 
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Building a Data Pipeline using Apache Airflow (on AWS / GCP)
Yohei Onishi
 
How I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with AirflowHow I learned to time travel, or, data pipelining and scheduling with Airflow
How I learned to time travel, or, data pipelining and scheduling with Airflow
PyData
 
Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0Airflow Best Practises & Roadmap to Airflow 2.0
Airflow Best Practises & Roadmap to Airflow 2.0
Kaxil Naik
 
Apache Airflow Introduction
Apache Airflow IntroductionApache Airflow Introduction
Apache Airflow Introduction
Liangjun Jiang
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
Robert Sanders
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
Sid Anand
 
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Apache Airflow in the Cloud: Programmatically orchestrating workloads with Py...
Kaxil Naik
 
Observability: Beyond the Three Pillars with Spring
Observability: Beyond the Three Pillars with SpringObservability: Beyond the Three Pillars with Spring
Observability: Beyond the Three Pillars with Spring
VMware Tanzu
 
Prometheus - basics
Prometheus - basicsPrometheus - basics
Prometheus - basics
Juraj Hantak
 
Airflow tutorials hands_on
Airflow tutorials hands_onAirflow tutorials hands_on
Airflow tutorials hands_on
pko89403
 
Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)Prometheus (Prometheus London, 2016)
Prometheus (Prometheus London, 2016)
Brian Brazil
 

Similar to Airflow - a data flow engine (20)

Rounds analytics pipeline
Rounds analytics pipelineRounds analytics pipeline
Rounds analytics pipeline
Aviv Laufer
 
Capital onehadoopintro
Capital onehadoopintroCapital onehadoopintro
Capital onehadoopintro
Doug Chang
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
Jonathan Holloway
 
Building an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable ScaleBuilding an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable Scale
Merelda
 
Spring boot 3g
Spring boot 3gSpring boot 3g
Spring boot 3g
vasya10
 
Google Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teamsGoogle Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teams
Barton Rhodes
 
The 90-Day Startup with Google AppEngine for Java
The 90-Day Startup with Google AppEngine for JavaThe 90-Day Startup with Google AppEngine for Java
The 90-Day Startup with Google AppEngine for Java
David Chandler
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
eduarderwee
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Cloudera, Inc.
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
prevota
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
Hadoop User Group
 
Zeppelin at Twitter
Zeppelin at TwitterZeppelin at Twitter
Zeppelin at Twitter
Prasad Wagle
 
Application Architecture For The Cloud
Application Architecture For The CloudApplication Architecture For The Cloud
Application Architecture For The Cloud
Steve Loughran
 
Using and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersUsing and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted Waters
Vladimir Pavlov
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
Vitthal Gogate
 
Building Automated Data Pipelines with Airflow.pdf
Building Automated Data Pipelines with Airflow.pdfBuilding Automated Data Pipelines with Airflow.pdf
Building Automated Data Pipelines with Airflow.pdf
abhaykm804
 
Zeppelin at twitter (sf data science meetup, july 2016)
Zeppelin at twitter (sf data science meetup, july 2016)Zeppelin at twitter (sf data science meetup, july 2016)
Zeppelin at twitter (sf data science meetup, july 2016)
Prasad Wagle
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @Twitter
Kamran Munshi
 
Endofday: A Container Workflow Engine for Scalable, Reproducible Computation
Endofday: A Container Workflow Engine for Scalable, Reproducible ComputationEndofday: A Container Workflow Engine for Scalable, Reproducible Computation
Endofday: A Container Workflow Engine for Scalable, Reproducible Computation
Enis Afgan
 
Rounds analytics pipeline
Rounds analytics pipelineRounds analytics pipeline
Rounds analytics pipeline
Aviv Laufer
 
Capital onehadoopintro
Capital onehadoopintroCapital onehadoopintro
Capital onehadoopintro
Doug Chang
 
Building an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable ScaleBuilding an MLOps Stack for Companies at Reasonable Scale
Building an MLOps Stack for Companies at Reasonable Scale
Merelda
 
Spring boot 3g
Spring boot 3gSpring boot 3g
Spring boot 3g
vasya10
 
Google Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teamsGoogle Cloud Platform for Data Science teams
Google Cloud Platform for Data Science teams
Barton Rhodes
 
The 90-Day Startup with Google AppEngine for Java
The 90-Day Startup with Google AppEngine for JavaThe 90-Day Startup with Google AppEngine for Java
The 90-Day Startup with Google AppEngine for Java
David Chandler
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
eduarderwee
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Cloudera, Inc.
 
Yaetos Tech Overview
Yaetos Tech OverviewYaetos Tech Overview
Yaetos Tech Overview
prevota
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
Hadoop User Group
 
Zeppelin at Twitter
Zeppelin at TwitterZeppelin at Twitter
Zeppelin at Twitter
Prasad Wagle
 
Application Architecture For The Cloud
Application Architecture For The CloudApplication Architecture For The Cloud
Application Architecture For The Cloud
Steve Loughran
 
Using and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted WatersUsing and Extending Memory Analyzer into Uncharted Waters
Using and Extending Memory Analyzer into Uncharted Waters
Vladimir Pavlov
 
Apache Spark Introduction @ University College London
Apache Spark Introduction @ University College LondonApache Spark Introduction @ University College London
Apache Spark Introduction @ University College London
Vitthal Gogate
 
Building Automated Data Pipelines with Airflow.pdf
Building Automated Data Pipelines with Airflow.pdfBuilding Automated Data Pipelines with Airflow.pdf
Building Automated Data Pipelines with Airflow.pdf
abhaykm804
 
Zeppelin at twitter (sf data science meetup, july 2016)
Zeppelin at twitter (sf data science meetup, july 2016)Zeppelin at twitter (sf data science meetup, july 2016)
Zeppelin at twitter (sf data science meetup, july 2016)
Prasad Wagle
 
Discovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @TwitterDiscovery & Consumption of Analytics Data @Twitter
Discovery & Consumption of Analytics Data @Twitter
Kamran Munshi
 
Endofday: A Container Workflow Engine for Scalable, Reproducible Computation
Endofday: A Container Workflow Engine for Scalable, Reproducible ComputationEndofday: A Container Workflow Engine for Scalable, Reproducible Computation
Endofday: A Container Workflow Engine for Scalable, Reproducible Computation
Enis Afgan
 
Ad

More from Walter Liu (18)

Generative AI 在手機遊戲發行上的應用介紹.pdf
Generative AI 在手機遊戲發行上的應用介紹.pdfGenerative AI 在手機遊戲發行上的應用介紹.pdf
Generative AI 在手機遊戲發行上的應用介紹.pdf
Walter Liu
 
Infrastructure as code using Kubernetes
Infrastructure as code using KubernetesInfrastructure as code using Kubernetes
Infrastructure as code using Kubernetes
Walter Liu
 
手機遊戲數位行銷工具淺談
手機遊戲數位行銷工具淺談手機遊戲數位行銷工具淺談
手機遊戲數位行銷工具淺談
Walter Liu
 
關於第三方追蹤廣告再行銷的那些事兒
關於第三方追蹤廣告再行銷的那些事兒關於第三方追蹤廣告再行銷的那些事兒
關於第三方追蹤廣告再行銷的那些事兒
Walter Liu
 
Kubernetes Workshop
Kubernetes WorkshopKubernetes Workshop
Kubernetes Workshop
Walter Liu
 
Using Kubernetes to deploy Django in GCP
Using Kubernetes to deploy Django in GCPUsing Kubernetes to deploy Django in GCP
Using Kubernetes to deploy Django in GCP
Walter Liu
 
Game DDOS Prevention
Game DDOS PreventionGame DDOS Prevention
Game DDOS Prevention
Walter Liu
 
Super Fast Gevent Introduction
Super Fast Gevent IntroductionSuper Fast Gevent Introduction
Super Fast Gevent Introduction
Walter Liu
 
HTTP/2 to web dev
HTTP/2 to web devHTTP/2 to web dev
HTTP/2 to web dev
Walter Liu
 
HTTP/2 Introduction
HTTP/2 IntroductionHTTP/2 Introduction
HTTP/2 Introduction
Walter Liu
 
Consul - service discovery and others
Consul - service discovery and othersConsul - service discovery and others
Consul - service discovery and others
Walter Liu
 
Docker introduction
Docker introductionDocker introduction
Docker introduction
Walter Liu
 
WRS GIT Branching Model - draft
WRS GIT Branching Model - draftWRS GIT Branching Model - draft
WRS GIT Branching Model - draft
Walter Liu
 
Salty OPS – Saltstack Introduction
Salty OPS – Saltstack IntroductionSalty OPS – Saltstack Introduction
Salty OPS – Saltstack Introduction
Walter Liu
 
Django deployment and rpm+yum
Django deployment and rpm+yumDjango deployment and rpm+yum
Django deployment and rpm+yum
Walter Liu
 
Game Localization in Python
Game Localization in PythonGame Localization in Python
Game Localization in Python
Walter Liu
 
Service production from d3 pitfall viewpoint
Service production from d3 pitfall viewpointService production from d3 pitfall viewpoint
Service production from d3 pitfall viewpoint
Walter Liu
 
Celery in the Django
Celery in the DjangoCelery in the Django
Celery in the Django
Walter Liu
 
Generative AI 在手機遊戲發行上的應用介紹.pdf
Generative AI 在手機遊戲發行上的應用介紹.pdfGenerative AI 在手機遊戲發行上的應用介紹.pdf
Generative AI 在手機遊戲發行上的應用介紹.pdf
Walter Liu
 
Infrastructure as code using Kubernetes
Infrastructure as code using KubernetesInfrastructure as code using Kubernetes
Infrastructure as code using Kubernetes
Walter Liu
 
手機遊戲數位行銷工具淺談
手機遊戲數位行銷工具淺談手機遊戲數位行銷工具淺談
手機遊戲數位行銷工具淺談
Walter Liu
 
關於第三方追蹤廣告再行銷的那些事兒
關於第三方追蹤廣告再行銷的那些事兒關於第三方追蹤廣告再行銷的那些事兒
關於第三方追蹤廣告再行銷的那些事兒
Walter Liu
 
Kubernetes Workshop
Kubernetes WorkshopKubernetes Workshop
Kubernetes Workshop
Walter Liu
 
Using Kubernetes to deploy Django in GCP
Using Kubernetes to deploy Django in GCPUsing Kubernetes to deploy Django in GCP
Using Kubernetes to deploy Django in GCP
Walter Liu
 
Game DDOS Prevention
Game DDOS PreventionGame DDOS Prevention
Game DDOS Prevention
Walter Liu
 
Super Fast Gevent Introduction
Super Fast Gevent IntroductionSuper Fast Gevent Introduction
Super Fast Gevent Introduction
Walter Liu
 
HTTP/2 to web dev
HTTP/2 to web devHTTP/2 to web dev
HTTP/2 to web dev
Walter Liu
 
HTTP/2 Introduction
HTTP/2 IntroductionHTTP/2 Introduction
HTTP/2 Introduction
Walter Liu
 
Consul - service discovery and others
Consul - service discovery and othersConsul - service discovery and others
Consul - service discovery and others
Walter Liu
 
Docker introduction
Docker introductionDocker introduction
Docker introduction
Walter Liu
 
WRS GIT Branching Model - draft
WRS GIT Branching Model - draftWRS GIT Branching Model - draft
WRS GIT Branching Model - draft
Walter Liu
 
Salty OPS – Saltstack Introduction
Salty OPS – Saltstack IntroductionSalty OPS – Saltstack Introduction
Salty OPS – Saltstack Introduction
Walter Liu
 
Django deployment and rpm+yum
Django deployment and rpm+yumDjango deployment and rpm+yum
Django deployment and rpm+yum
Walter Liu
 
Game Localization in Python
Game Localization in PythonGame Localization in Python
Game Localization in Python
Walter Liu
 
Service production from d3 pitfall viewpoint
Service production from d3 pitfall viewpointService production from d3 pitfall viewpoint
Service production from d3 pitfall viewpoint
Walter Liu
 
Celery in the Django
Celery in the DjangoCelery in the Django
Celery in the Django
Walter Liu
 
Ad

Recently uploaded (20)

Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Linux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdfLinux Professional Institute LPIC-1 Exam.pdf
Linux Professional Institute LPIC-1 Exam.pdf
RHCSA Guru
 
tecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdftecnologias de las primeras civilizaciones.pdf
tecnologias de las primeras civilizaciones.pdf
fjgm517
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
HCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser EnvironmentsHCL Nomad Web – Best Practices and Managing Multiuser Environments
HCL Nomad Web – Best Practices and Managing Multiuser Environments
panagenda
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?How Can I use the AI Hype in my Business Context?
How Can I use the AI Hype in my Business Context?
Daniel Lehner
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdfComplete Guide to Advanced Logistics Management Software in Riyadh.pdf
Complete Guide to Advanced Logistics Management Software in Riyadh.pdf
Software Company
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 

Airflow - a data flow engine

  • 1. AIRFLOW - DATA FLOW ENGINE FROM AIRBNB By Walter Liu 2016/01/28
  • 2. Who am I?  Archer Architect of TrendMicro Coretech backend  A new user in Airflow
  • 3. Why Data Flow Engine?
  • 8. Data relationships  Data availability  if the data is not there, trigger the process to generate the data.  Data dependency  Some data relies on other data to generate.
  • 9. Operability  Job failed and resume  Job monitor  Backfill
  • 16. Demo  python tutorial.py  airflow list_dags  airflow list_tasks tutorial  airflow list_tasks tutorial --tree  airflow test tutorial print_date 2015-06-01  airflow test tutorial sleep 2015-06-01  airflow run tutorial templated 2015-06-01  Backfill: airflow backfill tutorial -s 2015-06-07 -e 2015-06-10  Run again  Run another date range Site: http://localhost:8080/admin/
  • 17. UI – DAG view
  • 18. UI – Tree View
  • 19. UI – Graph View
  • 21. UI – Task duration
  • 23. Airflow - Pros  Dynamic generating path  Have both Time scheduler and Command line trigger  Has Master/Worker model (automatically distribute tasks)  Scale if you have many tasks in a chain.  But not be so useful to most of our tasks. Maybe useful for full dump.  Fancy UI  Dependencies of tasks  Task success/failure  Scheduled tasks status  Has utility lib to wait_data for S3, Hadoop …
  • 24. Airflow - Cons  Additional DB/Redis or Rabbitmq for Celery  HA design: Use RDBMS/redis-cache in AWS  Require python 2.7 and many other libraries.  Not dependent on data. Just task dependency. (Not big cons)  Write check data file code.
  • 25. A snippet of the task of Luigi
  • 26. Recap  Solving DAG job management problem  Features to improve daily operation  Monitor/Visualization  Job failure and resume  Backfill
  • 28. Other Selections?  Oozie (on Hadoop)  Luigi (by Spotify)  Mario (Luigi in Scala)  Ruigi (Luigi in R)  Airflow (by Airbnb)  Mistral (by OpenStack)  Pinball (by Pinterest)  …
  • 29. Github statistics Start Date Star Commits in recent 2 months Airflow 2014/10/05 1516 362 Luigi 2011/11/13 3777 105 Pinball 2015/05/01 476 0 Oozie 2011/08/28 167 14
  • 30. Oozie  Pros  Mature  Native support of Hadoop ECO system  Python is not required.  Cons  Big and complex XML to define the flow and config.  Control flow is somehow restrictive  Hadoop ECO system is required.  Unix box, Java, Hadoop, Pig, ExtJS library
  • 32. Luigi - Pros  Makefile like, dependencies on data (Upward instead of downward)  Command line trigger  Not only Hadoop  Support target: HDFS/S3/RDBMS/…. (not limited)  Write your own code to support Casandra, etc.  Dependencies are decentralized, no big XML.  UI  Dynamic dependency/task is supported. (Don’t know it is better than airflow or not)  [util] Date algebra
  • 33. Luigi - Cons  No built-in triggering  No crontab like things (could borrow Chronos)  Use cron to run tasks on specific nodes. (normally)
  • 34. Luigi - Notes  Local execution  Pros: Easy to debug  Cons:  need access to other system. (Well, other system didn’t achieve this either.)  No scalability if you have many tasks in a chain.  Centralized scheduler servers (optional, recommended for production)  Make sure two instances of the same task are not running simultaneously  Provide visualization of everything that’s going on.  HA option  Run 2+ tasks on 2+ machines at the same time

Editor's Notes

  • #5: Cron is not operational scalable.
  • #21: Bottleneck and where the time spent.
  • #32: Reference: http://www.infoq.com/cn/articles/introductionOozie