Orchestrating workflows
Apache Airflow on GCP & AWS
Derrick Qin - Cloud Data Architect @ DoiT International
Multi-Cloud Engineering Meetup Australia
Tuesday 23 March 2021
About the speaker
Derrick Qin
Cloud Data Architect, DoiT International
Data engineering on GCP and AWS
Agenda
● Typical workflows
● Introducing Airflow
● Why Airflow is popular
● Airflow Concepts
● Demo
● How to run Airflow
● Google Cloud Composer
● Amazon Managed Workflows for Apache Airflow (MWAA)
● Airflow Best Practices (my view)
● Q&A
Typical workflows
● Daily - load batch files from different databases to a reporting database
● Daily/Weekly/Monthly - generate and deliver reports to stakeholders
● Daily - re-train machine learning models with fresh data
● Hourly - back up database
● Hourly - generate and send recommended products to customers based on their activity (think of the spam emails you get from eBay)
● On-demand - send registration emails to newly registered customers
● Every 5 minutes - run your price/discount watchdog: automatic price checks on retail websites or OzBargain
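The cadences above map naturally onto cron expressions. The sketch below is illustrative only: the dictionary keys and the `describe` helper are hypothetical names, and the preset aliases in the comments are the ones Airflow accepts as schedule shorthands.

```python
# Cron expressions for the cadences listed above.
# None means "no schedule": the workflow only runs when triggered on demand.
SCHEDULES = {
    "every_5_minutes": "*/5 * * * *",
    "hourly": "0 * * * *",    # Airflow preset alias: @hourly
    "daily": "0 0 * * *",     # Airflow preset alias: @daily
    "weekly": "0 0 * * 0",    # Airflow preset alias: @weekly
    "monthly": "0 0 1 * *",   # Airflow preset alias: @monthly
    "on_demand": None,
}


def describe(name: str) -> str:
    """Return a human-readable description of a named schedule."""
    cron = SCHEDULES[name]
    if cron is None:
        return f"{name}: manual trigger"
    return f"{name}: cron '{cron}'"
```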
Introducing Airflow
● Airflow is an orchestration platform to programmatically schedule and monitor workflows
● Started in late 2014 at Airbnb, open-sourced in mid-2015
● Governed by the Apache Software Foundation
● Over 20K stars on GitHub
● Used by many well-known organizations
Why Airflow is popular
● Workflows are defined as Python code
○ More flexible - you get the full Python programming language
○ Workflows as code are easier to test
○ Code can be reused across workflows
● Batteries-included platform, with integrations for
○ Popular databases: MySQL, PostgreSQL, MongoDB, Oracle, SQL Server, Snowflake, BigQuery
○ Services: Databricks, Datadog, Elasticsearch, Jenkins, Salesforce, SendGrid, Slack, Zendesk
○ Public cloud platforms: AWS, GCP, Azure
● Informative, feature-rich UI to visualize workflow status, monitor progress, troubleshoot issues, and trigger or re-trigger workflows and their tasks
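To illustrate why "workflow as code" is testable, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+), not Airflow itself. The workflow, its task names and both helper functions are hypothetical; the point is that a workflow expressed as plain data can be checked with ordinary assertions.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
WORKFLOW = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}


def execution_order(workflow):
    """Return one valid run order; raises CycleError if the graph has a cycle."""
    return list(TopologicalSorter(workflow).static_order())


def is_acyclic(workflow):
    """True if the dependency graph is a valid DAG."""
    try:
        execution_order(workflow)
        return True
    except CycleError:
        return False
```

A check such as `is_acyclic(WORKFLOW)` can run in CI before a workflow is ever deployed.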
Airflow concepts
● DAG: a directed acyclic graph, i.e. a directed graph with no cycles. In Airflow, a DAG defines a workflow
● Operator: defines what should be executed. Examples: run a Bash command, read a file, call an API, load data into a table
● Task: an instance of an operator; it is a node in a DAG/workflow
● Sensor: a special operator that runs repeatedly until a predefined condition is fulfilled. Example: a file sensor waits until a file lands, then lets the workflow continue
● Hook: an interface to an external platform or system. Example: S3Hook wraps the AWS S3 API to provide easy access to S3 buckets
● DAG run: a single execution of a DAG, created each time the DAG is triggered. It represents one instance of the workflow
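The sensor idea can be sketched in plain Python as a polling loop. This is not Airflow's implementation: `wait_for` is a hypothetical helper, and its `poke_interval` and `timeout` parameters are only named after the corresponding sensor settings. The `sleep` and `clock` arguments are assumptions added so the loop can be tested without real waiting.

```python
import time
from typing import Callable


def wait_for(
    condition: Callable[[], bool],
    poke_interval: float = 1.0,
    timeout: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Check `condition` repeatedly until it is True or `timeout` seconds pass.

    Returns True if the condition was met, False if the wait timed out.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poke_interval)
```

A file sensor would then be roughly `wait_for(lambda: os.path.exists(path))`, with the downstream task running only when it returns True.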
Airflow architecture
● Web UI/webserver
● Scheduler
● Worker
● Metadata database
Executors
● SequentialExecutor
● LocalExecutor
● CeleryExecutor
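The difference between executors is mainly where and how many tasks run at once. The sketch below is a rough analogy in standard-library Python, not Airflow code: `run_sequential` runs one task at a time, as SequentialExecutor does, while `run_parallel` runs all ready tasks concurrently on one machine, loosely like LocalExecutor (threads stand in for its worker processes). CeleryExecutor, which distributes tasks to separate worker machines, is not modelled. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_sequential(workflow, run_task):
    """Run one task at a time, in dependency order."""
    finished = []
    for task in TopologicalSorter(workflow).static_order():
        run_task(task)
        finished.append(task)
    return finished


def run_parallel(workflow, run_task, max_workers=4):
    """Run every task whose upstream tasks are done, batch by batch, in a pool."""
    sorter = TopologicalSorter(workflow)
    sorter.prepare()
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()
            # list() forces completion and re-raises any task exception.
            list(pool.map(run_task, ready))
            sorter.done(*ready)
            finished.extend(ready)
    return finished
```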
How to run Airflow locally
● Local setup with Python virtualenv
○ http://airflow.apache.org/docs/apache-airflow/stable/start/local.html
● Local setup with Docker
○ http://airflow.apache.org/docs/apache-airflow/stable/start/docker.html
Demo - local docker setup
Any hosted Airflow solutions?
● GCP Cloud Composer
● Amazon Managed Workflows for Apache Airflow (MWAA)
● astronomer.io
GCP Cloud Composer
● Deployment via Console, gcloud, API, Terraform
● Runs on GKE with *auto-scaling support
○ https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60
● Running the scheduler and workers on GKE enables you to use the KubernetesPodOperator to run any container workload.
GCP Cloud Composer
● DAGs and plugins are deployed and managed in Google Cloud Storage (GCS) buckets
● DAGs can be triggered via
○ Composer API
○ Composer CLI - a wrapper around the Airflow CLI
● Plugin management can be tricky
○ Managed by Cloud Build - troubleshoot from the Cloud Build logs
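As a rough sketch of the Composer CLI workflow, the helpers below only build `gcloud` argument lists and do not execute anything. The environment name, location and DAG file are placeholders, both function names are hypothetical, and `trigger_dag` assumes the Airflow 1.10 CLI (Airflow 2 renamed this subcommand).

```python
def import_dag_command(environment, location, dag_file):
    """Build the gcloud command that uploads a DAG file to the environment's bucket."""
    return [
        "gcloud", "composer", "environments", "storage", "dags", "import",
        "--environment", environment,
        "--location", location,
        "--source", dag_file,
    ]


def run_airflow_command(environment, location, subcommand, *args):
    """Build the gcloud command that forwards a subcommand to the Airflow CLI."""
    command = [
        "gcloud", "composer", "environments", "run", environment,
        "--location", location, subcommand,
    ]
    if args:
        # Arguments for the Airflow subcommand go after a "--" separator.
        command += ["--", *args]
    return command


# Example: trigger a DAG (Airflow 1.10 subcommand name; placeholder values).
trigger = run_airflow_command("my-env", "us-central1", "trigger_dag", "my_dag")
```

Passing either list to `subprocess.run(..., check=True)` would execute it, assuming `gcloud` is installed and authenticated.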
GCP Cloud Composer
● Airflow containers can be accessed via kubectl
○ kubectl -n composer-1-14-4-airflow-example-namespace exec -it airflow-worker-1a2b3c-x0yz -c airflow-worker -- /bin/bash
● Airflow data can only be accessed from a worker - open a shell on the worker, then use SQLAlchemy
● Security
○ Airflow's permissions are bound to the environment's service account
○ Google Cloud Secret Manager - can back Airflow connections, variables and Jinja templates
○ Override the default service account by using a new connection
Demo - Cloud Composer
AWS MWAA
● Deployment via Console, AWS CLI, SDK, CloudFormation
● Runs on AWS Fargate and Amazon SQS
○ Workers can be auto-scaled based on load
● Security
○ Data encrypted using AWS KMS
○ Use AWS Secrets Manager to manage secrets, connections and variables
AWS MWAA
● DAGs and plugins are deployed and managed in Amazon S3 buckets
● DAGs can be triggered via
○ AWS SDK/API
○ AWS CLI
Airflow container access?
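For triggering DAGs on MWAA, one possible approach is the CLI-token flow. The sketch assumes that `aws mwaa create-cli-token --name <environment>` returns a short-lived token plus the web server hostname, and that Airflow CLI commands are POSTed as plain text to the `/aws_mwaa/cli` endpoint. The function name, hostname, token and DAG id are placeholders, and the request is only built here, not sent.

```python
import urllib.request


def build_mwaa_cli_request(web_server_hostname, cli_token, airflow_command):
    """Build (but do not send) an HTTP request that runs an Airflow CLI command."""
    return urllib.request.Request(
        url=f"https://{web_server_hostname}/aws_mwaa/cli",
        data=airflow_command.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {cli_token}",
            "Content-Type": "text/plain",
        },
        method="POST",
    )


# Placeholder values; "trigger_dag" is the Airflow 1.10 subcommand name.
request = build_mwaa_cli_request(
    "example.airflow.amazonaws.com", "TOKEN", "trigger_dag my_dag"
)
```

Sending it with `urllib.request.urlopen(request)` would return the command's output, which is worth inspecting before relying on this in automation.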
Demo - AWS MWAA
Airflow Best Practices
● Balance DAG readability against code abstraction
● Limit local compute - push heavy processing to external systems rather than the Airflow workers
● Use built-in libraries where possible
● Build custom dashboards for non-technical stakeholders
● One DAG per data source
● Test DAGs and custom plugins
○ Unit testing and end-to-end testing
■ Will be covered in the next meetup talk
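Two of the practices above, "one DAG per data source" and "test DAGs", can be sketched together in standard-library Python. This is not Airflow code: the source names, the `build_pipeline` factory and the test function are all hypothetical, and a real project would generate Airflow DAG objects instead of dictionaries.

```python
from graphlib import TopologicalSorter

DATA_SOURCES = ["orders_db", "crm", "web_events"]  # hypothetical source names


def build_pipeline(source):
    """Return one workflow definition (task -> upstream tasks) for one source."""
    extract = f"extract_{source}"
    load = f"load_{source}"
    check = f"check_{source}"
    return {extract: set(), load: {extract}, check: {load}}


# One small, readable pipeline per source, all produced by the same factory.
PIPELINES = {source: build_pipeline(source) for source in DATA_SOURCES}


def test_pipelines_are_valid():
    """Unit test: every generated pipeline is acyclic and runs in the expected order."""
    for source, pipeline in PIPELINES.items():
        # static_order() raises graphlib.CycleError if the graph has a cycle.
        order = list(TopologicalSorter(pipeline).static_order())
        assert order == [f"extract_{source}", f"load_{source}", f"check_{source}"]
```

The factory keeps the abstraction in one place, while each generated pipeline stays small enough to read and debug on its own.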
Is Airflow my only option?
● Crontab
● Jenkins
● GCP Cloud Build
● Argo
● AWS Step Functions
● ...