Airflow_Best_Practices

The document provides a comprehensive overview of Apache Airflow, including its architecture, workflow components, and scheduling mechanisms. It discusses best practices for creating and managing workflows, such as ensuring tasks are idempotent and using meaningful IDs. Additionally, it covers unit testing, CI/CD integration with Jenkins, and handling compute-intensive tasks using PySpark and AWS EMR.

Agenda

• What is Airflow?
• Architecture Overview
• Workflow Components
• Example Workflow
• Establishing Connections
• Scheduling and Execution
• Best Practices
Agenda Continued

• Unit Testing Workflows
• Unit Testing Plugins
• Compute Intensive Tasks
• Managing work in JIRA
Airflow Overview

• Open-source and Python based
• Scheduling (like CRON)
• Workflow orchestration
• UI (web interface)
• Alerting and Monitoring
• Connection Management
• Many out-of-the-box integrations
• Scalable
CRON Scheduling

0 19 * * * bash /scripts/hello.sh

Runs every day at 7:00 PM.

Airflow uses CRON-like concepts for scheduling, except that a more complex set of tasks may be executed within each workflow. The same cron expression can be used as a DAG schedule, as sketched below.
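As a hedged illustration (the DAG id is a placeholder, assuming Airflow 2.x), the same 7:00 PM schedule could be attached to a DAG like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 19 * * *",  # same CRON syntax: every day at 7:00 PM
    catchup=False,
) as dag:
    # Trailing space stops Jinja from treating the .sh path as a template file.
    run_hello = BashOperator(task_id="run_hello", bash_command="bash /scripts/hello.sh ")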
Workflow (DAG) Overview

• The terms workflow and DAG are used interchangeably
• DAG – Directed Acyclic Graph
• Defined as a Python script
• Contains the instructions to execute
• Supports parallel execution
Task Failures

• Workflows may restart from where they failed
• May include defined automatic retries
• Email/messaging alerts are sent on task and workflow failure (see the sketch below)
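A minimal sketch of how retries and failure e-mails might be configured through default_args, assuming Airflow 2.x; the alert address and the retry values are illustrative placeholders:

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "retries": 3,                         # automatic retries per task
    "retry_delay": timedelta(minutes=5),  # wait between retries
    "email": ["data-team@example.com"],   # hypothetical alert address
    "email_on_failure": True,
    "email_on_retry": False,
}

with DAG(
    dag_id="retry_and_alerting_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    ...  # tasks defined here inherit the retry and alerting settings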
Example Workflow

Here is an example workflow with parallel execution.

Extract data from a database, write it to the data lake, clean and validate it, transform it, and finally load it into its destination. Each node is a separate task and the arrows illustrate task dependencies. A minimal sketch of such a DAG follows.
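The sketch below assumes Airflow 2.x; the task ids and placeholder callables are illustrative, not the actual project code.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _placeholder(**_):
    """Stand-in for the real extract/clean/transform/load logic."""


with DAG(
    dag_id="example_parallel_workflow",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="mssql_to_landing", python_callable=_placeholder)
    clean = PythonOperator(task_id="clean_and_validate", python_callable=_placeholder)
    transform_a = PythonOperator(task_id="transform_sales", python_callable=_placeholder)
    transform_b = PythonOperator(task_id="transform_customers", python_callable=_placeholder)
    load = PythonOperator(task_id="load_warehouse", python_callable=_placeholder)

    # Each arrow in the diagram becomes a dependency; the two transforms run in parallel.
    extract >> clean >> [transform_a, transform_b] >> load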
Workflow Components: Task

• Tasks make up a workflow
• Considered to be the building blocks
• May have dependencies between one another
• Composed of hooks, operators, and XComs
• The previous slide showed a workflow with many tasks
• Example: mssql_to_landing
Workflow Components: Hooks

• Essentially any connection to an external system
• Database connection
• Web API
• FTP server
• The "mssql_to_landing" task uses an MSSQL hook and an S3 hook (sketched below)
Workflow Components: Operators

• Logic used within a task
• 3 primary types
• Action operator – runs a piece of logic; this is used most often
• Transfer operator – moves data from point A to point B
• Sensor operator – waits until a condition is met (e.g., an endpoint is up)
• The "mssql_to_landing" task is a custom operator moving data from an MSSQL database, via a query, to an S3 location
Workflow Components: Sensor Operator

• Runs until a certain criterion is met
• An API is up
• A database contains data
• A file exists within a folder
• A time limit has been exceeded
• Pauses the downstream dependencies until the criterion is met (see the FileSensor sketch below)
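A short sketch of a sensor, assuming Airflow 2.x; the connection id, file path, and timings are placeholders:

from airflow.sensors.filesystem import FileSensor

# Defined inside a `with DAG(...)` block:
wait_for_file = FileSensor(
    task_id="wait_for_landing_file",
    fs_conn_id="fs_default",           # connection that points at a base path
    filepath="landing/customers.csv",  # relative to the connection's base path
    poke_interval=60,                  # check every 60 seconds
    timeout=60 * 60,                   # give up after one hour
)

# Downstream tasks only start once the file exists:
# wait_for_file >> clean_and_validate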
Workflow Components: Transfer Operator

• Simply moves data from point A to point B
• May use any type of hook
• The "mssql_to_landing" operator is a custom, specialized transfer operator
Workflow Components: XComs

• Allow you to share small pieces of state/data between tasks
• Not recommended for large data
• Instead, use remote storage like S3 and pass the path to it in an XCom (as sketched below)
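A minimal sketch of passing a small value (an S3 key) between tasks via XCom, assuming Airflow 2.x; the dag id, task ids, and key are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    s3_key = "landing/customers/2024-01-01.csv"  # the data itself would be written here
    return s3_key  # the return value is pushed to XCom automatically


def clean(ti, **context):
    # Pull the path, not the data: large payloads belong in S3, only the pointer in XCom.
    s3_key = ti.xcom_pull(task_ids="mssql_to_landing")
    print(f"cleaning {s3_key}")


with DAG(dag_id="xcom_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    extract_task = PythonOperator(task_id="mssql_to_landing", python_callable=extract)
    clean_task = PythonOperator(task_id="clean_and_validate", python_callable=clean)
    extract_task >> clean_task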
Workflow Components: Templating and Macros

• Airflow provides Jinja templating of commands and scripts
• Macros are predefined variables and functions available inside templates
• Useful for identifying incremental load date and time ranges in SQL queries
• Custom macros may be defined for use in templates (see the sketch below)
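A hedged sketch of Jinja templating in a templated field; {{ ds }} and macros.ds_add are built-in Airflow template variables/macros, while the task id and echoed text are placeholders:

from airflow.operators.bash import BashOperator

# Defined inside a `with DAG(...)` block; rendered at run time for each logical date.
print_load_window = BashOperator(
    task_id="print_load_window",
    bash_command=(
        "echo 'load rows updated between "
        "{{ macros.ds_add(ds, -1) }} and {{ ds }}'"
    ),
)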
Connections

• Configured in the web UI
• Each has a unique ID that is referenced within hooks
• Abstract in the sense that even a file path may be a connection used by a FileSensor
• Sensitive information is encrypted with Fernet keys
Variables

• Airflow allows you to store arbitrary key/value variables in the metadata database
• Useful for environment-specific information
• Development or production server
• May be used within a task (see the sketch below)
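A short sketch of reading Airflow Variables; the variable names and defaults are hypothetical and would be created beforehand in the UI (Admin -> Variables) or via the CLI:

from airflow.models import Variable

environment = Variable.get("environment", default_var="development")
landing_bucket = Variable.get("landing_bucket", default_var="dev-data-lake")

# Prefer reading Variables inside a task or a templated field rather than in the
# global scope of the DAG file, so DAG parsing stays fast (see the later best practice).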
Scheduling and Execution

• Scheduling is configured with a CRON-like format
• The start date defined in the DAG, together with the schedule, determines the first execution date; a run is triggered at the end of its schedule interval
• Example: an @daily schedule with a start date of today will first execute at midnight tonight
• Manual execution may be performed via the web UI or CLI
• The CLI is useful for identifying execution dates and times
Backfill and Catchup

• Backfill allows past executions of DAGs
• With a start date of 30 days ago, a daily schedule, and catchup set to true, there will be 30 DAG runs to "backfill" (see the sketch below)
• Can be problematic if not thought through, e.g. a schedule of 5-minute intervals with a start date months or years in the past
• With backfill you may also re-run a specific task
• For example, assume you change a statistic-computation task and need to re-run it against all historical data
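A minimal sketch of catchup, assuming Airflow 2.x; the dag id and dates are illustrative:

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="daily_stats",
    start_date=datetime(2024, 1, 1),  # a fixed start date in the past
    schedule_interval="@daily",
    catchup=True,  # the scheduler creates one DAG run per missed daily interval to "backfill"
) as dag:
    ...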
Best Practice: Provide meaningful DAG and
Task IDs

• DAG and task IDs are required
• The DAG ID must be unique
• They are arbitrary labels shown in the UI
• Providing meaningful IDs makes it easy to interpret the DAG at a high level through the web UI
Best Practice: Tasks should be idempotent and
deterministic

• Concepts derived from the functional programming paradigm
• Deterministic – the same input always produces the same output
• Idempotent – running the task multiple times has the same effect as running it once, so retries and backfills do not duplicate or corrupt data (see the sketch below)
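A hedged sketch of an idempotent, deterministic daily load; the hook usage, bucket, and key layout are hypothetical, not the actual project code:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def load_daily_partition(ds, **context):
    # Deterministic: the output location depends only on the logical date `ds`.
    key = f"warehouse/customers/dt={ds}/data.csv"

    # Idempotent: replace=True means re-running the task for the same date
    # (a retry or a backfill) produces the same end state instead of appending duplicates.
    s3 = S3Hook(aws_conn_id="aws_default")
    s3.load_string("id,name\n1,example\n", key=key, bucket_name="my-data-lake", replace=True)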
Best Practice: Document DAGs and Tasks

DAGs and tasks may be documented, and this documentation is displayed within the web UI. Create markdown templates to follow and require them to be used with every DAG and task.
Best Practice: Avoid costly code execution during load time of a DAG

• Airflow parses the DAG files on a regular basis (default every 30 seconds), reading the entire script
• Long or slow-running code in the global scope of the script makes every parse take extra time (see the sketch below)
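A short sketch contrasting load-time and run-time work; the URL and callable are hypothetical:

# Avoid: executed every time the scheduler parses the file (roughly every 30 seconds).
# import requests
# reference_data = requests.get("https://example.com/reference").json()


# Prefer: the expensive call runs only when the task itself executes.
def fetch_reference_data(**context):
    import requests  # imported lazily inside the task

    return requests.get("https://example.com/reference", timeout=30).json()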
Best Practice: Use the with statement

• The with statement in Python provides "context" to a block of code
• It is useful in DAG script creation to provide the DAG context to the associated tasks, instead of passing dag= to every operator (see the sketch below)
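A minimal sketch of the contrast, assuming Airflow 2.3+ (where EmptyOperator is available); the dag and task ids are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Without the with statement, every task needs an explicit dag= argument:
# dag = DAG(dag_id="no_context_demo", start_date=datetime(2024, 1, 1), schedule_interval=None)
# start = EmptyOperator(task_id="start", dag=dag)

# With the with statement, tasks defined in the block are attached automatically:
with DAG(dag_id="with_context_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    start = EmptyOperator(task_id="start")
    finish = EmptyOperator(task_id="finish")
    start >> finish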
Best Practice: Never hard code configurable
paths

• Code maintenance may become problematic when many hard-coded paths exist
• Instead use one of the following
• An Airflow Variable
• A configuration file
Best Practice: Always use bitshift operators for
defining task dependencies

The slide contrasts two screenshots of the same dependencies: a verbose version that is difficult to interpret quickly versus the bitshift version, which reads much better. Both styles are sketched below.
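A minimal sketch of both styles, assuming Airflow 2.3+ and that the verbose version relied on set_downstream calls; the dag and task ids are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="bitshift_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    extract = EmptyOperator(task_id="extract")
    clean = EmptyOperator(task_id="clean")
    load = EmptyOperator(task_id="load")

    # Harder to scan:
    # extract.set_downstream(clean)
    # clean.set_downstream(load)

    # Much easier to read with bitshift operators:
    extract >> clean >> load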
Best Practice: Use factories to generate
common patterns

• Write a function to generate a DAG or a set of tasks for a repeated pattern (see the sketch below)
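A hedged sketch of a task factory: a plain function that builds a repeated pattern of tasks; the table names and callables are placeholders:

from airflow.operators.python import PythonOperator


def make_ingest_tasks(dag, table_name):
    # Creates the extract -> validate pair used for every source table.
    extract = PythonOperator(
        task_id=f"{table_name}_to_landing",
        python_callable=lambda: print(f"extract {table_name}"),
        dag=dag,
    )
    validate = PythonOperator(
        task_id=f"validate_{table_name}",
        python_callable=lambda: print(f"validate {table_name}"),
        dag=dag,
    )
    extract >> validate
    return extract, validate


# Inside a DAG script:
# for table in ["customers", "orders", "invoices"]:
#     make_ingest_tasks(dag, table)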
Best Practice: Create new DAGs for major
changes

• Airflow loses the history of tasks that are deleted from a DAG
• It is best to create a new DAG and leave the old one in place
• Simply create a new one and label it with a version suffix (e.g., versionX)
Best Practice: Detect long running tasks with
SLAs and alerts

• SLA – Service Level Agreement
• An SLA can be assigned to a task or a DAG
• Defined as a maximum expected duration; missing it triggers an alert rather than failing the task (see the sketch below)
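A short sketch of an SLA and a miss callback, assuming Airflow 2.x; the timings, callback body, and ids are illustrative:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Hook this up to email/Slack as needed; printing keeps the sketch minimal.
    print(f"SLA missed for: {task_list}")


with DAG(
    dag_id="sla_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    sla_miss_callback=notify_sla_miss,
    catchup=False,
) as dag:
    nightly_export = BashOperator(
        task_id="nightly_export",
        bash_command="sleep 5",
        sla=timedelta(hours=1),  # alert if not finished within 1 hour of the run's scheduled start
    )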
Best Practice: Use pools for concurrency
management

• Pools may be defined in the web UI
• Assume you want at most N concurrent tasks running against a database; assigning those tasks to a pool with N slots enforces this (see the sketch below)
• Pools are entirely user defined
• They are all user defined
Best Practice: Use an airflowignore file to avoid
unnecessary file scanning

• A .airflowignore file may be defined in the DAGs directory
• It works similarly to a .gitignore file
• Using it allows the Airflow scheduler to skip unnecessary file scanning (example below)
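As a hedged example (the file and directory names are hypothetical), a .airflowignore placed in the DAGs folder lists regular-expression patterns for paths the scheduler should skip:

helpers/
scratch_.*\.py
.*_wip\.py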
CI/CD with Jenkins

• Automated build and deployment in dev and prod
• Deployment only occurs when unit tests, code coverage, and packaging succeed
• Deploy code to S3 or directly to the Airflow server for syncing
Unit Testing DAGs

• At a minimum, a test should be written to ensure each DAG can be loaded
• This avoids deploying DAGs that are broken by simple errors (see the sketch below)
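A hedged sketch of such a "can it even load" test with pytest and DagBag; the DAGs folder path is a placeholder:

import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    return DagBag(dag_folder="dags/", include_examples=False)


def test_no_import_errors(dag_bag):
    # Any syntax error, missing import, or bad operator argument surfaces here
    # before the DAG ever reaches the scheduler.
    assert dag_bag.import_errors == {}


def test_dags_were_loaded(dag_bag):
    assert len(dag_bag.dags) > 0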
Unit Testing Custom Code

• Custom utility functions and code should be tested for
• Valid input
• Invalid input (exceptions caught and raised)
• Tests may be written as functions or classes
• pytest is used for running the tests (see the sketch below)
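A minimal sketch of testing a custom utility for both valid and invalid input with pytest; build_s3_key is a hypothetical helper, not from the actual codebase:

import pytest


def build_s3_key(table: str, ds: str) -> str:
    # Hypothetical utility: build the landing key for a table and logical date.
    if not table or not ds:
        raise ValueError("table and ds are required")
    return f"landing/{table}/dt={ds}/data.csv"


def test_build_s3_key_valid_input():
    assert build_s3_key("customers", "2024-01-01") == "landing/customers/dt=2024-01-01/data.csv"


def test_build_s3_key_invalid_input():
    with pytest.raises(ValueError):
        build_s3_key("", "2024-01-01")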
Compute Intensive Tasks

• Computationally intensive tasks should make use of PySpark
• Apache Airflow allows you to run any code in an operator; however, the worker node may lack compute resources
• AWS EMR (Elastic MapReduce) will be used for PySpark
• Apache Airflow will orchestrate the AWS EMR cluster launch, compute, and tear-down (see the sketch below)
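A hedged sketch of the EMR orchestration pattern, assuming the Amazon provider package (import paths differ slightly between provider versions); the cluster configuration, bucket, and PySpark step are illustrative placeholders, not a working cluster definition:

from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Defined inside a `with DAG(...)` block:
create_cluster = EmrCreateJobFlowOperator(
    task_id="create_emr_cluster",
    job_flow_overrides={"Name": "airflow-pyspark", "ReleaseLabel": "emr-6.10.0"},
)

submit_step = EmrAddStepsOperator(
    task_id="submit_pyspark_step",
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster') }}",
    steps=[{
        "Name": "transform",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/transform.py"],
        },
    }],
)

wait_for_step = EmrStepSensor(
    task_id="wait_for_pyspark_step",
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster') }}",
    step_id="{{ task_instance.xcom_pull(task_ids='submit_pyspark_step')[0] }}",
)

terminate_cluster = EmrTerminateJobFlowOperator(
    task_id="terminate_emr_cluster",
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster') }}",
    trigger_rule="all_done",  # tear the cluster down even if the step fails
)

create_cluster >> submit_step >> wait_for_step >> terminate_cluster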
