Airflow_Best_Practices

The document provides a comprehensive overview of Apache Airflow, including its architecture, workflow components, and scheduling mechanisms. It discusses best practices for creating and managing workflows, such as ensuring tasks are idempotent and using meaningful IDs. Additionally, it covers unit testing, CI/CD integration with Jenkins, and handling compute-intensive tasks using PySpark and AWS EMR.

Agenda

• What is Airflow?
• Architecture Overview
• Workflow Components
• Example Workflow
• Establishing Connections
• Scheduling and Execution
• Best Practices
Agenda Continued

• Unit Testing Workflows
• Unit Testing Plugins
• Compute Intensive Tasks
• Managing work in JIRA
Airflow Overview

• Open-source and Python based
• Scheduling (like CRON)
• Workflow orchestration
• UI (web interface)
• Alerting and Monitoring
• Connection Management
• Many out-of-the-box integrations
• Scalable
CRON Scheduling

0 19 * * * bash /scripts/hello.sh

Runs every day at 7:00 PM.

Airflow uses CRON-like concepts for scheduling, except that a more complex set of tasks may be executed within each workflow. The same cron expression can be used as a DAG schedule, as sketched below.
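As a hedged illustration (the DAG id is a placeholder, assuming Airflow 2.x), the same 7:00 PM schedule could be attached to a DAG like this:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="hello_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 19 * * *",  # same CRON syntax: every day at 7:00 PM
    catchup=False,
) as dag:
    # Trailing space stops Jinja from treating the .sh path as a template file.
    run_hello = BashOperator(task_id="run_hello", bash_command="bash /scripts/hello.sh ")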
Workflow (DAG) Overview

• The terms workflow and DAG are used interchangeably
• DAG – Directed Acyclic Graph
• Defined as a Python script
• Contains the instructions to execute
• Supports parallel execution
Task Failures

• Workflows may restart from where they failed
• May include defined automatic retries
• Email/messaging alerts are sent on task and workflow failure (see the sketch below)
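A minimal sketch of how retries and failure e-mails might be configured through default_args, assuming Airflow 2.x; the alert address and the retry values are illustrative placeholders:

from datetime import datetime, timedelta

from airflow import DAG

default_args = {
    "retries": 3,                         # automatic retries per task
    "retry_delay": timedelta(minutes=5),  # wait between retries
    "email": ["data-team@example.com"],   # hypothetical alert address
    "email_on_failure": True,
    "email_on_retry": False,
}

with DAG(
    dag_id="retry_and_alerting_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    ...  # tasks defined here inherit the retry and alerting settings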
Example Workflow

Here is an example workflow with parallel execution.

Extract data from a database, write it to the data lake, clean and validate it, transform it, and finally load it into its destination. Each node is a separate task and the arrows illustrate task dependencies. A minimal sketch of such a DAG follows.
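The sketch below assumes Airflow 2.x; the task ids and placeholder callables are illustrative, not the actual project code.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _placeholder(**_):
    """Stand-in for the real extract/clean/transform/load logic."""


with DAG(
    dag_id="example_parallel_workflow",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="mssql_to_landing", python_callable=_placeholder)
    clean = PythonOperator(task_id="clean_and_validate", python_callable=_placeholder)
    transform_a = PythonOperator(task_id="transform_sales", python_callable=_placeholder)
    transform_b = PythonOperator(task_id="transform_customers", python_callable=_placeholder)
    load = PythonOperator(task_id="load_warehouse", python_callable=_placeholder)

    # Each arrow in the diagram becomes a dependency; the two transforms run in parallel.
    extract >> clean >> [transform_a, transform_b] >> load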
Workflow Components: Task

• Tasks make up a workflow
• Considered to be the building blocks
• May have dependencies between one another
• Composed of hooks, operators, and XComs
• The previous slide showed a workflow with many tasks
• Example: mssql_to_landing
Workflow Components: Hooks

• Essentially any connection to an external system
• Database connection
• Web API
• FTP server
• The "mssql_to_landing" task uses an MSSQL hook and an S3 hook (sketched below)
Workflow Components: Operators

• Logic used within a task
• 3 primary types
• Action operator – runs a piece of logic; this is used most often
• Transfer operator – moves data from point A to point B
• Sensor operator – waits until a condition is met (e.g., an endpoint is up)
• The "mssql_to_landing" task is a custom operator moving data from an MSSQL database, via a query, to an S3 location
Workflow Components: Sensor Operator

• Runs until a certain criterion is met
• An API is up
• A database contains data
• A file exists within a folder
• A time limit has been exceeded
• Pauses the downstream dependencies until the criterion is met (see the FileSensor sketch below)
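A short sketch of a sensor, assuming Airflow 2.x; the connection id, file path, and timings are placeholders:

from airflow.sensors.filesystem import FileSensor

# Defined inside a `with DAG(...)` block:
wait_for_file = FileSensor(
    task_id="wait_for_landing_file",
    fs_conn_id="fs_default",           # connection that points at a base path
    filepath="landing/customers.csv",  # relative to the connection's base path
    poke_interval=60,                  # check every 60 seconds
    timeout=60 * 60,                   # give up after one hour
)

# Downstream tasks only start once the file exists:
# wait_for_file >> clean_and_validate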
Workflow Components: Transfer Operator

• Simply moves data from point A to point B
• May use any type of hook
• The "mssql_to_landing" operator is a custom, specialized transfer operator
Workflow Components: XComs

• Allow you to share small pieces of state/data between tasks
• Not recommended for large data
• Instead, use remote storage like S3 and pass the path to it in an XCom (as sketched below)
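A minimal sketch of passing a small value (an S3 key) between tasks via XCom, assuming Airflow 2.x; the dag id, task ids, and key are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract(**context):
    s3_key = "landing/customers/2024-01-01.csv"  # the data itself would be written here
    return s3_key  # the return value is pushed to XCom automatically


def clean(ti, **context):
    # Pull the path, not the data: large payloads belong in S3, only the pointer in XCom.
    s3_key = ti.xcom_pull(task_ids="mssql_to_landing")
    print(f"cleaning {s3_key}")


with DAG(dag_id="xcom_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    extract_task = PythonOperator(task_id="mssql_to_landing", python_callable=extract)
    clean_task = PythonOperator(task_id="clean_and_validate", python_callable=clean)
    extract_task >> clean_task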
Workflow Components: Templating and Macros

• Airflow provides Jinja templating of commands and scripts
• Macros are predefined variables and functions available inside templates
• Useful for identifying incremental load date and time ranges in SQL queries
• Custom macros may be defined for use in templates (see the sketch below)
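A hedged sketch of Jinja templating in a templated field; {{ ds }} and macros.ds_add are built-in Airflow template variables/macros, while the task id and echoed text are placeholders:

from airflow.operators.bash import BashOperator

# Defined inside a `with DAG(...)` block; rendered at run time for each logical date.
print_load_window = BashOperator(
    task_id="print_load_window",
    bash_command=(
        "echo 'load rows updated between "
        "{{ macros.ds_add(ds, -1) }} and {{ ds }}'"
    ),
)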
Connections

• Configured in the web UI
• Each has a unique ID that is referenced within hooks
• Abstract in the sense that even a file path may be a connection used by a FileSensor
• Sensitive information is encrypted with Fernet keys
Variables

• Airflow allows you to store arbitrary key/value variables in the metadata database
• Useful for environment-specific information
• Development or production server
• May be used within a task (see the sketch below)
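A short sketch of reading Airflow Variables; the variable names and defaults are hypothetical and would be created beforehand in the UI (Admin -> Variables) or via the CLI:

from airflow.models import Variable

environment = Variable.get("environment", default_var="development")
landing_bucket = Variable.get("landing_bucket", default_var="dev-data-lake")

# Prefer reading Variables inside a task or a templated field rather than in the
# global scope of the DAG file, so DAG parsing stays fast (see the later best practice).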
Scheduling and Execution

• Scheduling is configured with a CRON-like format
• The start date defined in the DAG, together with the schedule, determines the first execution date; a run is triggered at the end of its schedule interval
• Example: an @daily schedule with a start date of today will first execute at midnight tonight
• Manual execution may be performed via the web UI or CLI
• The CLI is useful for identifying execution dates and times
Backfill and Catchup

• Backfill allows past executions of DAGs
• With a start date of 30 days ago, a daily schedule, and catchup set to true, there will be 30 DAG runs to "backfill" (see the sketch below)
• Can be problematic if not thought through, e.g. a schedule of 5-minute intervals with a start date months or years in the past
• With backfill you may also re-run a specific task
• For example, assume you change a statistic-computation task and need to re-run it against all historical data
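A minimal sketch of catchup, assuming Airflow 2.x; the dag id and dates are illustrative:

from datetime import datetime

from airflow import DAG

with DAG(
    dag_id="daily_stats",
    start_date=datetime(2024, 1, 1),  # a fixed start date in the past
    schedule_interval="@daily",
    catchup=True,  # the scheduler creates one DAG run per missed daily interval to "backfill"
) as dag:
    ...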
Best Practice: Provide meaningful DAG and
Task IDs

• DAG and task IDs are required
• The DAG ID must be unique
• They are arbitrary labels shown in the UI
• Providing meaningful IDs makes it easy to interpret the DAG at a high level through the web UI
Best Practice: Tasks should be idempotent and
deterministic

• Concepts derived from the functional programming paradigm
• Deterministic – the same input always produces the same output
• Idempotent – running the task multiple times has the same effect as running it once, so retries and backfills do not duplicate or corrupt data (see the sketch below)
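A hedged sketch of an idempotent, deterministic daily load; the hook usage, bucket, and key layout are hypothetical, not the actual project code:

from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def load_daily_partition(ds, **context):
    # Deterministic: the output location depends only on the logical date `ds`.
    key = f"warehouse/customers/dt={ds}/data.csv"

    # Idempotent: replace=True means re-running the task for the same date
    # (a retry or a backfill) produces the same end state instead of appending duplicates.
    s3 = S3Hook(aws_conn_id="aws_default")
    s3.load_string("id,name\n1,example\n", key=key, bucket_name="my-data-lake", replace=True)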
Best Practice: Document DAGs and Tasks

DAGs and tasks may be documented, and this documentation is displayed within the web UI. Create markdown templates to follow and require them to be used with every DAG and task.
Best Practice: Avoid costly code execution during load time of a DAG

• Airflow parses the DAG files on a regular basis (default every 30 seconds), reading the entire script
• Long or slow-running code in the global scope of the script makes every parse take extra time (see the sketch below)
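A short sketch contrasting load-time and run-time work; the URL and callable are hypothetical:

# Avoid: executed every time the scheduler parses the file (roughly every 30 seconds).
# import requests
# reference_data = requests.get("https://example.com/reference").json()


# Prefer: the expensive call runs only when the task itself executes.
def fetch_reference_data(**context):
    import requests  # imported lazily inside the task

    return requests.get("https://example.com/reference", timeout=30).json()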
Best Practice: Use the with statement

• The with statement in Python provides "context" to a block of code
• It is useful in DAG script creation to provide the DAG context to the associated tasks, instead of passing dag= to every operator (see the sketch below)
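A minimal sketch of the contrast, assuming Airflow 2.3+ (where EmptyOperator is available); the dag and task ids are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Without the with statement, every task needs an explicit dag= argument:
# dag = DAG(dag_id="no_context_demo", start_date=datetime(2024, 1, 1), schedule_interval=None)
# start = EmptyOperator(task_id="start", dag=dag)

# With the with statement, tasks defined in the block are attached automatically:
with DAG(dag_id="with_context_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    start = EmptyOperator(task_id="start")
    finish = EmptyOperator(task_id="finish")
    start >> finish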
Best Practice: Never hard code configurable
paths

• Code maintenance may become problematic when many hard-coded paths exist
• Instead use one of the following
• An Airflow Variable
• A configuration file
Best Practice: Always use bitshift operators for
defining task dependencies

The slide contrasts two screenshots of the same dependencies: a verbose version that is difficult to interpret quickly versus the bitshift version, which reads much better. Both styles are sketched below.
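A minimal sketch of both styles, assuming Airflow 2.3+ and that the verbose version relied on set_downstream calls; the dag and task ids are placeholders:

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="bitshift_demo", start_date=datetime(2024, 1, 1), schedule_interval=None) as dag:
    extract = EmptyOperator(task_id="extract")
    clean = EmptyOperator(task_id="clean")
    load = EmptyOperator(task_id="load")

    # Harder to scan:
    # extract.set_downstream(clean)
    # clean.set_downstream(load)

    # Much easier to read with bitshift operators:
    extract >> clean >> load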
Best Practice: Use factories to generate
common patterns

• Write a function to generate a DAG or a set of tasks for a repeated pattern (see the sketch below)
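A hedged sketch of a task factory: a plain function that builds a repeated pattern of tasks; the table names and callables are placeholders:

from airflow.operators.python import PythonOperator


def make_ingest_tasks(dag, table_name):
    # Creates the extract -> validate pair used for every source table.
    extract = PythonOperator(
        task_id=f"{table_name}_to_landing",
        python_callable=lambda: print(f"extract {table_name}"),
        dag=dag,
    )
    validate = PythonOperator(
        task_id=f"validate_{table_name}",
        python_callable=lambda: print(f"validate {table_name}"),
        dag=dag,
    )
    extract >> validate
    return extract, validate


# Inside a DAG script:
# for table in ["customers", "orders", "invoices"]:
#     make_ingest_tasks(dag, table)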
Best Practice: Create new DAGs for major
changes

• Airflow loses the history of tasks that are deleted from a DAG
• It is best to create a new DAG and leave the old one in place
• Simply create a new one and label it with a version suffix (e.g., versionX)
Best Practice: Detect long running tasks with
SLAs and alerts

• SLA – Service Level Agreement
• An SLA can be assigned to a task or a DAG
• Defined as a maximum expected duration; missing it triggers an alert rather than failing the task (see the sketch below)
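A short sketch of an SLA and a miss callback, assuming Airflow 2.x; the timings, callback body, and ids are illustrative:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Hook this up to email/Slack as needed; printing keeps the sketch minimal.
    print(f"SLA missed for: {task_list}")


with DAG(
    dag_id="sla_demo",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    sla_miss_callback=notify_sla_miss,
    catchup=False,
) as dag:
    nightly_export = BashOperator(
        task_id="nightly_export",
        bash_command="sleep 5",
        sla=timedelta(hours=1),  # alert if not finished within 1 hour of the run's scheduled start
    )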
Best Practice: Use pools for concurrency
management

• Pools may be defined in the web UI
• Assume you want at most N concurrent tasks running against a database; assigning those tasks to a pool with N slots enforces this (see the sketch below)
• Pools are entirely user defined
• They are all user defined
Best Practice: Use an airflowignore file to avoid
unnecessary file scanning

• A .airflowignore file may be defined in the DAGs directory
• It works similarly to a .gitignore file
• Using it allows the Airflow scheduler to skip unnecessary file scanning (example below)
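As a hedged example (the file and directory names are hypothetical), a .airflowignore placed in the DAGs folder lists regular-expression patterns for paths the scheduler should skip:

helpers/
scratch_.*\.py
.*_wip\.py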
CI/CD with Jenkins

• Automated build and deployment in dev and prod
• Deployment only occurs when unit tests, code coverage, and packaging succeed
• Deploy code to S3 or directly to the Airflow server for syncing
Unit Testing DAGs

• At a minimum, a test should be written to ensure each DAG can be loaded
• This avoids deploying DAGs that are broken by simple errors (see the sketch below)
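A hedged sketch of such a "can it even load" test with pytest and DagBag; the DAGs folder path is a placeholder:

import pytest
from airflow.models import DagBag


@pytest.fixture(scope="session")
def dag_bag():
    return DagBag(dag_folder="dags/", include_examples=False)


def test_no_import_errors(dag_bag):
    # Any syntax error, missing import, or bad operator argument surfaces here
    # before the DAG ever reaches the scheduler.
    assert dag_bag.import_errors == {}


def test_dags_were_loaded(dag_bag):
    assert len(dag_bag.dags) > 0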
Unit Testing Custom Code

• Custom utility functions and code should be tested for
• Valid input
• Invalid input (exceptions caught and raised)
• Tests may be written as functions or classes
• pytest is used for running the tests (see the sketch below)
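A minimal sketch of testing a custom utility for both valid and invalid input with pytest; build_s3_key is a hypothetical helper, not from the actual codebase:

import pytest


def build_s3_key(table: str, ds: str) -> str:
    # Hypothetical utility: build the landing key for a table and logical date.
    if not table or not ds:
        raise ValueError("table and ds are required")
    return f"landing/{table}/dt={ds}/data.csv"


def test_build_s3_key_valid_input():
    assert build_s3_key("customers", "2024-01-01") == "landing/customers/dt=2024-01-01/data.csv"


def test_build_s3_key_invalid_input():
    with pytest.raises(ValueError):
        build_s3_key("", "2024-01-01")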
Compute Intensive Tasks

• Computationally intensive tasks should make use of PySpark
• Apache Airflow allows you to run any code in an operator; however, the worker node may lack compute resources
• AWS EMR (Elastic MapReduce) will be used for PySpark
• Apache Airflow will orchestrate the AWS EMR cluster launch, compute, and tear-down (see the sketch below)
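A hedged sketch of the EMR orchestration pattern, assuming the Amazon provider package (import paths differ slightly between provider versions); the cluster configuration, bucket, and PySpark step are illustrative placeholders, not a working cluster definition:

from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Defined inside a `with DAG(...)` block:
create_cluster = EmrCreateJobFlowOperator(
    task_id="create_emr_cluster",
    job_flow_overrides={"Name": "airflow-pyspark", "ReleaseLabel": "emr-6.10.0"},
)

submit_step = EmrAddStepsOperator(
    task_id="submit_pyspark_step",
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster') }}",
    steps=[{
        "Name": "transform",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/transform.py"],
        },
    }],
)

wait_for_step = EmrStepSensor(
    task_id="wait_for_pyspark_step",
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster') }}",
    step_id="{{ task_instance.xcom_pull(task_ids='submit_pyspark_step')[0] }}",
)

terminate_cluster = EmrTerminateJobFlowOperator(
    task_id="terminate_emr_cluster",
    job_flow_id="{{ task_instance.xcom_pull(task_ids='create_emr_cluster') }}",
    trigger_rule="all_done",  # tear the cluster down even if the step fails
)

create_cluster >> submit_step >> wait_for_step >> terminate_cluster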
