Building and managing complex dependencies pipeline using Apache Oozie
Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team
Apache Oozie PMC member and committer
Agenda
1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User monitoring systems
5. Future Work
Why Oozie?
• Out-of-box support for multiple job types
  • Java, shell, distcp
  • MapReduce
    • Pipes, streaming
  • Pig, Hive, Spark
• Highly scalable
• High availability
  • Hot-Hot with rolling upgrades
  • Load balanced
• Hue integration
[Diagram: Oozie and the components it works with: HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog]
Deployment Architecture at Yahoo
[Architecture diagram. Components: load balancer, Oozie Server 1, Oozie Server 2, ZooKeeper (via Curator, lock management), Oracle RAC, Hadoop cluster / HBase / HCatalog. Flow labels: submit request; request redirection; inter-server communication for log streaming, sharelib update, etc. Security labels: HTTPS + Kerberos / cookie-based auth; HTTPS + Kerberos; Kerberos.]
Scale at Yahoo
Deployed on all clusters (production, non-production)
One instance per cluster
75 products / 2000+ projects
255 monthly users
90,00 workflow jobs daily (June 2016, one busy cluster)
Between 1 and 8 actions; avg. 4 actions/workflow
Extreme use case: 100-200 workflow jobs submitted per minute
2,277 coordinator jobs daily (June 2016, one busy cluster)
Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
99% of workflow jobs are kicked off from coordinators
97 bundle jobs daily (June 2016, one busy cluster)
Data Pipelines
Ad Exchange
Ad Latency
Search Advertising
Content Management
Content Optimization
Content Personalization
Flickr Video
Audience Targeting
Behavioral Targeting
Partner Targeting
Retargeting
Web Targeting
Advertisement Content Targeting
Data Pipelines
Anti Spam
Content
Retargeting
Research
Dashboards & Reports
Forecasting
Email Data Intelligence Data Management
Audience Pipeline
Use Case - Data pipeline
Large Scale Data Pipeline Requirements
• Administrative
  • One should be able to start, stop, and pause all related pipelines, or part of them, at the same time
• Dependency Management
  • BCP (business continuity planning) support
  • Data is not guaranteed; start processing even if only partial data is available
  • Mandatory and optional feeds
Large Scale Data Pipeline Requirements
• Multiple Providers
  • If data is available from multiple providers, the provider priority can be specified
  • Combining datasets from multiple providers
• SLA Management
  • Monitor pipeline processing to take immediate action in case of failures or SLA misses
  • Pipeline owners should get notified if an SLA is missed
Bundle
• The Bundle system allows the user to define and execute a set of loosely coupled coordinators. The coordinators may depend on each other, but the dependency is enforced only through their inputs and outputs.
• A bundle can be used to start, stop, suspend, resume, or rerun the whole pipeline
Complex dependencies
OOZIE-1976 : Specifying coordinator input datasets in more logical ways
BCP Support
Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as either dataset A or B is available.
<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
Minimum availability processing
• Sometimes we want to process even if only partial data is available.
<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>
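The minimum-availability rule above can be modelled in a few lines. This is only a simplified sketch of the idea, not Oozie's implementation; the function name and the list-of-booleans representation of dataset instances are assumptions made for illustration.

```python
from typing import Sequence


def has_minimum_instances(available: Sequence[bool], minimum: int) -> bool:
    """Return True once at least `minimum` of the expected instances exist.

    `available` holds one flag per expected dataset instance (hypothetical
    representation); `minimum` plays the role of the min attribute.
    """
    return sum(1 for present in available if present) >= minimum


# Six expected instances, four of which have arrived so far.
arrived = [True, True, False, True, True, False]
print(has_minimum_instances(arrived, 4))  # True
print(has_minimum_instances(arrived, 5))  # False
```

With min="4", processing could start in the first case even though two instances are still missing.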
Optional feeds
• Dataset B is optional; Oozie will start processing as soon as A is available. The action will include the data from A and whatever is available from B.
<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>
Priority Among Dataset Instances
A has precedence over B, and B has precedence over C.
<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
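One way to picture the precedence rule is as a scan over the datasets in declaration order. The sketch below is an illustrative model under that assumption, not Oozie's code; `pick_by_priority` and the readiness dictionary are hypothetical names.

```python
from typing import Dict, List, Optional


def pick_by_priority(order: List[str], ready: Dict[str, bool]) -> Optional[str]:
    """Return the first dataset in `order` that is ready, or None if none is.

    `order` mirrors the order of the data-in elements inside the or block.
    """
    for name in order:
        if ready.get(name, False):
            return name
    return None


# A is missing, so B wins over C even though both B and C are ready.
print(pick_by_priority(["A", "B", "C"], {"A": False, "B": True, "C": True}))  # B
```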
Wait for primary
Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time.
<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
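The fallback decision can be sketched as a small function. This is a simplified model, not Oozie's implementation: it assumes the wait value is expressed in minutes (the slide does not state the unit), and the function and parameter names are hypothetical.

```python
from typing import Optional


def choose_source(primary_ready: bool, secondary_ready: bool,
                  minutes_waited: int, wait_minutes: int) -> Optional[str]:
    """Prefer the primary; fall back to the secondary only after the wait."""
    if primary_ready:
        return "primary"
    if secondary_ready and minutes_waited >= wait_minutes:
        return "secondary"
    return None  # keep waiting


# The secondary is ready, but the wait on the primary has not expired yet.
print(choose_source(False, True, minutes_waited=30, wait_minutes=120))   # None
# After the wait expires, the secondary is used.
print(choose_source(False, True, minutes_waited=120, wait_minutes=120))  # secondary
```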
Combining Dataset From Multiple Providers
The combine function first checks instances from A, then goes to B for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
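The per-instance fallback can be modelled as follows. This is a sketch of the idea only, with instances represented by their coord:current offsets; the function name and the choice to return None while any instance is missing from both providers are assumptions.

```python
from typing import Dict, Iterable, Optional, Set


def combine_instances(expected: Iterable[int], from_a: Set[int],
                      from_b: Set[int]) -> Optional[Dict[int, str]]:
    """Map each expected instance to the provider that supplies it.

    A is checked first; B fills in only what A is missing. Returns None
    while some instance is available from neither provider.
    """
    resolved: Dict[int, str] = {}
    for instance in expected:
        if instance in from_a:
            resolved[instance] = "A"
        elif instance in from_b:
            resolved[instance] = "B"
        else:
            return None
    return resolved


# Offsets -5..-1, matching coord:current(-5) through coord:current(-1).
print(combine_instances(range(-5, 0), {-5, -4, -2}, {-5, -3, -1}))
# {-5: 'A', -4: 'A', -3: 'B', -2: 'A', -1: 'B'}
```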
Monitoring
• Configure to receive notifications
  • Email action
  • HTTP notifications for job status change
  • Email notification for SLA misses
  • JMS notification for SLA events
• By polling
  • CLI/REST API monitoring
    • Single job monitoring
    • Bulk monitoring for bundles and coordinators
    • SLA monitoring
Monitoring
• An email action can be added to a workflow to send mail
• Job status change notification for coordinator actions
  • oozie.coord.action.notification.url
  • oozie.coord.action.notification.proxy
• Job status change notification for workflows
  • oozie.wf.workflow.notification.url
  • oozie.wf.workflow.notification.proxy
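On the receiving side, a status-change callback can be decoded with the standard library. The sketch below assumes the workflow notification URL was configured with a hypothetical template such as http://monitor.example.com/oozie-callback?jobId=$jobId&status=$status, relying on the $jobId and $status placeholders described in the Oozie notification documentation; confirm them for your Oozie version. The host, path, and function name are made up for illustration.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit


def parse_notification(request_url: str) -> Dict[str, str]:
    """Extract the query parameters of a notification callback URL."""
    query = parse_qs(urlsplit(request_url).query)
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in query.items()}


# Hypothetical callback as it might arrive at the monitoring endpoint.
callback = ("http://monitor.example.com/oozie-callback"
            "?jobId=0000001-160601000000000-oozie-W&status=RUNNING")
print(parse_notification(callback))
```

A real receiver would wrap this in an HTTP handler and forward the decoded status to whatever alerting system is in use.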
Job Monitoring - polling
• Supported for both CLI and web service
• Single job monitoring
• Bulk job monitoring
  • Multiple parameters, such as
    • Bundle name, bundle id, username, startcreatedtime, endcreatedtime
  • Multiple job statuses, such as
    • oozie jobs -bulk 'bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED'
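A polling script usually assembles the bulk filter string from its own settings. The helper below is a minimal sketch that only builds the semicolon-separated filter shown above; the function name is hypothetical and nothing here calls Oozie.

```python
from typing import Iterable


def build_bulk_filter(bundle: str, action_statuses: Iterable[str]) -> str:
    """Build a filter such as bundle=NAME;actionstatus=S1;actionstatus=S2."""
    parts = [f"bundle={bundle}"]
    parts.extend(f"actionstatus={status}" for status in action_statuses)
    return ";".join(parts)


print(build_bulk_filter("bundle-app-1", ["RUNNING", "FAILED"]))
# bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED
```

When the result is passed to the CLI from a shell, it should be quoted so the semicolons are not interpreted as command separators.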
SLA Monitoring
• Oozie can actively track SLAs on jobs'
  • Start time, end time, duration
• Access/filter SLA info via
  • Web-console dashboard
  • REST API
  • JMS messages
  • Email alert
SLA dashboard – tabular view
SLA dashboard – graph view
Monitoring Limitations
• User view
• BCP SLA support
• No color coding
• Paging/on-call
• Threshold
• Consolidated email
• Multi-grid view
Data pipeline monitoring use case from Y!
Case-1
• Set up a cron job which periodically pulls SLA information from Oozie
• If there is any SLA miss, a notification is sent to the internal monitoring system, which
  › Pages and sends a mobile alert to the on-call person
  › Sends an email alert
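The filtering step of such a cron job might look like the sketch below. It works on SLA records that have already been fetched; the record layout (id, appName, nominalTime, slaStatus) and the MISS status value are assumptions about the data, and both function names are hypothetical. Fetching from Oozie and delivering the alert are left out.

```python
from typing import Dict, List


def find_sla_misses(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the records whose SLA status is a miss."""
    return [record for record in records if record.get("slaStatus") == "MISS"]


def format_alert(record: Dict[str, str]) -> str:
    """Render one SLA miss as a single alert line."""
    return (f"SLA MISS: {record['appName']} ({record['id']}) "
            f"nominal time {record['nominalTime']}")


# Hypothetical records, as a polling job might hold them after fetching.
records = [
    {"id": "job-1@12", "appName": "audience-hourly",
     "nominalTime": "2016-06-01T10:00Z", "slaStatus": "MET"},
    {"id": "job-2@7", "appName": "billing-daily",
     "nominalTime": "2016-06-01T00:00Z", "slaStatus": "MISS"},
]
for miss in find_sla_misses(records):
    print(format_alert(miss))
```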
Case-2
• Divided into four sections
  • SLA details
  • Error jobs
  • Long-running jobs
  • Running jobs
SLA information
SLA-status
Long Waiting jobs
Long Waiting jobs – missing dependencies
Error Jobs
Running job details
Job explorer
Feeds - jobs
Validation job
• Data pipelines also periodically run validation jobs to validate their output
• Different pipelines have different validation requirements. One example of a validation job is validating the number of click impressions against billing details.
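A count-reconciliation check of this kind could be sketched as below. The relative tolerance is an assumption added for illustration (the slides do not say how strict the comparison is), and the function and parameter names are hypothetical.

```python
def counts_match(click_impressions: int, billed_impressions: int,
                 tolerance: float = 0.0) -> bool:
    """Compare two counts, allowing a relative difference up to `tolerance`."""
    if billed_impressions == 0:
        return click_impressions == 0
    difference = abs(click_impressions - billed_impressions)
    return difference / billed_impressions <= tolerance


print(counts_match(1000, 1000))                 # True
print(counts_match(990, 1000, tolerance=0.02))  # True
print(counts_match(900, 1000, tolerance=0.02))  # False
```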
Alert
Reprocessing
• One of the biggest requirements of a pipeline is the ability to reprocess the whole dependent DAG.
• Oozie does not track data dependencies between coordinator jobs
• This makes it very difficult to rerun the whole pipeline for a particular nominal time.
Reprocessing
• To work around this Oozie limitation, the pipeline owners built a job dependency DAG.
• It is very similar to the job explorer -> feed lookup feature.
  • The job explorer -> feed lookup is based on the output produced by coordinator jobs.
  • The job dependency DAG is based on the input to jobs.
• Currently there is no UI for this; they parse Oozie jobs daily and store the dependencies in a text file.
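A dependency DAG of this kind can be derived by matching each job's inputs to the job that produces them. The sketch below is not the tool described in the talk; it is a minimal model with hypothetical job and dataset names, and it lists the failed job plus everything downstream in breadth-first order.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set


def build_downstream(jobs: Dict[str, Dict[str, List[str]]]) -> Dict[str, Set[str]]:
    """Map each job to the jobs that consume one of its output datasets."""
    producers = {}
    for name, spec in jobs.items():
        for dataset in spec["outputs"]:
            producers[dataset] = name
    downstream: Dict[str, Set[str]] = defaultdict(set)
    for name, spec in jobs.items():
        for dataset in spec["inputs"]:
            producer = producers.get(dataset)
            if producer is not None and producer != name:
                downstream[producer].add(name)
    return downstream


def jobs_to_rerun(jobs: Dict[str, Dict[str, List[str]]], failed: str) -> List[str]:
    """Return the failed job followed by all jobs that depend on it."""
    downstream = build_downstream(jobs)
    order, seen, queue = [], {failed}, deque([failed])
    while queue:
        job = queue.popleft()
        order.append(job)
        for consumer in sorted(downstream.get(job, ())):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return order


# Hypothetical pipeline: ingest feeds sessions and billing, which feed report.
jobs = {
    "ingest": {"inputs": ["raw_logs"], "outputs": ["clean_logs"]},
    "sessions": {"inputs": ["clean_logs"], "outputs": ["sessions"]},
    "billing": {"inputs": ["clean_logs"], "outputs": ["billing"]},
    "report": {"inputs": ["sessions", "billing"], "outputs": ["report"]},
}
print(jobs_to_rerun(jobs, "ingest"))  # ['ingest', 'billing', 'sessions', 'report']
```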
Reprocessing
• Rerun the failed action and all dependent coordinator jobs.
  • Easy to do
  • Cons
    – Difficult to monitor
• Create a new coordinator for the time range that failed
  • Easy to monitor
Consolidated SLA Monitoring
Future Work
• Oozie unit testing framework
  • No unit tests now; jobs are tested directly by running them in staging
• Coordinator dependency management
  • Better reprocessing
• Aperiodic and incremental processing
  • Currently managed through workarounds
Oozie BOF at Ballroom B
THANK YOU
Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team
