SMART VIDEO ADVERTISING
Processing Complex Workflows
in Advertising Using Hadoop
June 3rd, 2014
Who we are
Rahul Ravindran Bernardo de Seabra
Data Team Data Team
rahul@brightroll.com bernardo@brightroll.com
@bseabra
Agenda
• Introduction to BrightRoll
• Data Consumer Requirements
• Motivation
• Design
– Streaming log data into HDFS
– Anatomy of an event
– Event de-duplication
– Configuration driven processing
– Auditing
• Future
Introduction: BrightRoll
• Largest Online Video Advertisement Platform
• BrightRoll builds technology that improves
and automates video advertising globally
• Reaching 53.9% of US audience, 168MM
unique viewers
• 3+ Billion video ads / month
• 20+ Billion events processed / day
Data Consumer Requirements
• Processing results
– Campaign delivery
– Analytics
– Telemetry
• Consumers of processed data
– Delivery algorithms to augment decision behavior
– Campaign managers to monitor/tweak campaigns
– Billing system
– Forecasting/planning tools
– Business Analysts: long/short term analysis
Motivation – legacy data pipeline
• Not linearly scalable
• Unit of processing was single campaign
• Not HA
• Lots of moving parts, no centralized control
and monitoring
• Failure recovery was time consuming
Motivation – legacy data pipeline
• Lots of boilerplate code
– hard to onboard new data/computations
• Interval-based processing
– 2-hour sliding window
– Inherent delay
– Inefficient use of resources
• All data must be retrieved prior to processing
Performance requirements
• Low end-to-end latency for delivery of aggregated metrics
– Feedback loop into delivery algorithm
– Campaign managers can react faster to their
campaign performance
• Linearly scalable
Design decisions
• Streaming model
– Data is continuously being written
– Process data once
– Checkpoint states
– Low end-to-end latency (5 mins)
• Idempotent
– Jobs can fail, tasks can fail; idempotence allows safe re-runs (see the sketch after this list)
• Configuration driven join semantics
– Ease of on-boarding new data/computations
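Idempotence here follows from deterministic, keyed writes: re-running a failed job or task re-issues the same puts. A minimal sketch, assuming the HBase 0.94-era client API; the column family "d" and qualifier "payload" are illustrative assumptions, not the actual schema:

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

/** Minimal sketch: writes keyed deterministically per event, so re-runs are harmless. */
public class IdempotentWrite {
  static void writeEvent(HTable table, byte[] rowKey, byte[] payload)
      throws java.io.IOException {
    Put put = new Put(rowKey); // rowKey is derived deterministically from the event
    put.add(Bytes.toBytes("d"), Bytes.toBytes("payload"), payload);
    table.put(put); // replaying the same event rewrites the same cell, not a new row
  }
}
```

Because a retried task produces byte-identical puts to identical rowkeys, partial failures never leave divergent state; the dedup stage handles the remaining case of logically duplicate events.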
Overview: Data Processing Pipeline
[Diagram: Data Producers → Flume NG → HDFS → De-duplicate (M/R) → Process (HBase) → Store → Data Warehouse]
Stream log data into HDFS using Flume
[Diagram: multiple Adserv machines write logs through Flume into HDFS as files File.1234 … File.1239, with a marker pointing at the current position in the stream]
• Flume rolls files every 2 minutes
• Files are lexicographically ordered
• Treat the files written by Flume as a stream
• Maintain a marker which points to the current location in the input stream
• Enables us to always process new data
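A minimal sketch of that marker over Flume's lexicographically ordered files; class and method names are hypothetical, and a real pipeline would persist the marker durably (e.g. in ZooKeeper or HDFS) rather than in memory:

```java
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Treats lexicographically ordered Flume files as a stream with a resumable marker. */
public class LogStreamMarker {
  private String marker; // name of the last fully processed file, e.g. "File.1235"

  public LogStreamMarker(String initialMarker) {
    this.marker = initialMarker;
  }

  /** Returns files newer than the marker, in arrival order, and advances the marker. */
  public FileStatus[] pollNewFiles(FileSystem fs, Path logDir) throws IOException {
    FileStatus[] all = fs.listStatus(logDir);
    // Lexicographic order matches arrival order because Flume names files monotonically.
    Arrays.sort(all, (a, b) -> a.getPath().getName().compareTo(b.getPath().getName()));
    FileStatus[] fresh = Arrays.stream(all)
        .filter(f -> f.getPath().getName().compareTo(marker) > 0)
        .toArray(FileStatus[]::new);
    if (fresh.length > 0) {
      marker = fresh[fresh.length - 1].getPath().getName(); // checkpoint durably in practice
    }
    return fresh;
  }
}
```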
Anatomy of an event
• Event header: event ID, event timestamp, event type, machine ID
• Event payload
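A minimal sketch of an event with the header fields named on this slide. Field types and the UUID choice are assumptions; the notes say only that the event ID is globally unique and generated at the point the event is logged:

```java
import java.util.UUID;

/** Sketch of an event: a fixed header plus an opaque payload. */
public class Event {
  final String eventId;   // globally unique, generated when the event is logged
  final long timestampMs; // event timestamp
  final String eventType; // e.g. auction, impression
  final String machineId; // adserver that produced the event
  final byte[] payload;   // opaque event payload

  Event(String eventType, String machineId, byte[] payload) {
    this.eventId = UUID.randomUUID().toString(); // one way to get a globally unique ID
    this.timestampMs = System.currentTimeMillis();
    this.eventType = eventType;
    this.machineId = machineId;
    this.payload = payload;
  }
}
```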
De-duplication
[Diagram: raw logs are loaded into an HBase table]
• We load raw logs into an HBase table
• We use the HBase table as a stream
• We keep track of a time-based marker per table which represents a point in time up to which we have processed data
HBase table
[Diagram: a window over the table bounded by a start time and an end time]
• The next run will read data which was inserted from start time to end time (the window of TO_BE_PROCESSED data)
• Rowkey is <salt, event timestamp, event id>
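Per the editor's notes, the salt is a single byte derived from the hash of the event ID, placed first to spread writes across regions and avoid hotspotting, while the timestamp keeps keys range-scannable. A minimal sketch of such a rowkey; exact field widths and encodings are assumptions:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch of the <salt, event timestamp, event id> rowkey layout. */
public final class RowKeys {
  /** One-byte salt from the event ID hash, distributing load across all regions. */
  static byte salt(String eventId) {
    return (byte) (eventId.hashCode() & 0xFF);
  }

  static byte[] rowKey(String eventId, long eventTimestampMs) {
    byte[] id = eventId.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(1 + Long.BYTES + id.length)
        .put(salt(eventId))        // salt prefix prevents hotspotting
        .putLong(eventTimestampMs) // big-endian long sorts in time order per salt
        .put(id)                   // random unique event ID makes the key unique
        .array();
  }
}
```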
• Break up the data in WINDOW_TO_BE_PROCESSED into chunks
• Each chunk has the same salt and contiguous event timestamps
• Each chunk is sorted – an artifact of HBase storage

Salt  time  id
4     1234  foobar1   (Chunk 1)
4     1234  foobar2   (Chunk 1)
4     1235  foobar3   (Chunk 1)
6     1234  foobar4   (Chunk 2)
7     1235  foobar5   (Chunk 3)
7     1236  foobar6   (Chunk 3)
Historical scan (without time range, multi-versions)
• A new Scan object, bounded by the chunk's StartRow and EndRow, gives the historical view
• Perform de-duplication of the data in the chunk based on the historical view

Key             Event payload
4,1234,foobar1  …   <- StartRow
4,1234,foobar2  …
4,1235,foobar3  …   <- EndRow
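Per the notes for this slide, the chunk's StartRow and EndRow bound a new Scan with no time constraint, so the query is limited to the chunk's keyspace while still seeing all historical versions. A minimal sketch with the 0.94-era HBase client API:

```java
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

/** Sketch of the per-chunk historical scan used for de-duplication. */
public class ChunkDedup {
  /** Scans only the chunk's keyspace, with all versions visible and no time range. */
  static ResultScanner historicalScan(HTable table, byte[] startRow, byte[] endRow)
      throws java.io.IOException {
    Scan scan = new Scan(startRow, endRow); // constrain the keyspace to the chunk
    scan.setMaxVersions();                  // multi-versions: older duplicates show up
    // No setTimeRange(...): the scan deliberately looks across all historical data.
    return table.getScanner(scan);
  }
}
```

Any event in the chunk that also appears in the historical view outside the current processing window is a duplicate and can be dropped.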
De-duplication performance
• High dedup throughput – 1.2+ million events per second
• Dedup across 4 days of historical data
[Diagram: along the time axis, recent data is read with a TimeRange scan and older data with a StartRow/EndRow scan; a compaction co-processor compacts files older than the table start time]
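The diagram contrasts two scan styles, and the editor's notes explain the trade-off: recent data spread across many HFiles favors a time-range scan (whole files outside the range are skipped), while old, compacted data favors a rowkey-range scan. A minimal sketch of the two with the 0.94-era client API:

```java
import org.apache.hadoop.hbase.client.Scan;

/** Sketch contrasting the two scan styles on this slide. */
public class ScanStyles {
  /** Recent data: a time-range scan prunes HFiles wholly outside the range. */
  static Scan timeRangeScan(long startTimeMs, long endTimeMs)
      throws java.io.IOException {
    Scan scan = new Scan();
    scan.setTimeRange(startTimeMs, endTimeMs); // uses HFile time metadata to skip files
    return scan;
  }

  /** Older, compacted data: constrain the keyspace instead of the time range. */
  static Scan rowRangeScan(byte[] startRow, byte[] endRow) {
    return new Scan(startRow, endRow); // the rowkey indexes into the large old HFiles
  }
}
```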
Processing – Joins
[Diagram: Impression + Auction → Computation]
Arbitrary joins
• Uses a mechanism very similar to the de-duplication previously described
• The historical scan now also checks for the other events specified in the join
• Business-level de-duplication – duplicate impressions for the same auction are removed here as well
• “Session debugging”
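The editor's notes say joins are configuration-driven: the events to be joined and the fields to join on live in a config file, and the last arriving event type of a join triggers the computation. A purely hypothetical sketch of what such a definition might look like in code; all names are assumptions, not BrightRoll's actual config format:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical in-memory form of a configuration-driven join definition. */
public class JoinSpec {
  final List<String> eventTypes; // events participating in the join; last one triggers it
  final List<String> joinFields; // fields the events are joined on
  final String computation;      // computation to run once the join completes

  JoinSpec(List<String> eventTypes, List<String> joinFields, String computation) {
    this.eventTypes = eventTypes;
    this.joinFields = joinFields;
    this.computation = computation;
  }

  /** Example: joining an impression to its auction to compute campaign delivery. */
  static JoinSpec example() {
    return new JoinSpec(
        Arrays.asList("auction", "impression"),
        Arrays.asList("auction_id"),
        "campaign_delivery");
  }
}
```

On-boarding a new computation then means adding an entry like this to the config file rather than writing new pipeline code.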
Auditing
[Diagram: Adserv machines send metadata (machine ID, number of events, time interval) to an Auditor, which compares it against the deduped output files (Deduped.1, Deduped.2, Deduped.3); mismatches trigger a replay of the files from disk]
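Per the editor's notes, the Auditor compares per-machine, per-interval counts reported by the adservers against what survived dedup, and forces a disk replay on mismatch; replayed duplicates are removed by the dedup stage. A minimal sketch under those assumptions, with hypothetical types and names:

```java
import java.util.Map;

/** Sketch of the audit check over per-machine, per-interval event counts. */
public class Auditor {
  static void audit(Map<String, Long> reportedCounts,  // from adserver metadata
                    Map<String, Long> dedupedCounts,   // observed in deduped stream
                    ReplayService replay) {
    for (Map.Entry<String, Long> e : reportedCounts.entrySet()) {
      long seen = dedupedCounts.getOrDefault(e.getKey(), 0L);
      if (seen != e.getValue()) {
        // Replayed lines may duplicate earlier ones; the dedup stage removes them,
        // so replaying is always safe.
        replay.replayFromDisk(e.getKey());
      }
    }
  }

  interface ReplayService {
    void replayFromDisk(String machineAndInterval);
  }
}
```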
What we have now
• Everything we have talked about, plus a system which
– Scales linearly
– Is HA within our data center
– Is HA across data centers (by switching traffic)
– Allows us to on-board new computations easily
– Provides guarantees on consumption of data in the pipeline
Future
• Move to HBase 0.98/1.x
• Further improvements to the de-duplication algorithm
• Dynamic definition of join semantics
• HDFS Federation
Questions
Editor's Notes
• #2: Good afternoon everyone. Thanks for joining us for this talk, Processing Complex Workflows in Advertising Using Hadoop.
• #3: My name is Bernardo de Seabra, this is Rahul Ravindran, and we are part of the Data Team at BrightRoll. Our team is responsible for all Big Data related things in the company, including the most recent project we undertook to rebuild the data processing pipeline that powers many of the critical components of the BrightRoll technology stack. That data processing pipeline will be the focus of this talk.
• #4: In order to give the audience some more context, we'll take a minute to explain what BrightRoll does for those of you who are not familiar with the company. We will then cover the requirements of all the different consumers of data throughout the platform, the motivation to develop a new data processing pipeline, and the design decisions made to respond to those requirements.
• #9: Smaller chance of underdelivery or overdelivery, which costs us money.
• #13: Bernardo to cover up to this slide. All files up to File.1235 have been processed. The arrows between files represent time: each older file is followed by the next file.
• #14: A globally unique event ID is generated for each event at the point when the event is logged.
  • #15: We have a requirement to consume all logs. We have a separate audit mechanism to verify if all logs were consumed by the pipeline. We automatically replay log lines if we find missing log lines. On replay, we may have duplicates which need to be de-duped.
• #16: Historical perspective: we began with a naïve dedup algorithm where we would look up each event ID to check whether it already existed; if it did, the event was a duplicate, otherwise we would emit it. This was too slow: the large number of random lookups was expensive, and each lookup went over the entire keyspace. We needed a mechanism to constrain the keyspace and perform a range query, but with event IDs being random this was hard. Hence we needed the event timestamp at the beginning of the rowkey; since that would result in hotspotting, we added a one-byte salt, generated from the hash of the event ID, as the prefix to distribute load across all the regions.
• #18: The StartRow and EndRow of each chunk are used to construct a new Scan object with no constraints on time. This scan constrains the keyspace in the query using startRow and endRow.
• #19: As the number of HFiles increases, timerange scans benefit, since HFiles wholly outside the time range are ignored. However, as the number of HFiles increases, the historical scan gets slower, because all the HFiles need to be scanned. In the opposite scenario, with one giant HFile (say, after a major compaction run), the timerange scan has to scan the entire file, which is slow. So we use a co-processor, which lets us use the number of HFiles as a coarse index on time for recent data (where we do a timerange scan), while older, large HFiles provide an index on the rowkey.
• #20: Allows arbitrary event joins. The events to be joined, along with the fields to be used, are defined in a configuration file, which makes adding new computations easy. All financial computations are expressed via config; currently, 24 different computations exist. On-boarding a new computation is a change to the config file. Each computation is an entry in a different HBase table.
• #21: Also allows us to perform joins across events generated over arbitrary, possibly long time windows (currently 2 hours), since mobile clients frequently cache the auction results and show an ad later (as much as 2 hours after auction time); hence an impression can be generated 2 hours after its auction. This does not require us to compare against all the old data, whereas the old pipeline required us to load all the data for 2 hours to perform the join. The last event type which is part of the join triggers a computation. Since we have a view into the joined data, other engineering teams can query it for better debugging at large scale, and arbitrary joins across event types let engineering deal with new event types.
• #22: The Auditor processes the deduped stream and then uses it to compare against the metadata it has received from the adserving machines. If they do not match, we force a replay of the files from the adserv box; the replayed data gets deduped, thereby removing all the duplicates and ensuring that all the data makes it through to the processing pipeline.
  • #23: If we can provide something about how this has impacted business