Hourglass: a Library for Incremental Processing on
Hadoop
IEEE BigData 2013
October 9th
Matthew Hayes
©2013 LinkedIn Corporation. All Rights Reserved.
Matthew Hayes
Staff Software Engineer
www.linkedin.com/in/matthewterencehayes/
• 3+ Years on Applied Data Team at LinkedIn
• Skills
• Endorsements
• DataFu
• White Elephant
Agenda
• Motivation
• Design
• Experiments
• Q&A
Motivation
Event Collection in an Online System
• Online websites typically have instrumented services that collect events
• Events are stored in an offline system (such as Hadoop) for later analysis
• Using events, we can build dashboards with metrics such as:
– # of page views over last month
– # of active users over last month
• Metrics derived from events can also be useful in recommendation pipelines
– e.g. impression discounting
Event Storage
• Events can be categorized into topics, for example:
– page view
– user login
– ad impression/click
• Store events by topic and by day:
– /data/page_view/daily/2013/10/08
– /data/page_view/daily/2013/10/09
– ...
– /data/ad_click/daily/2013/10/08
• Now can perform computation over specific time windows
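The by-topic, by-day layout above makes selecting a time window a matter of listing paths. A minimal sketch, assuming the `/data/<topic>/daily/YYYY/MM/DD` pattern shown on the slide; the helper name `daily_paths` and its parameters are illustrative and not part of Hourglass:

```python
from datetime import date, timedelta

def daily_paths(topic, end_day, num_days, root="/data"):
    """Return the daily partition paths for the num_days ending at end_day (inclusive)."""
    # Walk backwards from end_day, then emit the days oldest-first.
    days = [end_day - timedelta(days=i) for i in range(num_days - 1, -1, -1)]
    return [f"{root}/{topic}/daily/{d:%Y/%m/%d}" for d in days]

paths = daily_paths("page_view", date(2013, 10, 9), 2)
print(paths)
# ['/data/page_view/daily/2013/10/08', '/data/page_view/daily/2013/10/09']
```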
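Computation Over Time Windows
• In practice, many of our computations over time windows use either:
– a fixed-start window: the start day is fixed and the window grows by one day with each run
– a fixed-length window: the window covers the last n days and slides forward one day with each run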
Recognizing Inefficiencies
• But, typically jobs compute these daily
• From one day to next, input changes little
• Fixed-start window includes one new day:
Recognizing Inefficiencies
• Fixed-length window includes one new day, minus oldest day
Recognizing Inefficiencies
• Repeatedly processing same input data
• This wastes cluster resources
• Better to process new data only
• How can we do better?
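To make the redundancy concrete, the sketch below compares the days read by two consecutive daily runs. The 30-day length, the dates, and the helper names are assumptions chosen for illustration:

```python
from datetime import date, timedelta

def fixed_start_window(start, today):
    """All days from a fixed start day through today (inclusive)."""
    return {start + timedelta(days=i) for i in range((today - start).days + 1)}

def fixed_length_window(today, n):
    """The last n days ending at today (inclusive)."""
    return {today - timedelta(days=i) for i in range(n)}

today = date(2013, 10, 9)
yesterday = today - timedelta(days=1)

# Fixed-length: consecutive runs share all but one day.
prev, curr = fixed_length_window(yesterday, 30), fixed_length_window(today, 30)
shared = len(prev & curr)        # days that both runs read
new_days = sorted(curr - prev)   # the one day that entered the window
old_days = sorted(prev - curr)   # the one day that left the window
print(shared, new_days, old_days)

# Fixed-start: today's input is yesterday's input plus one new day.
start = date(2013, 9, 1)
fs_prev, fs_curr = fixed_start_window(start, yesterday), fixed_start_window(start, today)
print(fs_curr - fs_prev)
```

With a 30-day window, 29 of the 30 input days are the same from one run to the next, which is the repeated work the following slides aim to remove.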
Hourglass Design
Design Goals
• Address use cases:
– Fixed-start and fixed-length window computations
– Daily partitioned data
• Reduce resource usage
• Reduce wall clock time
• Run on standard Hadoop
Improving Fixed-Start Computations
• Suppose we must compute page view counts per member
• The job consumes all days of available input, producing one output.
• We call this a partition-collapsing job.
• But, if the job runs tomorrow it has to reprocess the same data.
Improving Fixed-Start Computations
• Solution: Merge new data with previous output
• We can do this because counting is an arithmetic operation: adding the new day's counts to the previous output gives the same result as recounting every day
• Hourglass provides a partition-collapsing job that supports output reuse.
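The merge step can be illustrated in memory with per-member counts. This is a sketch of the idea only, not the Hourglass API; the function names and the sample events are made up for the example:

```python
from collections import Counter

def count_page_views(events):
    """Count page views per member; each event is just a member ID here."""
    return Counter(events)

def merge(previous_output, new_day_counts):
    """Fold one new day of counts into the previous output."""
    return previous_output + new_day_counts

day1 = ["alice", "bob", "alice"]
day2 = ["bob", "carol"]
day3 = ["alice"]  # the only new input for today's run

previous = count_page_views(day1 + day2)               # yesterday's output
incremental = merge(previous, count_page_views(day3))  # reads only day3
full = count_page_views(day1 + day2 + day3)            # non-incremental baseline

print(incremental == full)
```

The incremental run touches one day of events plus the previous output, while the baseline rereads every day.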
Partition-Collapsing Job Architecture (Fixed-Start)
• When applied to a fixed-start window computation:
Improving Fixed-Length Computations
• For a fixed-length job, can reuse output using a similar trick:
– Add new day to previous output
– Subtract old day from result
• We can subtract the old day because counting is arithmetic: subtraction undoes the earlier addition
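The add-then-subtract trick can be sketched the same way. The helper `slide_window` and the sample counts are assumptions for illustration, with a two-day window standing in for the n-day case:

```python
from collections import Counter

def slide_window(previous_output, new_day, old_day):
    """Advance a fixed-length window by one day: add the new day, subtract the oldest."""
    result = Counter(previous_output)
    result.update(new_day)     # add the new day's counts
    result.subtract(old_day)   # remove the counts of the day leaving the window
    # Drop members whose count fell to zero.
    return Counter({member: n for member, n in result.items() if n > 0})

days = [
    Counter(alice=2, bob=1),   # oldest day, leaves the window
    Counter(bob=1, carol=1),
    Counter(alice=1),          # newest day, enters the window
]

previous = days[0] + days[1]                        # yesterday's 2-day output
incremental = slide_window(previous, days[2], days[0])
full = days[1] + days[2]                            # recomputed from scratch

print(incremental == full)
```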
Partition-Collapsing Job Architecture (Fixed-Length)
• When applied to a fixed-length window computation:
Improving Fixed-Length Computations
• But, for some operations, cannot subtract old data
– example: max(), min()
• Cannot reuse previous output, so how do we reduce computation?
• Solution: partition-preserving job
• Partitioned input data, partitioned output data
• Essentially: aggregate the data in advance
• Aggregating in advance can be useful even when you can reuse output
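For an operation such as max(), the idea of aggregating in advance can be sketched as two steps: reduce each day once to a small per-day aggregate, then combine only the aggregates for the days in the window. The function names, the in-memory `store`, and the sample values are illustrative assumptions, not Hourglass code:

```python
def aggregate_day(events):
    """Partition-preserving step: reduce one day's (member, value) events to a per-member max."""
    out = {}
    for member, value in events:
        out[member] = max(value, out.get(member, value))
    return out

def collapse(daily_aggregates):
    """Combine the per-day maxima for the days in the window."""
    out = {}
    for day in daily_aggregates:
        for member, value in day.items():
            out[member] = max(value, out.get(member, value))
    return out

raw = {
    "2013/10/07": [("alice", 5), ("alice", 9), ("bob", 2)],
    "2013/10/08": [("alice", 3), ("bob", 7)],
    "2013/10/09": [("bob", 1), ("carol", 4)],
}

# Each day is aggregated once and kept; only a new day needs aggregating later.
store = {day: aggregate_day(events) for day, events in raw.items()}

window = ["2013/10/08", "2013/10/09"]
result = collapse(store[day] for day in window)
print(result)
```

When 2013/10/07 leaves the window, alice's maximum drops from 9 to 3. No subtraction could recover that from the previous output, but the saved per-day aggregates make the recomputation cheap.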
Partition-Preserving Job Architecture
MapReduce in Hourglass
• MapReduce is a fairly general programming model
• Hourglass requires:
– reduce() must output a (key, value) pair
– reduce() must produce at most one value per key
– reduce() is implemented by an accumulator
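One way to read these constraints is that the reducer is a fold: values for a key are fed to an accumulator one at a time, and the accumulator yields at most one final value. The class and function below are hypothetical names chosen for this sketch and do not reproduce the actual Hourglass interfaces:

```python
class CountAccumulator:
    """Hypothetical accumulator that sums the values seen for one key."""

    def __init__(self):
        self.total = 0
        self.seen = False

    def accumulate(self, value):
        self.total += value
        self.seen = True

    def get_final(self):
        # At most one value: None signals that nothing should be emitted.
        return self.total if self.seen else None


def run_reduce(key, values, accumulator_factory):
    """Drive an accumulator over all values for a key and emit one (key, value) pair or nothing."""
    acc = accumulator_factory()
    for value in values:
        acc.accumulate(value)
    final = acc.get_final()
    return None if final is None else (key, final)


print(run_reduce("alice", [1, 1, 1], CountAccumulator))
print(run_reduce("bob", [], CountAccumulator))
```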
Building Blocks
• Two types of jobs:
– Partition-preserving: consume partitioned input data, produce partitioned output data
– Partition-collapsing: consume partitioned input data, produce single output
• Must provide to jobs:
– Input and output paths
– Desired time range
• Must implement:
– map()
– accumulate()
• May implement if necessary:
– merge()
– unmerge()
Experiments
Metrics for Evaluation
• Wall clock time
– Amount of time that elapses until job completes
• Total task time
– Sum of execution times for all tasks
– Represents usage of cluster resources
• Compare each against baseline non-incremental job
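The two metrics can be computed from per-task start and end times. A small sketch, assuming each task is a `(start, end)` pair in seconds; the task list and the numbers passed to `reduction` are made up for illustration and are not measurements from the talk:

```python
def wall_clock_time(tasks):
    """Elapsed time from the first task start to the last task end."""
    return max(end for _, end in tasks) - min(start for start, _ in tasks)

def total_task_time(tasks):
    """Sum of the execution times of all tasks (a proxy for cluster usage)."""
    return sum(end - start for start, end in tasks)

def reduction(baseline, incremental):
    """Percentage reduction of the incremental job relative to the baseline."""
    return 100.0 * (baseline - incremental) / baseline

# Made-up (start, end) times for four tasks, some running in parallel.
tasks = [(0, 40), (0, 55), (10, 60), (60, 90)]

print(wall_clock_time(tasks))   # 90
print(total_task_time(tasks))   # 175
print(reduction(200, 50))       # 75.0
```

Because tasks overlap, total task time can exceed wall clock time, which is why a job can cut cluster usage sharply while finishing only somewhat sooner.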
Experiment: Page Views per Member
• Goal: Count page views per member over last n days
• Chain partition-preserving and partition-collapsing
• Can reuse previous output:
Experiment: Page Views per Member
Member Count Estimation
• Goal: Estimate number of members visiting site over past n days
• Use HyperLogLog cardinality estimation (space vs. accuracy)
• Can't reuse output, but with partition-preserving can save state:
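The saved state here can be thought of as one HyperLogLog sketch per day. Sketches merge by taking the register-wise maximum, so a window estimate comes from the saved daily sketches without rereading events. Below is a compact textbook HyperLogLog written for illustration; the register count, the SHA-1 hash, and the sample member IDs are assumptions, and this is not the implementation used in the experiment:

```python
import hashlib
import math

P = 12          # number of index bits
M = 1 << P      # 4096 registers

def new_sketch():
    return [0] * M

def add(sketch, item):
    # 64-bit hash taken from the first 8 bytes of SHA-1.
    h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
    idx = h >> (64 - P)                        # first P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)           # remaining 52 bits
    rank = (64 - P) - rest.bit_length() + 1    # position of the leftmost 1-bit
    sketch[idx] = max(sketch[idx], rank)

def merge(a, b):
    """Register-wise max: the sketch of the union of the two inputs."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)         # small-range correction
    return raw

# Made-up member IDs; the two days overlap on members 4000..5999.
daily_members = {
    "2013/10/08": range(0, 6000),
    "2013/10/09": range(4000, 10000),
}

# Partition-preserving step: one saved sketch per day.
daily_sketches = {}
for day, members in daily_members.items():
    sketch = new_sketch()
    for member in members:
        add(sketch, member)
    daily_sketches[day] = sketch

# Window estimate from saved state only.
window = merge(daily_sketches["2013/10/08"], daily_sketches["2013/10/09"])
print(round(estimate(window)))
```

The estimate should land near the 10,000 distinct members across both days. There is no inverse of the register-wise max, which is why a day cannot be subtracted from a previous window's sketch.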
Member Count Estimation: Results
Conclusion
• Computations over sliding windows are quite common
• Implementations are typically inefficient
• Incrementalizing Hadoop jobs can in some cases yield:
– 95-98% reductions in total task time
– 20-40% reductions in wall clock time
Learning More
• datafu.org
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 
Quantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur MorganQuantum Computing Quick Research Guide by Arthur Morgan
Quantum Computing Quick Research Guide by Arthur Morgan
Arthur Morgan
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
AI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global TrendsAI and Data Privacy in 2025: Global Trends
AI and Data Privacy in 2025: Global Trends
InData Labs
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
How analogue intelligence complements AI
How analogue intelligence complements AIHow analogue intelligence complements AI
How analogue intelligence complements AI
Paul Rowe
 
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In FranceManifest Pre-Seed Update | A Humanoid OEM Deeptech In France
Manifest Pre-Seed Update | A Humanoid OEM Deeptech In France
chb3
 
Drupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy ConsumptionDrupalcamp Finland – Measuring Front-end Energy Consumption
Drupalcamp Finland – Measuring Front-end Energy Consumption
Exove
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven InsightsAndrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell: Transforming Business Strategy Through Data-Driven Insights
Andrew Marnell
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptxDevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
DevOpsDays Atlanta 2025 - Building 10x Development Organizations.pptx
Justin Reock
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
Semantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AISemantic Cultivators : The Critical Future Role to Enable AI
Semantic Cultivators : The Critical Future Role to Enable AI
artmondano
 
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-UmgebungenHCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
HCL Nomad Web – Best Practices und Verwaltung von Multiuser-Umgebungen
panagenda
 
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul Shares 5 Steps to Implement AI Agents for Maximum Business Efficien...
Noah Loul
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
#StandardsGoals for 2025: Standards & certification roundup - Tech Forum 2025
BookNet Canada
 
Mobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi ArabiaMobile App Development Company in Saudi Arabia
Mobile App Development Company in Saudi Arabia
Steve Jonas
 
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
AI Changes Everything – Talk at Cardiff Metropolitan University, 29th April 2...
Alan Dix
 

Hourglass: a Library for Incremental Processing on Hadoop

  • 1. Hourglass: a Library for Incremental Processing on Hadoop
      – IEEE BigData 2013, October 9th
      – Matthew Hayes
      – ©2013 LinkedIn Corporation. All Rights Reserved.
  • 2. Matthew Hayes
      – Staff Software Engineer
      – www.linkedin.com/in/matthewterencehayes/
      – 3+ Years on Applied Data Team at LinkedIn
      – Skills
      – Endorsements
      – DataFu
      – White Elephant
  • 3. Agenda
      – Motivation
      – Design
      – Experiments
      – Q&A
  • 5. Event Collection in an Online System
      – Typically online websites have instrumented services that collect events
      – Events stored in an offline system (such as Hadoop) for later analysis
      – Using events, can build dashboards with metrics such as:
          – # of page views over last month
          – # of active users over last month
      – Metrics derived from events can also be useful in recommendation pipelines
          – e.g. impression discounting
  • 6. Event Storage
      – Events can be categorized into topics, for example:
          – page view
          – user login
          – ad impression/click
      – Store events by topic and by day:
          – /data/page_view/daily/2013/10/08
          – /data/page_view/daily/2013/10/09
          – ...
          – /data/ad_click/daily/2013/10/08
      – Now can perform computation over specific time windows
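The daily layout above makes selecting a time window a matter of listing paths. A minimal sketch, assuming the `/data/<topic>/daily/YYYY/MM/DD` layout shown on the slide; the helper name `daily_paths` and its parameters are hypothetical and not part of Hourglass:

```python
from datetime import date, timedelta


def daily_paths(topic, start, end, root="/data"):
    """Return one input path per day in the inclusive window [start, end].

    Hypothetical helper: it only mirrors the directory layout shown on the slide.
    """
    num_days = (end - start).days + 1
    paths = []
    for offset in range(num_days):
        day = start + timedelta(days=offset)
        paths.append(f"{root}/{topic}/daily/{day:%Y/%m/%d}")
    return paths


paths = daily_paths("page_view", date(2013, 10, 8), date(2013, 10, 9))
for path in paths:
    print(path)
```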
  • 7. Computation Over Time Windows
      – In practice, many of our computations over time windows use either a fixed-start window or a fixed-length window
  • 8. Recognizing Inefficiencies
      – But, typically jobs compute these daily
      – From one day to next, input changes little
      – Fixed-start window includes one new day
  • 9. Recognizing Inefficiencies
      – Fixed-length window includes one new day, minus oldest day
  • 10. Recognizing Inefficiencies
      – Repeatedly processing same input data
      – This wastes cluster resources
      – Better to process new data only
      – How can we do better?
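To put a number on the redundancy, the sketch below measures how much of today's input a daily job already read yesterday. The 30-day window length and the January 1 start date are assumptions chosen for illustration, and the function names are hypothetical:

```python
from datetime import date, timedelta


def fixed_length_window(end, num_days):
    """Days covered by a window of num_days days ending on `end` (inclusive)."""
    return {end - timedelta(days=i) for i in range(num_days)}


def fixed_start_window(start, end):
    """Days covered by a window from `start` through `end` (inclusive)."""
    return {start + timedelta(days=i) for i in range((end - start).days + 1)}


def reprocessed_fraction(todays_window, yesterdays_window):
    """Fraction of today's input days that yesterday's run already consumed."""
    return len(todays_window & yesterdays_window) / len(todays_window)


today = date(2013, 10, 9)
yesterday = today - timedelta(days=1)

# Fixed-length: 29 of the 30 days were already processed yesterday.
frac_fixed_length = reprocessed_fraction(
    fixed_length_window(today, 30), fixed_length_window(yesterday, 30)
)

# Fixed-start: every day except the newest one was already processed.
start = date(2013, 1, 1)
frac_fixed_start = reprocessed_fraction(
    fixed_start_window(start, today), fixed_start_window(start, yesterday)
)
```

In both cases nearly all of the input is shared between consecutive runs, which is the waste the slides point at.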
  • 11. Hourglass Design
  • 12. Design Goals
      – Address use cases:
          – Fixed-start and fixed-length window computations
          – Daily partitioned data
      – Reduce resource usage
      – Reduce wall clock time
      – Run on standard Hadoop
  • 13. Improving Fixed-Start Computations
      – Suppose we must compute page view counts per member
      – The job consumes all days of available input, producing one output.
      – We call this a partition-collapsing job.
      – But, if the job runs tomorrow it has to reprocess the same data.
  • 14. Improving Fixed-Start Computations
      – Solution: Merge new data with previous output
      – We can do this because this is an arithmetic operation
      – Hourglass provides a partition-collapsing job that supports output reuse.
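The output-reuse idea for a fixed-start window can be sketched in a few lines. This is plain Python for illustration only, not the Hourglass API; the event tuples and function names are assumptions:

```python
from collections import Counter


def count_page_views(events):
    """Count page views per member for a batch of (member_id, page) events."""
    return Counter(member_id for member_id, _page in events)


def merge(previous_output, new_day_output):
    """Fold one new day's counts into the previous output.

    This works because per-member counts are sums, and sums can be extended.
    """
    return previous_output + new_day_output


day_1 = [("alice", "/home"), ("bob", "/jobs"), ("alice", "/inbox")]
day_2 = [("alice", "/home"), ("carol", "/home")]

output_day_1 = count_page_views(day_1)
# On the second run only day_2 is read; day_1 is never reprocessed.
output_day_2 = merge(output_day_1, count_page_views(day_2))
```

The merged result matches what a full recomputation over both days would produce, but the second run touches only the new day.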
  • 15. Partition-Collapsing Job Architecture (Fixed-Start)
      – When applied to a fixed-start window computation
  • 16. Improving Fixed-Length Computations
      – For a fixed-length job, can reuse output using a similar trick:
          – Add new day to previous output
          – Subtract old day from result
      – We can subtract the old day since this is arithmetic
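The add-then-subtract trick for a fixed-length window can be sketched as follows. It assumes per-day counts are available as `Counter` objects; `slide_window` is a hypothetical name, not an Hourglass function:

```python
from collections import Counter


def slide_window(previous_output, new_day, old_day):
    """Advance a fixed-length window by one day.

    Adds the counts of the day entering the window and subtracts the counts
    of the day leaving it. Only valid for invertible aggregates such as sums.
    """
    result = Counter(previous_output)
    result.update(new_day)    # add new day to previous output
    result.subtract(old_day)  # subtract old day from result
    return +result            # unary plus drops members whose count fell to zero


day_1 = Counter({"alice": 2, "bob": 1})
day_2 = Counter({"alice": 1, "carol": 1})
day_3 = Counter({"bob": 1})

window_days_1_2 = day_1 + day_2
window_days_2_3 = slide_window(window_days_1_2, new_day=day_3, old_day=day_1)
```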
  • 17. Partition-Collapsing Job Architecture (Fixed-Length)
      – When applied to a fixed-length window computation
  • 18. Improving Fixed-Length Computations
      – But, for some operations, cannot subtract old data
          – example: max(), min()
      – Cannot reuse previous output, so how do we reduce computation?
      – Solution: partition-preserving job
      – Partitioned input data, partitioned output data
      – Essentially: aggregate the data in advance
      – Aggregating in advance can be useful even when you can reuse output
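A small sketch of why pre-aggregation helps for max(): each day is reduced once to a per-member maximum, and every later window reads only those small per-day outputs. The event values (think of a per-event score) and all names here are hypothetical:

```python
from datetime import date


def preaggregate_day(events):
    """Partition-preserving step: reduce one day of (member, value) events
    to a single per-member maximum for that day."""
    daily_max = {}
    for member, value in events:
        daily_max[member] = max(value, daily_max.get(member, value))
    return daily_max


def collapse_window(daily_outputs, window_days):
    """Partition-collapsing step: combine the stored per-day maxima for the
    days in the window. Raw events are not read again."""
    window_max = {}
    for day in window_days:
        for member, value in daily_outputs[day].items():
            window_max[member] = max(value, window_max.get(member, value))
    return window_max


raw_events = {
    date(2013, 10, 7): [("alice", 9), ("bob", 2)],
    date(2013, 10, 8): [("alice", 4), ("alice", 6)],
    date(2013, 10, 9): [("bob", 5)],
}
# Each day is aggregated exactly once and its output is kept.
daily_outputs = {day: preaggregate_day(ev) for day, ev in raw_events.items()}

window_1 = collapse_window(daily_outputs, [date(2013, 10, 7), date(2013, 10, 8)])
window_2 = collapse_window(daily_outputs, [date(2013, 10, 8), date(2013, 10, 9)])
```

When October 7 leaves the window, alice's maximum drops from 9 to 6. That value cannot be derived from the previous window output alone, which is why the per-day outputs are kept.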
  • 19. Partition-Preserving Job Architecture
  • 20. MapReduce in Hourglass
      – MapReduce is a fairly general programming model
      – Hourglass requires:
          – reduce() must output (key,value) pair
          – reduce() must produce at most one value
          – reduce() implemented by an accumulator
  • 21. Building Blocks
      – Two types of jobs:
          – Partition-preserving: consume partitioned input data, produce partitioned output data
          – Partition-collapsing: consume partitioned input data, produce single output
      – Must provide to jobs:
          – Inputs and output paths
          – Desired time range
      – Must implement:
          – map()
          – accumulate()
      – May implement if necessary:
          – merge()
          – unmerge()
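How the four callbacks might fit together can be sketched in plain Python. This is not the Hourglass API (Hourglass is a Java library on Hadoop); the class, the function names, and the record shape are all assumptions made for illustration:

```python
class PageViewCountJob:
    """Hypothetical job definition mirroring the callbacks named on the slide."""

    def map(self, record):
        # Emit (key, value) pairs for one input record.
        yield record["member_id"], 1

    def accumulate(self, accumulated, value):
        # Fold one mapped value into the running value for a key.
        return value if accumulated is None else accumulated + value

    def merge(self, previous, new):
        # Optional: combine previous output with a new day's output.
        return previous + new

    def unmerge(self, current, old):
        # Optional: remove an old day's contribution from the output.
        return current - old


def run_day(job, records):
    """Aggregate one day of records into at most one value per key."""
    output = {}
    for record in records:
        for key, value in job.map(record):
            output[key] = job.accumulate(output.get(key), value)
    return output


def advance(job, previous, new_day=None, old_day=None):
    """Reuse previous output: merge in a new day, optionally unmerge an old one.

    Keys whose value returns to zero are kept; a real job may want to drop them.
    """
    result = dict(previous)
    for key, value in (new_day or {}).items():
        result[key] = job.merge(result[key], value) if key in result else value
    for key, value in (old_day or {}).items():
        result[key] = job.unmerge(result[key], value)
    return result


job = PageViewCountJob()
d1 = run_day(job, [{"member_id": "a"}, {"member_id": "a"}, {"member_id": "b"}])
d2 = run_day(job, [{"member_id": "a"}])
d3 = run_day(job, [{"member_id": "b"}, {"member_id": "c"}])

fixed_start = advance(job, d1, new_day=d2)                        # days 1-2
fixed_length = advance(job, fixed_start, new_day=d3, old_day=d1)  # days 2-3
```

A fixed-start computation needs only `merge()`, while a fixed-length one that reuses output also needs `unmerge()`, which matches the slide listing both as optional.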
  • 22. Experiments
  • 23. Metrics for Evaluation
      – Wall clock time
          – Amount of time that elapses until job completes
      – Total task time
          – Sum of execution times for all tasks
          – Represents usage of cluster resources
      – Compare each against baseline non-incremental job
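The two metrics can be computed from per-task start and end times. The task intervals below are made-up illustrative values, not measurements from the talk, and the function names are hypothetical:

```python
def wall_clock_time(tasks):
    """Elapsed time from the first task start to the last task end."""
    return max(end for _start, end in tasks) - min(start for start, _end in tasks)


def total_task_time(tasks):
    """Sum of execution times over all tasks (a proxy for cluster usage)."""
    return sum(end - start for start, end in tasks)


def reduction(baseline, incremental):
    """Relative reduction of a metric versus the non-incremental baseline."""
    return 1 - incremental / baseline


# (start_second, end_second) per task; illustrative values only.
baseline_tasks = [(0, 100), (0, 120), (10, 90)]
incremental_tasks = [(0, 30), (5, 25)]

wall_reduction = reduction(
    wall_clock_time(baseline_tasks), wall_clock_time(incremental_tasks)
)
task_reduction = reduction(
    total_task_time(baseline_tasks), total_task_time(incremental_tasks)
)
```

Because tasks run in parallel, total task time can fall much more than wall clock time does.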
  • 24. Experiment: Page Views per Member
      – Goal: Count page views per member over last n days
      – Chain partition-preserving and partition-collapsing
      – Can reuse previous output
  • 25. Experiment: Page Views per Member
  • 26. Member Count Estimation
      – Goal: Estimate number of members visiting site over past n days
      – Use HyperLogLog cardinality estimation (space vs. accuracy)
      – Can't reuse output, but with partition-preserving can save state
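The "save state" idea can be illustrated with a simplified HyperLogLog: one sketch is stored per day, and a window estimate merges the stored sketches by taking the register-wise maximum. This is a teaching sketch under stated assumptions (1024 registers, SHA-1 as the hash, no large-range correction), not the implementation used in the experiments:

```python
import hashlib
import math

P = 10      # number of index bits (assumed for this sketch)
M = 1 << P  # 1024 registers


def _hash64(item):
    """64-bit hash derived from SHA-1 (chosen only for determinism)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def add(sketch, item):
    h = _hash64(item)
    index = h >> (64 - P)                    # first P bits pick a register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1-bit
    sketch[index] = max(sketch[index], rank)


def sketch_of(items):
    sketch = [0] * M
    for item in items:
        add(sketch, item)
    return sketch


def merge(a, b):
    """Union of two sketches: register-wise maximum. There is no inverse,
    so an old day cannot be subtracted from a merged sketch."""
    return [max(x, y) for x, y in zip(a, b)]


def estimate(sketch):
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # small-range (linear counting) correction
    return raw


# Per-day state, computed once per day and stored.
day_1 = sketch_of(range(0, 6000))
day_2 = sketch_of(range(4000, 10000))

# Window estimate from saved state; raw events are not read again.
window = merge(day_1, day_2)
approx_members = estimate(window)  # true distinct count is 10000
```

With 1024 registers the expected relative error is roughly 1.04 / sqrt(1024), about 3%. Sliding the window means dropping the oldest stored sketch and merging the remaining ones.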
  • 27. Member Count Estimation: Results
  • 28. Conclusion
      – Computations over sliding windows are quite common
      – Implementations are typically inefficient
      – Incrementalizing Hadoop jobs can in some cases yield:
          – 95-98% reductions in total task time
          – 20-40% reductions in wall clock time
  • 29. Learning More
      – datafu.org