LAMBDA ARCHITECTURES IN PRACTICE
KAFKA · HADOOP · STORM · DRUID
GIAN MERLINO
DRUID COMMITTER · SOFTWARE ENGINEER @ METAMARKETS
Watch the video with slide synchronization on InfoQ.com!
http://www.infoq.com/presentations/lambda-arch-case-study
Presented at QCon San Francisco
www.qconsf.com
OVERVIEW
‣ PROBLEM: STREAMING DATA PIPELINES
‣ CHALLENGES: NO DATA LEFT BEHIND
‣ INFRASTRUCTURE: THE “RAD”-STACK
‣ PLATFORM: DEVELOPMENT AND OPERATIONS
‣ TOOLS
THE PROBLEM
‣ Business intelligence for ad-tech
‣ Arbitrary and interactive exploration
‣ Multi-tenancy: thousands of concurrent users
‣ Recency: explore current data, alert on major changes
‣ Efficiency: each event is individually very low-value
‣ Data model: must join impressions with clicks
FINDING A SOLUTION
‣ Load all your data into Hadoop. Query it. Done!
‣ Good job guys, let’s go home
HADOOP BENEFITS AND DRAWBACKS
‣ HDFS is a scalable, reliable storage technology
‣ MapReduce is great for data-parallel computation at scale
‣ But Hadoop MapReduce is not optimized for low latency
‣ To optimize queries, we need a query layer
‣ To load data quickly, we need streaming ingestion
FINDING A SOLUTION
[Diagram: Event Streams, Streaming data pipeline, Hadoop, Query Layer, Insight]
‣ Query layer options shown in turn: RDBMS, NoSQL K/V stores, commercial databases
DRUID
‣ Druid project started in 2011, open sourced in Oct. 2012
‣ Designed for low-latency ingestion and slice-and-dice aggregation
‣ Growing community
  • ~45 contributors
  • Used in production at numerous large and small organizations
‣ Cluster vitals
  ‣ 10+ trillion events, 200 TB of compressed queryable data
  ‣ Ingesting over 450,000 events/sec on average
  ‣ 90th/95th/99th percentile queries within 1s/2s/10s
STREAMING DATA PIPELINES
TIME SERIES DATA
‣ Unifying feature: some notion of “event timestamp”
‣ Questions are typically also time-oriented
  ‣ Monitoring: Plot CPU usage over the past 3 days, in 5-min buckets
  ‣ BI: Which accounts brought in the most revenue this week?
  ‣ Web analytics: How many unique users today? By OS? By page?
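The monitoring question above (usage over time in 5-minute buckets) comes down to truncating each event timestamp to the start of its bucket and aggregating per bucket. A minimal sketch, assuming timezone-aware timestamps; the function names and the `(timestamp, cpu_percent)` sample shape are illustrative and not taken from the talk:

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
BUCKET = timedelta(minutes=5)


def bucket_start(ts, width=BUCKET):
    """Truncate an aware timestamp down to the start of its bucket."""
    return ts - ((ts - EPOCH) % width)


def average_by_bucket(samples, width=BUCKET):
    """samples: iterable of (event_timestamp, cpu_percent) pairs."""
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    for ts, value in samples:
        acc = sums[bucket_start(ts, width)]
        acc[0] += value
        acc[1] += 1
    return {start: total / count for start, (total, count) in sorted(sums.items())}
```

The same truncation works for hourly or daily questions by passing a different `width`.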
GOALS
‣ Low-latency results
‣ Strong guarantees for historical data
DATA PIPELINE
‣ Data bus
  ‣ Decouples data acquisition from processing
  ‣ Can buffer as many unprocessed messages as you have disk
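The decoupling described above can be pictured as an append-only log in which every consumer keeps its own read position, so a slow or stopped consumer only grows its own backlog. A toy in-memory sketch; the `DataBus` class and its methods are hypothetical and are not the API of Kafka or of any real client library:

```python
class DataBus:
    """Toy append-only log; each consumer tracks its own read offset."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # consumer name -> index of next unread message

    def publish(self, message):
        # Producers only append; they never wait for consumers.
        self._log.append(message)

    def poll(self, consumer, max_messages=10):
        # Return the next unread messages for this consumer and advance its offset.
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch

    def backlog(self, consumer):
        # Messages buffered but not yet read by this consumer.
        return len(self._log) - self._offsets.get(consumer, 0)
```

In this sketch a streaming consumer and a batch consumer can read the same messages at different speeds without affecting each other.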
DATA PIPELINE
‣ Stream processor
  ‣ Join impressions/clicks
  ‣ Transform data
DATA PIPELINE
‣ Store hour-partitioned in S3
‣ Can run Hadoop jobs on it
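Hour partitioning means a batch job can find its input from the time interval alone. A minimal sketch of one possible key layout; the `prefix/topic/YYYY/MM/DD/HH/` scheme and both function names are assumptions for illustration, not the layout used in the talk:

```python
from datetime import datetime, timedelta, timezone


def hour_partition_key(prefix, topic, event_time):
    """Build the (hypothetical) object-store prefix for the hour containing event_time."""
    hour = event_time.astimezone(timezone.utc)
    return f"{prefix}/{topic}/{hour.strftime('%Y/%m/%d/%H')}/"


def partitions_for_interval(prefix, topic, start, end):
    """List the partition prefixes for every hour overlapping [start, end)."""
    cursor = start.replace(minute=0, second=0, microsecond=0)
    keys = []
    while cursor < end:
        keys.append(hour_partition_key(prefix, topic, cursor))
        cursor += timedelta(hours=1)
    return keys
```

A Hadoop job over a given interval would then read exactly the prefixes returned by `partitions_for_interval`.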
λ
DEFINITION
‣ Hybrid batch/streaming data pipeline
‣ Batch technologies
• Hadoop MapReduce
• Spark
‣ Streaming technologies
• Storm
• Spark Streaming
• Samza
WHY HYBRID?
‣ This sounds insane
‣ Need to develop for both systems
‣ Need to operate both systems
‣ Nobody really wants to do this
WHY HYBRID?
‣ We want low-latency results
‣ We also want strong guarantees for historical data
‣ Many popular streaming systems are still immature
‣ The state of things will improve in the future
‣ …but that doesn’t help you right now
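One way to picture how the two goals combine: the streaming side supplies fast, provisional results, and the batch side later replaces them with authoritative results for each interval it has processed. A minimal sketch under that assumption; the function name and the per-hour dictionaries are illustrative and do not describe how any particular query layer implements this:

```python
def merged_view(batch_counts, realtime_counts):
    """Prefer batch results for any interval the batch layer has already covered."""
    view = dict(realtime_counts)  # start from the low-latency results
    view.update(batch_counts)     # batch output supersedes streaming output
    return view


# Streaming has approximate counts for two hours; batch has finished only the first.
realtime = {"2014-11-03T10": 98, "2014-11-03T11": 41}
batch = {"2014-11-03T10": 100}
current = merged_view(batch, realtime)
```

Here the 10:00 hour reports the batch figure, while the 11:00 hour still reports the streaming figure until batch catches up.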
FAULTS
[Diagram sequence: three Processors and a Data Store; in one frame a Processor is replaced by ":("]
COPING WITH FAULTS
‣ “Exactly once” semantics
  ‣ Transactions
  ‣ Idempotency
    ‣ Adding an element to a set
    ‣ Some kinds of sketches (HyperLogLog)
    ‣ Doesn’t work well for counters
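The difference between set insertion and counting shows up as soon as a message is delivered twice after a fault. A minimal sketch, assuming each delivery is an `(event_id, user)` pair; the event IDs are a hypothetical addition used to show how a counter can be recast as a set:

```python
def unique_users(deliveries):
    """Idempotent: adding the same user to a set twice changes nothing."""
    users = set()
    for _event_id, user in deliveries:
        users.add(user)
    return len(users)


def naive_count(deliveries):
    """Not idempotent: every redelivery inflates the counter."""
    return sum(1 for _ in deliveries)


def deduplicated_count(deliveries):
    """Counting made idempotent by remembering which event IDs were seen."""
    seen = set()
    for event_id, _user in deliveries:
        seen.add(event_id)
    return len(seen)


events = [(1, "u1"), (2, "u2"), (3, "u1")]
redelivered = events + [events[0]]  # event 1 is replayed after a failure
```

Tracking every event ID costs memory proportional to the number of events, which is one reason plain counters are awkward to make idempotent at scale.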
LATE DATA
‣ Timeline-oriented operations are common in stream processing
‣ Windowed aggregates
  ‣ top pages per hour
  ‣ unique users per hour
  ‣ request counts per minute
‣ Group-by-key of related events
  ‣ user session analysis
  ‣ impression/click association
LATE DATA
‣ Similar challenges with both
‣ When can we be sure we have all the data?
  ‣ Normally, within a minute
  ‣ Or if a server is slow, a few minutes…
  ‣ Or if a server is down, a few hours or days…
‣ We don’t want to compromise between data quality and latency
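The trade-off above can be made concrete with a streaming aggregator that stops accepting events for an hour some fixed period after that hour ends, next to a batch aggregator that sees everything. A simplified sketch; the `window_period` parameter is a stand-in for the general idea of a lateness cutoff, and its exact semantics here are an assumption:

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def hour_of(ts):
    """Start of the hour containing ts."""
    return ts.replace(minute=0, second=0, microsecond=0)


def streaming_counts(arrivals, window_period):
    """arrivals: (event_time, received_time) pairs.

    An event is counted only if it is received no later than the end of
    its hour plus window_period; anything later is dropped.
    """
    counts, dropped = Counter(), []
    for event_time, received_time in arrivals:
        deadline = hour_of(event_time) + timedelta(hours=1) + window_period
        if received_time <= deadline:
            counts[hour_of(event_time)] += 1
        else:
            dropped.append((event_time, received_time))
    return counts, dropped


def batch_counts(arrivals):
    """Batch runs hours later over stored data, so late events are included."""
    return Counter(hour_of(event_time) for event_time, _received in arrivals)
```

A longer `window_period` drops fewer events but delays the point at which an hour's streaming result can be treated as closed.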
REPROCESSING
‣ Something is broken!
‣ Your data was revised
‣ Or your code had a bug
WHY HYBRID?
‣ Storm
  ‣ Transactions: globally ordered micro-batches
  ‣ Late data: depends on user code
  ‣ Reprocessing: rewind with fresh state
‣ Spark Streaming
  ‣ Transactions: internal, yes; external, no
  ‣ Late data: windows based on received time, not actual time
  ‣ Reprocessing: can use non-streaming Spark
‣ Samza
  ‣ Transactions: work in progress
  ‣ Late data: depends on user code
  ‣ Reprocessing: rewind with fresh state
‣ Druid *
  ‣ Transactions: batch, yes; streaming, no
  ‣ Late data: batch, unlimited; streaming, windowPeriod
  ‣ Reprocessing: can use batch ingestion
* Streaming data store, not actually a stream processor
WHY NOT HYBRID?
‣ Batch-only?
  ‣ If ingestion latencies are good enough, that’s great!
‣ Streaming-only?
  ‣ OK if you have transactions and a way to deal with late data
  ‣ Current tools do require you to be careful
HYBRID PIPELINE DEVELOPMENT
DEVELOPMENT
‣ Need code to run on two very different systems
‣ Maintaining two codebases is perilous
‣ Productivity loss
‣ Code drift
‣ Difficulty training new developers
PROGRAMMING MODEL
‣ “Query language,” if you prefer
‣ Write once, run… at least twice
‣ Open-source options
  ‣ Spark + Spark Streaming (if you’re in the Spark ecosystem)
  ‣ Summingbird (key/value aggregation oriented)
‣ Or develop in-house for your use cases
  ‣ This investment can make sense if you have more pipeline developers than infrastructure developers
PROGRAMMING MODEL
‣ We built “Starfire,” a Scala library for stream transformation
‣ Built around operators
‣ map, flatMap, filter
‣ groupBy, join
‣ lookup
‣ sample
‣ union
PROGRAMMING MODEL
‣ Load two data streams
‣ Join streams on shared key
‣ Produce combined records
‣ Export data
PROGRAMMING MODEL
[Diagram: Load, Load → Cogroup → FlatMap → Save]
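The load, cogroup, flatMap, save shape above can be sketched in plain Python. This is not Starfire code (its API is not shown in the slides); the record fields such as `impression_id` and the output fields `clicked` and `clicks` are hypothetical:

```python
from collections import defaultdict


def cogroup(left, right, key):
    """Group two record streams by a shared key: key -> (left records, right records)."""
    groups = defaultdict(lambda: ([], []))
    for record in left:
        groups[key(record)][0].append(record)
    for record in right:
        groups[key(record)][1].append(record)
    return groups


def join_impressions_clicks(impressions, clicks):
    """Emit one combined record per impression, annotated with its clicks."""
    grouped = cogroup(impressions, clicks, key=lambda r: r["impression_id"])
    combined = []
    for _impression_id, (imps, clks) in sorted(grouped.items()):
        for imp in imps:  # the flatMap step: zero or more outputs per group
            combined.append({**imp, "clicked": bool(clks), "clicks": len(clks)})
    return combined
```

Clicks with no matching impression produce no output in this sketch; whether to keep, drop, or hold them for later is a pipeline-specific decision.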
EXECUTION
‣ User code generates a system-agnostic computation graph
‣ Drivers optimize and compile the graph for each system
‣ Current drivers:
  ‣ Local (for unit tests)
  ‣ Hadoop (using Cascading)
  ‣ Storm
  ‣ Samza (beta)
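The split between a system-agnostic graph and per-system drivers can be illustrated with a tiny example: user code only builds nodes, and separate functions decide what to do with them. Everything here (`Node`, `source`, `run_local`, `describe`) is a hypothetical sketch of the pattern, not the library described in the talk:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Node:
    """One step in a linear computation graph; building it runs nothing."""
    op: str
    fn: Optional[Callable] = None
    parent: Optional["Node"] = None

    def map(self, fn):
        return Node("map", fn, self)

    def filter(self, fn):
        return Node("filter", fn, self)


def source():
    return Node("source")


def _steps(node):
    """Return the nodes from the source to the given sink."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return list(reversed(chain))


def run_local(node, records):
    """A 'local driver': interpret the graph in-process, e.g. for unit tests."""
    out = list(records)
    for step in _steps(node):
        if step.op == "map":
            out = [step.fn(r) for r in out]
        elif step.op == "filter":
            out = [r for r in out if step.fn(r)]
    return out


def describe(node):
    """Stand-in for another driver: render the plan instead of executing it."""
    return " -> ".join(step.op for step in _steps(node))
```

Because the user-facing code only calls `source()`, `map`, and `filter`, adding a new driver would not require changing any pipeline definitions.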
BENEFITS
‣ Hedge your bets around infrastructure
  ‣ New drivers can run all existing user code
‣ Separate system optimization from code development
  ‣ Improved grouping engine without any user code changes
‣ We have many more pipeline developers than infrastructure developers
  ‣ Developer productivity increased
  ‣ We use Starfire even for batch-only data pipelines
HYBRID PIPELINE OPERATIONS
OPERATIONS
‣ Need to operate two very different systems
‣ Must maintain “no data left behind”
‣ Cornerstones of operations
‣ Tools
‣ Metrics
‣ Alerts
TOOLS
‣ Hadoop job service
‣ Watch S3 for new data from Kafka
‣ Automatically process data after a few hours
‣ Retry failed jobs, alerts on repeated failures
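The retry-then-alert behavior in the last bullet can be sketched as a small wrapper. The function name, the `max_attempts` default, and the `alert` callback are assumptions for illustration; a real job service would also add delays between attempts and persist job state:

```python
def run_with_retries(job, max_attempts=3, alert=print):
    """Run job(); retry on failure and alert only after repeated failures."""
    last_error = None
    for _attempt in range(max_attempts):
        try:
            return job()
        except Exception as error:  # a real service would catch narrower errors
            last_error = error
    alert(f"job failed {max_attempts} times: {last_error}")
    return None
```

Alerting only after the final attempt keeps transient failures from paging anyone, while repeated failures still surface.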
TOOLS
‣ Storm job service
‣ Submit jobs to Nimbus, the Storm scheduler
‣ Monitor for failed jobs
‣ Alert on deployment failures
TOOLS
‣ Backfill tool
‣ Run backfills for particular intervals
‣ Can be used for data changes or algorithm changes
‣ Ensures alignment on Druid segment granularity
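Aligning a backfill means widening the requested interval so that it starts and ends on segment boundaries, so whole segments are rebuilt rather than partial ones. A minimal sketch, assuming hourly granularity and timezone-aware timestamps; the function name and the default granularity are illustrative choices:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def align_interval(start, end, granularity=timedelta(hours=1)):
    """Expand [start, end) outward to the nearest granularity boundaries."""
    aligned_start = start - ((start - EPOCH) % granularity)
    remainder = (end - EPOCH) % granularity
    aligned_end = end if not remainder else end + (granularity - remainder)
    return aligned_start, aligned_end
```

An interval that is already aligned is returned unchanged, so the function is safe to apply to every backfill request.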
TOOLS
‣ Configuration management tool
‣ Control which version of the code should be running
‣ Manage schemas for each pipeline
‣ Manage quotas for each pipeline
‣ Central audit logs for configuration changes
STREAM METRICS
DO TRY THIS AT HOME
CORNERSTONES
‣ Druid - druid.io - @druidio
‣ Storm - storm.incubator.apache.org - @stormprocessor
‣ Hadoop - hadoop.apache.org
‣ Kafka - kafka.apache.org - @apachekafka
GLUE
‣ storm-kafka
‣ Tranquility
‣ Camus / Secor
‣ Druid Hadoop indexer
TAKEAWAYS
‣ Lambda architectures may not be with us forever
‣ But they solve a real problem: eventually consistent data pipelines using popular open-source technologies
‣ Complexity can be managed with tools and practices
THANK YOU
@DRUIDIO
@METAMARKETS

