Rethinking the Storm 2.0 Worker
Roshan Naik
HORTONWORKS
April 2017, Storm & Kafka Meetup
Santa Clara, CA
Present : Storm 1.x
• Has matured into a stable and reliable system
• Widely deployed and holding up well in production
• Lots of new competition
• Differentiating on Features, Performance, Ease of Use etc.
Performance in 2.0
• How do we know if a streaming system is “fast”?
• Faster than another system ?
• What about Hardware potential ?
• See analysis in STORM-2284
• Dimensions
• Throughput
• Latency
• Resource utilization: CPU/Network/Memory/Disk/Power
• STORM-2284
• https://issues.apache.org/jira/browse/STORM-2284
Overview of Proposed Enhancements
• https://issues.apache.org/jira/browse/STORM-2284
Areas critical to Performance
• Messaging System
• Need Bounded Concurrent Queues that operate as fast as hardware allows
• Lock based queues not an option
• Lock free queues or preferably Wait-free queues
• Threading Model
• Fewer Threads. Less synchronization.
• Dedicated threads instead of pooled threads.
• CPU Pinning.
• Memory Model
• Lowering GC Pressure: Recycling Objects
• Reducing CPU cache faults: Controlling Object Layout (contiguous allocation)
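To make the "bounded, lock-free, preferably wait-free" queue requirement concrete, here is a minimal sketch of a bounded single-producer/single-consumer ring buffer. The class name `SpscRingBuffer` and its methods are hypothetical and written only for illustration; this is not Storm's code and not the JCTools implementation. It assumes exactly one producer thread and one consumer thread.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical bounded single-producer/single-consumer queue.
 * Illustrative only: not Storm's or JCTools' implementation.
 */
final class SpscRingBuffer<T> {
    private final Object[] slots;
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    SpscRingBuffer(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo <= 0 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        slots = new Object[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    /** Producer thread only. Returns false when the queue is full (bounded). */
    boolean offer(T item) {
        long t = tail.get();
        if (t - head.get() == slots.length) {
            return false; // full: the caller decides how to apply back-pressure
        }
        slots[(int) (t & mask)] = item;
        tail.lazySet(t + 1); // ordered store makes the slot write visible first
        return true;
    }

    /** Consumer thread only. Returns null when the queue is empty. */
    @SuppressWarnings("unchecked")
    T poll() {
        long h = head.get();
        if (h == tail.get()) {
            return null; // empty
        }
        int idx = (int) (h & mask);
        T item = (T) slots[idx];
        slots[idx] = null; // drop the reference so the object can be reclaimed
        head.lazySet(h + 1);
        return item;
    }
}
```

Neither method takes a lock or retries in a loop, so each call finishes in a bounded number of steps. Supporting multiple producers would need a different design.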
Messaging Architecture
Messaging - Current Architecture
[Diagram: RECEIVE Q -> Bolt Executor Thread (user logic) -> publish -> SEND Q -> Send Thread -> a local Executor's RECEIVE Q (local) or the Worker's Outbound Q (remote)]
• SEND Q and RECEIVE Q each contain: ArrayList (current batch), CLQ (overflow), batcher, Disruptor Q
• Each queue has its own Flusher Thread
• Arrows between stages are labelled ArrayList
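Messaging - New Architecture
[Diagram: RECEIVE Q -> Bolt Executor Thread (user logic) -> publish -> a local Executor's RECEIVE Q (local) or the Worker's Outbound Q (remote)]
• RECEIVE Q contains: ArrayList (current batch), batcher, JCTools Q
• No longer shown: SEND Q, Send Thread, Flusher Threads, CLQ overflow, Disruptor Q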
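The publish path in the new architecture can be sketched roughly as follows: the executor thread batches tuples itself and hands them straight to the destination queue, with no send thread or flusher thread in between. `BatchingPublisher`, `TupleRouter`, the `batchSize` parameter, and the boolean local/remote flag are hypothetical names chosen for this illustration, and any `java.util.Queue` is used as a stand-in where the slide shows a JCTools queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical helper: collects tuples into a batch and moves them to one destination queue. */
final class BatchingPublisher<T> {
    private final Queue<T> destination;
    private final int batchSize;
    private final List<T> currentBatch = new ArrayList<>();

    BatchingPublisher(Queue<T> destination, int batchSize) {
        this.destination = destination;
        this.batchSize = batchSize;
    }

    /** Called by the executor thread itself; there is no separate send or flusher thread. */
    void publish(T tuple) {
        currentBatch.add(tuple);
        if (currentBatch.size() >= batchSize) {
            flush();
        }
    }

    /** Moves as much of the batch as the queue accepts; returns true if nothing is left over. */
    boolean flush() {
        int sent = 0;
        while (sent < currentBatch.size() && destination.offer(currentBatch.get(sent))) {
            sent++;
        }
        currentBatch.subList(0, sent).clear(); // keep whatever a full queue rejected
        return currentBatch.isEmpty();
    }
}

/** Hypothetical router: one batch for local destinations, one for remote ones. */
final class TupleRouter<T> {
    private final BatchingPublisher<T> local;
    private final BatchingPublisher<T> remote;

    TupleRouter(Queue<T> localReceiveQ, Queue<T> workerOutboundQ, int batchSize) {
        local = new BatchingPublisher<>(localReceiveQ, batchSize);
        remote = new BatchingPublisher<>(workerOutboundQ, batchSize);
    }

    void publish(T tuple, boolean destinationIsLocal) {
        (destinationIsLocal ? local : remote).publish(tuple);
    }

    boolean flushAll() {
        boolean localDone = local.flush();
        boolean remoteDone = remote.flush();
        return localDone && remoteDone;
    }
}
```

Each publishing thread owns its batch, so the batch needs no synchronization. The destination queue must tolerate multiple producers if several executors publish to it.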
Preliminary Numbers
LATENCY
• 1 spout --> 1 bolt with 1 ACKer (all in same worker)
• v1.0.1 : 3.4 milliseconds
• v2.0 master: 7 milliseconds
• v2.0 redesigned : 60-100 microseconds (up to 116x improvement over v2.0 master)
Preliminary Numbers
THROUGHPUT
• 1 spout --> 1 bolt [w/o ACKing]
• v1.0.1 : ?
• v2.0 master: 3.3 million/sec
• v2.0 redesigned : 5 million/sec (50% improvement)
• 1 spout --> 1 bolt [with ACKing]
• v1.0 : 233 K/sec
• v2.0 master: 900 K/sec
• v2.0 redesigned : 1 million/sec (no change)
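The improvement figures quoted with the latency and throughput numbers can be re-derived from the raw values. The sketch below does only that arithmetic; the class and method names are hypothetical, and the inputs are the numbers stated on the slides.

```java
/** Hypothetical helper that re-derives the quoted improvement figures. */
final class PrelimNumbers {
    /** Ratio of baseline latency to improved latency (same unit for both). */
    static double speedup(double baseline, double improved) {
        return baseline / improved;
    }

    /** Relative throughput gain in percent. */
    static double percentGain(double baseline, double improved) {
        return (improved - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Latency: v2.0 master at 7 ms (7000 us) vs redesigned at 60-100 us.
        System.out.printf("latency speedup: %.1fx to %.1fx%n",
                speedup(7000, 100), speedup(7000, 60));       // 70.0x to 116.7x
        // Throughput without ACKing: 3.3 million/sec vs 5 million/sec.
        System.out.printf("throughput gain (no ACK): %.1f%%%n",
                percentGain(3.3e6, 5.0e6));                   // 51.5%
        // Throughput with ACKing: 900 K/sec vs 1 million/sec.
        System.out.printf("throughput gain (ACK): %.1f%%%n",
                percentGain(900e3, 1.0e6));                   // 11.1%
    }
}
```

The 116x figure corresponds to the best case (60 microseconds) measured against v2.0 master. The gain with ACKing works out to about 11%, which the slide treats as no meaningful change.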
Observations
• Latency: Dramatically improved.
• Throughput: Discovered multiple bottlenecks that prevent significantly higher throughput.
• Grouping: If the bottlenecks in LocalShuffle & FieldsGrouping are addressed, along with some others, throughput can reach ~7 million/sec.
• TupleImpl: If inefficiencies here are addressed, throughput can reach 15 million/sec.
• ACK-ing: The ACKer bolt currently maxes out at ~2.5 million ACKs/sec. This is a limitation of the implementation, not of the concept. I see room for ACKer-specific fixes that can also substantially improve its throughput.
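For context on why the ACKing concept itself is cheap: Storm's documentation describes the acker as keeping one 64-bit value per tuple tree, the XOR of every tuple id created or acked in that tree, which reaches zero when the tree is fully processed. The sketch below is a simplified, single-threaded illustration of that idea. `XorAckTracker` and its methods are hypothetical names, not the actual ACKer bolt, and it ignores timeouts, failures, and out-of-order messages.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified XOR-based tracker; not Storm's ACKer bolt. */
final class XorAckTracker {
    // One 64-bit value per tuple tree, keyed by the root (spout tuple) id.
    private final Map<Long, Long> pending = new HashMap<>();

    /** Starts tracking a tree when the spout emits its first tuple. */
    void track(long rootId, long firstTupleId) {
        pending.put(rootId, firstTupleId);
    }

    /**
     * Records that a tuple was acked and which new tuples were anchored to it.
     * Returns true when the whole tree has been processed.
     */
    boolean ack(long rootId, long ackedTupleId, long... newAnchoredTupleIds) {
        Long current = pending.get(rootId);
        if (current == null) {
            return false; // unknown or already completed tree
        }
        long value = current ^ ackedTupleId;
        for (long id : newAnchoredTupleIds) {
            value ^= id; // each id is XORed in once on creation, once on ack
        }
        if (value == 0L) {
            pending.remove(rootId);
            return true;
        }
        pending.put(rootId, value);
        return false;
    }

    int pendingTrees() {
        return pending.size();
    }
}
```

The work per ACK is a map lookup and a few XORs, which is consistent with the view that the current ceiling comes from the implementation rather than from the concept.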
New Threading & Execution Model (STORM-2307)
[Diagram: a WORKER PROCESS containing one Worker Thread and several Executor Threads]
• WORKER THD: Start/Stop/Monitor Executors, Manage Metrics, Topology Reconfig, Heartbeat
• Executor (Thd): Task (Bolt) with its own Q, counters, and grouper
• Executor (Thd): Task (Spout/Bolt) with its own Q, counters, and grouper
• Executor (Thd): System Task (Inter host Input)
• Executor (Thd): System Task (Intra host Input)
• Executor (Thd): Sys Task (Outbound Msgs) with its own Q and counters
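The executor-per-thread idea in this model can be sketched as a dedicated thread that owns its receive queue, its task, and its counters. `ExecutorThread` and its methods are hypothetical names for illustration only and do not reflect the STORM-2307 code. `ConcurrentLinkedQueue` is a stand-in for the bounded queue the slides call for, and CPU pinning is not shown because the Java standard library has no API for it.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

/** Hypothetical dedicated executor thread: one thread, one queue, one task. */
final class ExecutorThread<T> extends Thread {
    private final Queue<T> receiveQ = new ConcurrentLinkedQueue<>(); // stand-in queue
    private final Consumer<T> task;                                  // spout/bolt logic
    private final AtomicLong processed = new AtomicLong();           // per-executor counter
    private volatile boolean running = true;

    ExecutorThread(String name, Consumer<T> task) {
        super(name);
        this.task = task;
    }

    /** Called by other executors to hand a tuple to this one. */
    boolean publish(T tuple) {
        return receiveQ.offer(tuple);
    }

    long processedCount() {
        return processed.get();
    }

    @Override
    public void run() {
        // Keep draining until shutdown is requested and the queue is empty.
        while (running || !receiveQ.isEmpty()) {
            T tuple = receiveQ.poll();
            if (tuple == null) {
                Thread.yield(); // nothing to do; a real system would pick a wait strategy
                continue;
            }
            task.accept(tuple);
            processed.incrementAndGet();
        }
    }

    /** Called by the managing (worker) thread: stop after draining, then wait. */
    void shutdown() {
        running = false;
        try {
            join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A worker thread would create and start one such executor per task, read `processedCount()` for metrics, and call `shutdown()` when stopping or reconfiguring the topology.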
Questions
• References
• https://issues.apache.org/jira/browse/STORM-2284
