SlideShare a Scribd company logo
@apachepinot | @KishoreBytes
Apache Pinot Case Study
Building distributed analytics systems
using Apache Kafka
@apachepinot | @KishoreBytes
@apachepinot | @KishoreBytes
Pinot @LinkedIn
@apachepinot | @KishoreBytes
70+
Products
Pinot @ LinkedIn
User Facing Analytics
120k+
queries/sec
ms - 1s
latency
@apachepinot | @KishoreBytes
Pinot @ LinkedIn
Business Metrics Analytics
10k+
Metrics
50k+
Dimensions
@apachepinot | @KishoreBytes
Pinot @ LinkedIn
ThirdEye: Anomaly detection and root cause analysis
50+
Teams
100K
Time Series
@apachepinot | @KishoreBytes
Apache Pinot @
Other Companies
2.7k
Github StarsSlack UsersCompanies
400+20+
Community has tripled in the last two quarters
Join our growing community on the Apache Pinot Slack Channel
https://communityinviter.com/apps/apache-pinot/apache-pinot
@apachepinot | @KishoreBytes
User Facing
Applications
Business Facing
Metrics
Anomaly Detection
Time Series
Multiple Use Cases:
One Platform
Kafka
70+
10k
100k
120k
Queries/secEvents/sec
1M+
@apachepinot | @KishoreBytes
Challenges of User facing real-time analytics
Velocity of
ingestion
High
Dimensionality
1000s of QPS
Milliseconds
Latency
Seconds
Freshness
Highly
Available Scalable
Cost
Effective
User-facing
real-time
analytics
system
@apachepinot | @KishoreBytes
Pinot Real-time Ingestion
Deep Dive
@apachepinot | @KishoreBytes
Pinot Architecture
Servers
Brokers
Queries
Scatter Gather
● Servers - Consuming,
indexing, serving
● Brokers - Scatter gather
@apachepinot | @KishoreBytes
Server 1
Deep Store
Pinot Realtime Ingestion Basics
● Kafka Consumer on Pinot Server
● Periodically create “Pinot segment”
● Persist to deep store
● In memory data - queryable
● Continue consumption
@apachepinot | @KishoreBytes
Kafka Consumer Groups
Approach 1
@apachepinot | @KishoreBytes
Kafka Consumer Group based design
● Each consumer consumes
from 1 or more partitions
Server 2Server 1
time
3 partitions
Consumer Group
Kafka
Consumer
Kafka
Consumer
● Periodic checkpointing
● Kafka Rebalancer
Server1 starts
consuming from
0 and 2
Checkpoint 350
Checkpoint 400
seg1 seg2
Kafka
Rebalancer
● Fault tolerant consumption
@apachepinot | @KishoreBytes
Challenges with Capacity Expansion
Server 2S1
Add Server3
Partition 2 moves
to Server 3
Server3 begins consumption from 400time
Server 3
Duplicate Data!
3 partitions
Kafka
Consumer
Kafka
Consumer
Consumer Group
Kafka
Consumer
Checkpoint 350
Checkpoint 400
seg1 seg2
Kafka
Rebalancer
Server1 starts
consuming from
0 and 2
@apachepinot | @KishoreBytes
Deep store
Multiple Consumer Groups
Consumer Group 1
Consumer Group 2
3 partitions
2 replicas
● No control over partitions
assigned to consumer
● No control over checkpointing
● Segment disparity
Queries
Fault tolerant
● Storage inefficient
@apachepinot | @KishoreBytes
Operational Complexity
Queries
Consumer Group 1
Consumer Group 2
3 partitions
2 replicas
● Disable consumer group for
node failure/capacity changes
@apachepinot | @KishoreBytes
Server 4
Scalability limitation
Queries
Consumer Group 1
Consumer Group 2
3 partitions
2 replicas
● Scalability limited by #partitions
Idle
● Cost inefficient
@apachepinot | @KishoreBytes
Single node in a Consumer Group
● Eliminates incorrect results
● Reduced operational complexity
Server 1
Server 2
● Limited by capacity of 1 node
● Storage overhead
● Scalability limitation
Consumer
Group 1
Consumer
Group 2
3 partitions
2 replicas
The only deployment model that worked
@apachepinot | @KishoreBytes
Incorrect
Results
Operational
Complexity
Storage
overhead
Limited
scalability
Expensive
Multi-node
Consumer
Group
Y Y Y Y Y
Single-node
Consumer
Group
Y Y Y
Issues with
Kafka Consumer Group based solution
@apachepinot | @KishoreBytes
Problem 1
Lack of control with Kafka Rebalancer
Solution
Take control of partition assignment
@apachepinot | @KishoreBytes
Problem 2
Segment Disparity due to checkpointing mechanism
Solution
Take control of checkpointing
@apachepinot | @KishoreBytes
Partition Level Consumption
Approach 2
@apachepinot | @KishoreBytes
S1 S3
Partition Level Consumption
Controller
S23 partitions
2 replicas
Partition Server State Start
offset
End
offset
S1
S2
CONSUMING
CONSUMING 20
S3
S1
CONSUMING
CONSUMING 20
S2
S3
CONSUMING
CONSUMING 20
0
1
2
Cluster State
● Single coordinator across all
replicas
● All actions determined by
cluster state
@apachepinot | @KishoreBytes
Deep Store
S1 S3
Partition Level Consumption
Controller
S23 partitions
2 replicas
Partition Server State Start
offset
End
offset
0
S1
S2
CONSUMING
CONSUMING 20
1
S3
S1
CONSUMING
CONSUMING 20
2
S2
S3
CONSUMING
CONSUMING 20
Cluster State
Commit
80
110
110ONLINE
ONLINE
● Only 1 server persists
segment to deep store
● Only 1 copy stored
@apachepinot | @KishoreBytes
Deep Store
S1 S3
Partition Level Consumption
Controller
S23 partitions
2 replicas
Partition Server State Start
offset
End
offset
0
S1
S2 20
1
S3
S1
CONSUMING
CONSUMING 20
2
S2
S3
CONSUMING
CONSUMING 20
Cluster State
110
ONLINE
ONLINE
● All other replicas
○ Download from deep
store
● Segment equivalence
@apachepinot | @KishoreBytes
Deep Store
S1 S3
Partition Level Consumption
Controller
S23 partitions
2 replicas
Partition Server State Start
offset
End
offset
0
S1
S2
ONLINE
ONLINE
20 110
1
S3
S1
CONSUMING
CONSUMING
20
2
S2
S3
CONSUMING
CONSUMING
20
Cluster State
0
S1
S2
CONSUMING
CONSUMING
110
● New segment state created
● Start where previous segment left off
@apachepinot | @KishoreBytes
Deep Store
S1 S3
Partition Level Consumption
Controller
S23 partitions
2 replicas
Partition Server State Start
offset
End
offset
0
S1
S2
ONLINE
ONLINE
20 110
1
S3
S1
ONLINE
ONLINE
20 120
2
S2
S3
ONLINE
ONLINE
20 100
Cluster State
0
S1
S2
CONSUMING
CONSUMING
110
1
S3
S1
CONSUMING
CONSUMING
120
2
S2
S3
CONSUMING
CONSUMING
100
● Each partition independent
of others
@apachepinot | @KishoreBytes
Deep Store
S1 S3
Capacity expansion
Controller
S23 partitions
2 replicas
S4
● Consuming segment - Restart consumption
using offset in cluster state
● Pinot segment - Download from deep store
● Easy to handle changes in
replication/partitions
● No duplicates!
● Cluster state table updated
@apachepinot | @KishoreBytes
S1 S3
Node failures
Controller
S23 partitions
2 replicas
S4
● At least 1 replica still alive
● No complex operations
@apachepinot | @KishoreBytes
S1 S3
Scalability
Controller
S23 partitions
2 replicas
S4
● Easily add nodes
● Segment equivalence =
Smart segment assignment
+ Smart query routing
S6 S5
Completed
Servers
Consuming
Servers
@apachepinot | @KishoreBytes
Incorrect
Results
Operational
Complexity
Storage
overhead
Limited
scalability
Expensive
Multi-node
Consumer
Group
Y Y Y Y Y
Single-node
Consumer
Group
Y Y Y
Partition
Level
Consumers
Summary
@apachepinot | @KishoreBytes
Q&A
pinot.apache.org
@apachepinot

More Related Content

What's hot (20)

PDF
Democratizing Data Quality Through a Centralized Platform
Databricks
 
PPTX
An Introduction to Elastic Search.
Jurriaan Persyn
 
PDF
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
confluent
 
PDF
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
PPTX
Espresso Database Replication with Kafka, Tom Quiggle
confluent
 
PDF
Introducing Change Data Capture with Debezium
ChengKuan Gan
 
PPTX
Strata NY 2018: The deconstructed database
Julien Le Dem
 
PDF
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
PDF
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner
 
PDF
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
PDF
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
PDF
Apache kafka performance(latency)_benchmark_v0.3
SANG WON PARK
 
PPTX
Reshape Data Lake (as of 2020.07)
Eric Sun
 
PPTX
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Flink Forward
 
PDF
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
PDF
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
PDF
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
PDF
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
confluent
 
PDF
Owning Your Own (Data) Lake House
Data Con LA
 
PDF
Redis vs Infinispan | DevNation Tech Talk
Red Hat Developers
 
Democratizing Data Quality Through a Centralized Platform
Databricks
 
An Introduction to Elastic Search.
Jurriaan Persyn
 
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...
confluent
 
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Espresso Database Replication with Kafka, Tom Quiggle
confluent
 
Introducing Change Data Capture with Debezium
ChengKuan Gan
 
Strata NY 2018: The deconstructed database
Julien Le Dem
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
Apache kafka performance(latency)_benchmark_v0.3
SANG WON PARK
 
Reshape Data Lake (as of 2020.07)
Eric Sun
 
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
Flink Forward
 
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
confluent
 
Owning Your Own (Data) Lake House
Data Con LA
 
Redis vs Infinispan | DevNation Tech Talk
Red Hat Developers
 

Similar to Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache Kafka (Neha Pawar, Stealth Mode Startup) Kafka Summit 2020 (20)

PDF
Look how easy it is to go from events to blazing-fast analytics! | Neha Pawar...
HostedbyConfluent
 
PDF
History of Apache Pinot
Kishore Gopalakrishna
 
PDF
Streaming analytics better than batch when and why - (Big Data Tech 2017)
GetInData
 
PDF
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
HostedbyConfluent
 
PDF
Apache Kafka - Martin Podval
Martin Podval
 
PDF
Streaming analytics better than batch – when and why by Dawid Wysakowicz and ...
Big Data Spain
 
PDF
Etl, esb, mq? no! es Apache Kafka®
confluent
 
PPTX
Streaming in Practice - Putting Apache Kafka in Production
confluent
 
PDF
Apache Kafka Use Cases_ When To Use It_ When Not To Use_.pdf
Noman Shaikh
 
PDF
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
HostedbyConfluent
 
PPTX
Building big data pipelines with Kafka and Kubernetes
Venu Ryali
 
PDF
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
HostedbyConfluent
 
PDF
Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...
Red Hat Developers
 
PPTX
Multi-Datacenter Kafka - Strata San Jose 2017
Gwen (Chen) Shapira
 
PDF
Query Your Streaming Data on Kafka using SQL: Why, How, and What
HostedbyConfluent
 
PPTX
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
Hakka Labs
 
PDF
Kafka used at scale to deliver real-time notifications
Sérgio Nunes
 
PPTX
Stream data from Apache Kafka for processing with Apache Apex
Apache Apex
 
PDF
Tokyo AK Meetup Speedtest - Share.pdf
ssuser2ae721
 
PDF
Making Apache Kafka Even Faster And More Scalable
PaulBrebner2
 
Look how easy it is to go from events to blazing-fast analytics! | Neha Pawar...
HostedbyConfluent
 
History of Apache Pinot
Kishore Gopalakrishna
 
Streaming analytics better than batch when and why - (Big Data Tech 2017)
GetInData
 
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...
HostedbyConfluent
 
Apache Kafka - Martin Podval
Martin Podval
 
Streaming analytics better than batch – when and why by Dawid Wysakowicz and ...
Big Data Spain
 
Etl, esb, mq? no! es Apache Kafka®
confluent
 
Streaming in Practice - Putting Apache Kafka in Production
confluent
 
Apache Kafka Use Cases_ When To Use It_ When Not To Use_.pdf
Noman Shaikh
 
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...
HostedbyConfluent
 
Building big data pipelines with Kafka and Kubernetes
Venu Ryali
 
Building a Streaming Pipeline on Kubernetes Using Kafka Connect, KSQLDB & Apa...
HostedbyConfluent
 
Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...
Red Hat Developers
 
Multi-Datacenter Kafka - Strata San Jose 2017
Gwen (Chen) Shapira
 
Query Your Streaming Data on Kafka using SQL: Why, How, and What
HostedbyConfluent
 
DataEngConf: Apache Kafka at Rocana: a scalable, distributed log for machine ...
Hakka Labs
 
Kafka used at scale to deliver real-time notifications
Sérgio Nunes
 
Stream data from Apache Kafka for processing with Apache Apex
Apache Apex
 
Tokyo AK Meetup Speedtest - Share.pdf
ssuser2ae721
 
Making Apache Kafka Even Faster And More Scalable
PaulBrebner2
 
Ad

More from HostedbyConfluent (20)

PDF
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
HostedbyConfluent
 
PDF
Renaming a Kafka Topic | Kafka Summit London
HostedbyConfluent
 
PDF
Evolution of NRT Data Ingestion Pipeline at Trendyol
HostedbyConfluent
 
PDF
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
HostedbyConfluent
 
PDF
Exactly-once Stream Processing with Arroyo and Kafka
HostedbyConfluent
 
PDF
Fish Plays Pokemon | Kafka Summit London
HostedbyConfluent
 
PDF
Tiered Storage 101 | Kafla Summit London
HostedbyConfluent
 
PDF
Building a Self-Service Stream Processing Portal: How And Why
HostedbyConfluent
 
PDF
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
HostedbyConfluent
 
PDF
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
HostedbyConfluent
 
PDF
Navigating Private Network Connectivity Options for Kafka Clusters
HostedbyConfluent
 
PDF
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
HostedbyConfluent
 
PDF
Explaining How Real-Time GenAI Works in a Noisy Pub
HostedbyConfluent
 
PDF
TL;DR Kafka Metrics | Kafka Summit London
HostedbyConfluent
 
PDF
A Window Into Your Kafka Streams Tasks | KSL
HostedbyConfluent
 
PDF
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
HostedbyConfluent
 
PDF
Data Contracts Management: Schema Registry and Beyond
HostedbyConfluent
 
PDF
Code-First Approach: Crafting Efficient Flink Apps
HostedbyConfluent
 
PDF
Debezium vs. the World: An Overview of the CDC Ecosystem
HostedbyConfluent
 
PDF
Beyond Tiered Storage: Serverless Kafka with No Local Disks
HostedbyConfluent
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
HostedbyConfluent
 
Renaming a Kafka Topic | Kafka Summit London
HostedbyConfluent
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
HostedbyConfluent
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
HostedbyConfluent
 
Exactly-once Stream Processing with Arroyo and Kafka
HostedbyConfluent
 
Fish Plays Pokemon | Kafka Summit London
HostedbyConfluent
 
Tiered Storage 101 | Kafla Summit London
HostedbyConfluent
 
Building a Self-Service Stream Processing Portal: How And Why
HostedbyConfluent
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
HostedbyConfluent
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
HostedbyConfluent
 
Navigating Private Network Connectivity Options for Kafka Clusters
HostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
HostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
HostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
HostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
HostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
HostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
HostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
HostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
HostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
HostedbyConfluent
 
Ad

Recently uploaded (20)

PDF
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
PPTX
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
PDF
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
PPTX
Digital Circuits, important subject in CS
contactparinay1
 
PDF
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
PDF
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
PDF
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
PDF
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
PPTX
MuleSoft MCP Support (Model Context Protocol) and Use Case Demo
shyamraj55
 
PDF
AI Agents in the Cloud: The Rise of Agentic Cloud Architecture
Lilly Gracia
 
PDF
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 
PDF
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
PDF
SIZING YOUR AIR CONDITIONER---A PRACTICAL GUIDE.pdf
Muhammad Rizwan Akram
 
PDF
Staying Human in a Machine- Accelerated World
Catalin Jora
 
PDF
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
PDF
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
PDF
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
PPTX
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
PPTX
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
PPTX
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 
UiPath DevConnect 2025: Agentic Automation Community User Group Meeting
DianaGray10
 
Mastering ODC + Okta Configuration - Chennai OSUG
HathiMaryA
 
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
Digital Circuits, important subject in CS
contactparinay1
 
Newgen Beyond Frankenstein_Build vs Buy_Digital_version.pdf
darshakparmar
 
The Rise of AI and IoT in Mobile App Tech.pdf
IMG Global Infotech
 
Peak of Data & AI Encore AI-Enhanced Workflows for the Real World
Safe Software
 
“NPU IP Hardware Shaped Through Software and Use-case Analysis,” a Presentati...
Edge AI and Vision Alliance
 
MuleSoft MCP Support (Model Context Protocol) and Use Case Demo
shyamraj55
 
AI Agents in the Cloud: The Rise of Agentic Cloud Architecture
Lilly Gracia
 
“Voice Interfaces on a Budget: Building Real-time Speech Recognition on Low-c...
Edge AI and Vision Alliance
 
Transforming Utility Networks: Large-scale Data Migrations with FME
Safe Software
 
SIZING YOUR AIR CONDITIONER---A PRACTICAL GUIDE.pdf
Muhammad Rizwan Akram
 
Staying Human in a Machine- Accelerated World
Catalin Jora
 
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
NLJUG Speaker academy 2025 - first session
Bert Jan Schrijver
 
“Squinting Vision Pipelines: Detecting and Correcting Errors in Vision Models...
Edge AI and Vision Alliance
 
Agentforce World Tour Toronto '25 - MCP with MuleSoft
Alexandra N. Martinez
 
AI Penetration Testing Essentials: A Cybersecurity Guide for 2025
defencerabbit Team
 
Designing_the_Future_AI_Driven_Product_Experiences_Across_Devices.pptx
presentifyai
 

Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache Kafka (Neha Pawar, Stealth Mode Startup) Kafka Summit 2020