Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache Kafka (Neha Pawar, Stealth Mode Startup) Kafka Summit 2020

@apachepinot | @KishoreBytes
Apache Pinot Case Study
Building distributed analytics systems
using Apache Kafka

Pinot @ LinkedIn

Pinot @ LinkedIn
User Facing Analytics
● 70+ products
● 120k+ queries/sec
● ms - 1s latency

Pinot @ LinkedIn
Business Metrics Analytics
● 10k+ metrics
● 50k+ dimensions

Pinot @ LinkedIn
ThirdEye: Anomaly detection and root cause analysis
● 50+ teams
● 100K time series

Apache Pinot @ Other Companies
● 2.7k GitHub stars
● 400+ Slack users
● 20+ companies
Community has tripled in the last two quarters
Join our growing community on the Apache Pinot Slack Channel
https://communityinviter.com/apps/apache-pinot/apache-pinot

Multiple Use Cases: One Platform
● User Facing Applications: 70+
● Business Facing Metrics: 10k
● Anomaly Detection Time Series: 100k
● Queries/sec: 120k
● Events/sec (Kafka): 1M+

Challenges of User facing real-time analytics
A user-facing real-time analytics system has to deal with:
● Velocity of ingestion
● High dimensionality
● 1000s of QPS
● Milliseconds latency
● Seconds freshness
● Highly available
● Scalable
● Cost effective

Pinot Real-time Ingestion
Deep Dive

Pinot Architecture
● Servers - Consuming, indexing, serving
● Brokers - Scatter gather
[Diagram: Queries go to the Brokers, which scatter gather across the Servers]

Pinot Realtime Ingestion Basics
[Diagram: Server 1, Deep Store]
● Kafka Consumer on Pinot Server
● Periodically create “Pinot segment”
● Persist to deep store
● In memory data - queryable
● Continue consumption
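
The ingestion loop in the bullets above can be sketched as a toy model. This is not Pinot code: the class name, the row-count flush threshold, and the dict standing in for the deep store are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RealtimeServer:
    """Toy stand-in for a server consuming one stream partition (hypothetical)."""
    rows_per_segment: int                  # hypothetical flush threshold
    deep_store: Dict[str, List[dict]]      # dict standing in for the deep store
    in_memory: List[dict] = field(default_factory=list)
    segment_seq: int = 0

    def consume(self, event: dict) -> None:
        # Rows become queryable as soon as they are held in memory.
        self.in_memory.append(event)
        if len(self.in_memory) >= self.rows_per_segment:
            self._flush()

    def _flush(self) -> None:
        # Periodically turn the in-memory rows into a segment and persist it,
        # then keep consuming into a fresh in-memory buffer.
        name = f"seg{self.segment_seq}"
        self.deep_store[name] = list(self.in_memory)
        self.in_memory.clear()
        self.segment_seq += 1

    def query_count(self) -> int:
        # Simplification: count persisted rows plus rows still in memory.
        persisted = sum(len(rows) for rows in self.deep_store.values())
        return persisted + len(self.in_memory)


deep_store: Dict[str, List[dict]] = {}
server = RealtimeServer(rows_per_segment=3, deep_store=deep_store)
for offset in range(7):
    server.consume({"offset": offset})
```

After seven events with a threshold of three rows, two segments have been persisted and one row is still only in memory, yet all seven rows are visible to queries.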

Kafka Consumer Groups
Approach 1

Kafka Consumer Group based design
● Each consumer consumes from 1 or more partitions
● Periodic checkpointing
● Kafka Rebalancer
● Fault tolerant consumption
[Diagram labels: Consumer Group with a Kafka Consumer on Server 1 and on Server 2; 3 partitions; time axis; seg1, seg2; Checkpoint 350; Checkpoint 400; Kafka Rebalancer; Server1 starts consuming from 0 and 2]

Challenges with Capacity Expansion
● Add Server3
● Partition 2 moves to Server 3
● Server3 begins consumption from 400
● Duplicate Data!
[Diagram labels: Consumer Group with a Kafka Consumer on S1, Server 2 and Server 3; 3 partitions; time axis; seg1, seg2; Checkpoint 350; Checkpoint 400; Kafka Rebalancer; Server1 starts consuming from 0 and 2]
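
One way the duplicate data on this slide can arise is sketched below. The checkpoint value 400 is from the slide. The offsets 450 and 500, and the assumption that the previous owner of partition 2 keeps serving rows it consumed after its last checkpoint, are hypothetical and only there to make the overlap visible.

```python
def rows_visible_to_queries(per_server_rows):
    """Concatenate the offsets that each server would serve for one partition."""
    visible = []
    for rows in per_server_rows.values():
        visible.extend(rows)
    return visible


last_checkpoint = 400            # from the slide
previous_owner_progress = 450    # hypothetical: consumed past the checkpoint

# The previous owner of partition 2 still holds everything it consumed,
# including rows after its last checkpoint.
previous_owner_rows = list(range(0, previous_owner_progress))

# After the rebalance, Server 3 resumes partition 2 from the checkpoint and,
# in this sketch, catches up to offset 500 (also hypothetical).
server3_rows = list(range(last_checkpoint, 500))

visible = rows_visible_to_queries(
    {"previous_owner": previous_owner_rows, "Server3": server3_rows}
)
duplicates = sorted(set(previous_owner_rows) & set(server3_rows))
```

In this sketch, every offset between the checkpoint and the previous owner's actual progress is served twice, which is the kind of incorrect result the slide calls out.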

Multiple Consumer Groups
Fault tolerant
● No control over partitions assigned to consumer
● No control over checkpointing
● Segment disparity
● Storage inefficient
[Diagram labels: Consumer Group 1, Consumer Group 2; 3 partitions, 2 replicas; Deep store; Queries]

Operational Complexity
● Disable consumer group for node failure/capacity changes
[Diagram labels: Consumer Group 1, Consumer Group 2; 3 partitions, 2 replicas; Queries]

Scalability limitation
● Scalability limited by #partitions
● Cost inefficient
[Diagram labels: Consumer Group 1, Consumer Group 2; 3 partitions, 2 replicas; Queries; Server 4: Idle]
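
The partition limit comes from how Kafka consumer groups work: within one group, a partition is assigned to at most one consumer, so consumers beyond the partition count have nothing to read. A minimal sketch of that arithmetic (the function name is made up for illustration):

```python
def idle_consumers(num_partitions: int, num_consumers: int) -> int:
    """Consumers in one group that receive no partition.

    Within a consumer group each partition goes to at most one consumer,
    so any consumer beyond the partition count stays idle.
    """
    return max(0, num_consumers - num_partitions)


# With the 3 partitions from the slide, a fourth server in a group is idle.
idle_by_group_size = {n: idle_consumers(3, n) for n in (2, 3, 4)}
```

Adding the fourth server increases cost without increasing consumption capacity, which matches the idle Server 4 on the slide.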

Single node in a Consumer Group
● Eliminates incorrect results
● Reduced operational complexity
● Limited by capacity of 1 node
● Storage overhead
● Scalability limitation
The only deployment model that worked
[Diagram labels: Server 1 with Consumer Group 1, Server 2 with Consumer Group 2; 3 partitions, 2 replicas]

Issues with Kafka Consumer Group based solution

                             Incorrect  Operational  Storage   Limited      Expensive
                             Results    Complexity   overhead  scalability
Multi-node Consumer Group    Y          Y            Y         Y            Y
Single-node Consumer Group   -          -            Y         Y            Y

Problem 1
Lack of control with Kafka Rebalancer
Solution
Take control of partition assignment

Problem 2
Segment Disparity due to checkpointing mechanism
Solution
Take control of checkpointing

Partition Level Consumption
Approach 2

[Diagram: Controller; servers S1, S2, S3; 3 partitions, 2 replicas]
Cluster State
Partition  Server  State      Start offset  End offset
0          S1      CONSUMING  20
0          S2      CONSUMING  20
1          S3      CONSUMING  20
1          S1      CONSUMING  20
2          S2      CONSUMING  20
2          S3      CONSUMING  20
● Single coordinator across all replicas
● All actions determined by cluster state

[Diagram: Controller; servers S1, S2, S3; Deep Store; 3 partitions, 2 replicas; Commit, with labels 80 and 110]
Cluster State
Partition  Server  State      Start offset  End offset
0          S1      ONLINE     20            110
0          S2      ONLINE     20            110
1          S3      CONSUMING  20
1          S1      CONSUMING  20
2          S2      CONSUMING  20
2          S3      CONSUMING  20
● Only 1 server persists segment to deep store
● Only 1 copy stored

[Diagram: Controller; servers S1, S2, S3; Deep Store; 3 partitions, 2 replicas]
Cluster State
Partition  Server  State      Start offset  End offset
0          S1      ONLINE     20            110
0          S2      ONLINE     20            110
1          S3      CONSUMING  20
1          S1      CONSUMING  20
2          S2      CONSUMING  20
2          S3      CONSUMING  20
● All other replicas
  ○ Download from deep store
● Segment equivalence

[Diagram: Controller; servers S1, S2, S3; Deep Store; 3 partitions, 2 replicas]
Cluster State
Partition  Server  State      Start offset  End offset
0          S1      ONLINE     20            110
0          S2      ONLINE     20            110
1          S3      CONSUMING  20
1          S1      CONSUMING  20
2          S2      CONSUMING  20
2          S3      CONSUMING  20
0          S1      CONSUMING  110
0          S2      CONSUMING  110
● New segment state created
● Start where previous segment left off
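
The commit, download and roll-over steps from the last few slides can be sketched as a small coordinator model. This is a toy, not Pinot's actual completion protocol: the class and method names are hypothetical, the choice of S1 as the committing server is arbitrary, and the end offset is treated as exclusive by assumption. Only the offsets 20 and 110 come from the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SegmentState:
    partition: int
    servers: List[str]              # replicas hosting this segment
    state: str                      # "CONSUMING" or "ONLINE"
    start_offset: int
    end_offset: Optional[int] = None


class Controller:
    """Toy single coordinator; all names here are hypothetical."""

    def __init__(self) -> None:
        self.cluster_state: List[SegmentState] = []
        self.deep_store: Dict[str, List[int]] = {}

    def start_segment(self, partition: int, servers: List[str],
                      start_offset: int) -> SegmentState:
        seg = SegmentState(partition, list(servers), "CONSUMING", start_offset)
        self.cluster_state.append(seg)
        return seg

    @staticmethod
    def _key(seg: SegmentState) -> str:
        return f"partition{seg.partition}_from{seg.start_offset}"

    def commit(self, seg: SegmentState, committer: str,
               end_offset: int) -> SegmentState:
        if committer not in seg.servers:
            raise ValueError(f"{committer} does not host this segment")
        # Only the committer persists the segment: one copy in the deep store.
        self.deep_store[self._key(seg)] = list(range(seg.start_offset, end_offset))
        seg.state = "ONLINE"
        seg.end_offset = end_offset
        # The next consuming segment starts where this one left off.
        return self.start_segment(seg.partition, seg.servers, end_offset)

    def download(self, seg: SegmentState) -> List[int]:
        # Other replicas fetch the committed copy, so every replica ends up
        # with identical contents (segment equivalence).
        return list(self.deep_store[self._key(seg)])


controller = Controller()
seg = controller.start_segment(partition=0, servers=["S1", "S2"], start_offset=20)
next_seg = controller.commit(seg, committer="S1", end_offset=110)
s2_copy = controller.download(seg)
```

Because the end offset is recorded once in the cluster state and every replica takes the same committed copy, the replicas cannot drift apart the way independently checkpointing consumer groups can.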

[Diagram: Controller; servers S1, S2, S3; Deep Store; 3 partitions, 2 replicas]
Cluster State
Partition  Server  State      Start offset  End offset
0          S1      ONLINE     20            110
0          S2      ONLINE     20            110
1          S3      ONLINE     20            120
1          S1      ONLINE     20            120
2          S2      ONLINE     20            100
2          S3      ONLINE     20            100
0          S1      CONSUMING  110
0          S2      CONSUMING  110
1          S3      CONSUMING  120
1          S1      CONSUMING  120
2          S2      CONSUMING  100
2          S3      CONSUMING  100
● Each partition independent of others

Capacity expansion
[Diagram: Controller; servers S1, S2, S3 and new server S4; Deep Store; 3 partitions, 2 replicas]
● Consuming segment - Restart consumption using offset in cluster state
● Pinot segment - Download from deep store
● Easy to handle changes in replication/partitions
● No duplicates!
● Cluster state table updated
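
A minimal sketch of how a newly added server could derive its work purely from the cluster state table, following the bullets above. The row layout, the function name, and the choice to hand S4 a replica of partition 2 are assumptions. The offsets (20 to 100 for the completed segment, 100 for the consuming one) are taken from the earlier cluster state slide.

```python
from typing import List, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    partition: int
    server: str
    state: str                 # "ONLINE" or "CONSUMING"
    start: int
    end: Optional[int]


def bootstrap_plan(cluster_state: List[Row], server: str) -> List[Tuple]:
    """Derive what a server has to do from the cluster state alone."""
    plan = []
    for row in cluster_state:
        if row.server != server:
            continue
        if row.state == "ONLINE":
            # Completed segment: fetch the single committed copy.
            plan.append(("download", row.partition, row.start, row.end))
        elif row.state == "CONSUMING":
            # Consuming segment: restart from the recorded start offset.
            plan.append(("consume_from", row.partition, row.start))
    return plan


# Hypothetical table after a replica of partition 2 is reassigned to S4.
cluster_state = [
    Row(2, "S2", "ONLINE", 20, 100),
    Row(2, "S4", "ONLINE", 20, 100),
    Row(2, "S2", "CONSUMING", 100, None),
    Row(2, "S4", "CONSUMING", 100, None),
]
s4_plan = bootstrap_plan(cluster_state, "S4")
```

Since S4 starts the consuming segment at the same recorded offset as S2 and downloads the completed segment instead of re-consuming it, no offset range is served twice.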

Node failures
[Diagram: Controller; servers S1, S2, S3, S4; 3 partitions, 2 replicas]
● At least 1 replica still alive
● No complex operations

Scalability
[Diagram: Controller; servers S1 to S6, grouped into Consuming Servers and Completed Servers; 3 partitions, 2 replicas]
● Easily add nodes
● Segment equivalence = Smart segment assignment + Smart query routing
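
Segment equivalence means any replica of a segment can answer for it, so a router only needs to pick one server per segment. The sketch below picks a replica at random. This is an illustrative assumption, not a description of the routing strategy Pinot's brokers actually use, and the segment and server names are hypothetical.

```python
import random
from typing import Dict, List


def route(segment_replicas: Dict[str, List[str]],
          rng: random.Random) -> Dict[str, List[str]]:
    """Pick exactly one server per segment; return server -> segments to query."""
    routing: Dict[str, List[str]] = {}
    for segment, servers in sorted(segment_replicas.items()):
        # Replicas are equivalent, so any one of them gives the same answer.
        chosen = rng.choice(sorted(servers))
        routing.setdefault(chosen, []).append(segment)
    return routing


# Hypothetical placement of three segments, two replicas each.
segment_replicas = {
    "p0_seg0": ["S1", "S2"],
    "p1_seg0": ["S3", "S1"],
    "p2_seg0": ["S2", "S3"],
}
routing = route(segment_replicas, random.Random(0))
```

Each segment is queried on exactly one server, so results are neither duplicated nor dropped, whichever replicas the router happens to choose.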

Summary

                             Incorrect  Operational  Storage   Limited      Expensive
                             Results    Complexity   overhead  scalability
Multi-node Consumer Group    Y          Y            Y         Y            Y
Single-node Consumer Group   -          -            Y         Y            Y
Partition Level Consumers    -          -            -         -            -

Q&A
pinot.apache.org
@apachepinot
