Interactive Realtime Dashboards on Data Streams
Nishant Bangarwa
Hortonworks
Druid Committer, PMC
June 2017
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sample Data Stream : Wikipedia Edits
Demo: Wikipedia Real-Time Dashboard (Accelerated 30x)
Step by Step Breakdown
Consume Events
Enrich / Transform (Add Geolocation from IP Address)
Store Events
Visualize Events
Sample Event : [[Eoghan Harris]] https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607 * 7.114.169.238 * (+167) Added fact
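The "Consume Events" step has to turn a raw line like the sample event into named fields before anything can be enriched or stored. A minimal sketch follows, assuming every event has the same layout as that one sample (page in double brackets, diff URL, user or IP, signed byte delta, comment). The pattern and the helper name `parse_edit` are illustrative and not taken from the talk.

```python
import re

# Assumed layout, inferred from the single sample event:
# [[page]] diff-url * user-or-ip * (+/-delta) comment
EVENT_PATTERN = re.compile(
    r"\[\[(?P<page>.+?)\]\]\s+"
    r"(?P<url>\S+)\s+\*\s+"
    r"(?P<user>.+?)\s+\*\s+"
    r"\((?P<delta>[+-]\d+)\)\s*"
    r"(?P<comment>.*)"
)


def parse_edit(line: str) -> dict:
    """Split one raw edit line into named fields (hypothetical helper)."""
    match = EVENT_PATTERN.match(line)
    if match is None:
        raise ValueError(f"unrecognized event: {line!r}")
    event = match.groupdict()
    event["delta"] = int(event["delta"])  # "+167" becomes 167
    return event


sample = (
    "[[Eoghan Harris]] "
    "https://en.wikipedia.org/w/index.php?diff=792474242&oldid=787592607"
    " * 7.114.169.238 * (+167) Added fact"
)
event = parse_edit(sample)
```

Anonymous edits carry an IP address in the user field, which is what makes the geolocation enrichment step possible.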
Required Components
 Event Flow
 Event Processing
 Data Store
 Visualization Layer
Event Flow
Event Flow : Requirements
Event Producers → Queue → Event Consumers
 Low latency
 High Throughput
 Failure Handling
 Message delivery guarantees: at-least-once, exactly-once, at-most-once
 Message ordering
 Scalability
 Fault tolerant
Apache Kafka
 Low Latency
 High Throughput
 Message Delivery guarantees
 At-least-once
 Exactly-once (fully introduced in Apache Kafka v0.11.0, June 2017)
 Reliable design to Handle Failures
 Message Acks between producers and brokers
 Data Replication on brokers
 Consumers can Read from any desired offset
 Handle multiple producers/consumers
 Scalable
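The difference between the delivery guarantees comes down to when a consumer records its offset relative to when it processes the message. The following is a toy in-memory model, not the Kafka client API: the function `consume` and its parameters are hypothetical, and a single simulated crash stands in for a consumer failure.

```python
def consume(log, commit_first, crash_at):
    """Replay a toy log, crashing once at offset `crash_at`.

    commit_first=True  -> offset committed before processing (at-most-once)
    commit_first=False -> offset committed after processing (at-least-once)
    """
    committed = 0      # next offset to read after a restart
    processed = []     # side effects visible downstream
    crashed = False
    while committed < len(log):
        offset = committed
        will_crash = offset == crash_at and not crashed
        if commit_first:
            committed = offset + 1
            if will_crash:
                crashed = True   # crash before processing: message is lost
                continue
            processed.append(log[offset])
        else:
            processed.append(log[offset])
            if will_crash:
                crashed = True   # crash before commit: message is replayed
                continue
            committed = offset + 1
    return processed


log = ["e0", "e1", "e2"]
at_most_once = consume(log, commit_first=True, crash_at=1)
at_least_once = consume(log, commit_first=False, crash_at=1)
```

In this model at-most-once drops `e1` and at-least-once delivers it twice. Being able to resume from a stored offset is what makes the replay possible in the first place.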
Event Processing
Event Processing : Requirements
 Consume-Process-Produce Pattern
 Enrich and Transform event streams
 Windowing
 Apply business logic
 Consume and join multiple streams into a single stream
 Failure Handling
 Scalability
Source → (Consume) → Process → (Produce) → Sink
Kafka Streams
 Rich Lightweight Stream processing library
 Event-at-a-time
 Stateful processing : windowing, joining, aggregation operators
 Local state using RocksDB
 Backed by changelog in Kafka
 Highly scalable, distributed, fault tolerant
 Compared to a standard Kafka consumer:
 Higher level: faster to build a sophisticated app
 Less control for very fine-grained consumption
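Event-at-a-time stateful processing can be illustrated with a tumbling-window count. This is a plain-Python sketch rather than Kafka Streams code: the dictionary stands in for the local state store, and the 60-second window size and the `(epoch_seconds, key)` event shape are assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed window size for the illustration


def tumbling_counts(events):
    """Count events per (window_start, key), one event at a time."""
    state = defaultdict(int)  # stand-in for a local state store
    for epoch_seconds, key in events:
        # Align each event to the start of its fixed-size window.
        window_start = epoch_seconds - (epoch_seconds % WINDOW_SECONDS)
        state[(window_start, key)] += 1
    return dict(state)


events = [(5, "en"), (42, "en"), (61, "en"), (75, "de"), (119, "en")]
counts = tumbling_counts(events)
```

Each event updates exactly one counter, so the state can be rebuilt by replaying the same events in order, which is the idea behind backing local state with a changelog.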
Kafka Streams : Wikipedia Data Enrichment
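The code shown on this slide did not survive in the transcript. As a stand-in, here is a plain-Python sketch of the enrichment step (not the Kafka Streams API): it reads events from a list playing the role of the wikipedia-raw topic and writes enriched copies to a list playing the role of the wikipedia-enriched topic. The lookup table `GEO_TABLE` and its placeholder city and country values are hypothetical; a real deployment would query a GeoIP database.

```python
import ipaddress

# Hypothetical lookup table with placeholder values.
GEO_TABLE = {
    ipaddress.ip_network("7.114.0.0/16"): {"city": "ExampleCity", "country": "EX"},
}


def enrich(event):
    """Return a copy of the event with city/country added when the IP is known."""
    enriched = dict(event)
    try:
        address = ipaddress.ip_address(event["user"])
    except ValueError:
        return enriched  # registered user name, not an IP address
    for network, location in GEO_TABLE.items():
        if address in network:
            enriched.update(location)
            break
    return enriched


raw_topic = [
    {"page": "Eoghan Harris", "user": "7.114.169.238", "delta": 167},
    {"page": "Druid", "user": "SomeEditor", "delta": -4},
]
enriched_topic = [enrich(e) for e in raw_topic]
```

Events from registered users pass through unchanged, so downstream consumers should treat the geolocation fields as optional.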
Data Store
Data Store : Requirements
Processed Events → Data Store ← Queries
 Ability to ingest Streaming data
 Power Interactive dashboards
 Sub-Second Query Response time
 Ad-hoc arbitrary slicing and dicing of data
 Data Freshness
 Summarized/aggregated data is queried
 Scalability
 High Availability
Druid
 Column-oriented distributed datastore
 Sub-Second query times
 Realtime streaming ingestion
 Arbitrary slicing and dicing of data
 Automatic Data Summarization
 Approximate algorithms (HyperLogLog, theta sketches)
 Scalable to petabytes of data
 Highly available
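"Slicing and dicing" in practice means filtering on some dimensions while grouping by others. The sketch below builds a topN query in what is, to the best of my understanding, Druid's native JSON query format; it only constructs the payload and does not contact a cluster. The datasource name and the column names are assumptions chosen to mirror the talk's examples, so check field names against the Druid documentation for the version in use.

```python
import json

# Hypothetical datasource and column names, mirroring the talk's examples.
query = {
    "queryType": "topN",
    "dataSource": "wikipedia-enriched",
    "intervals": ["2011-01-01T00:00:00Z/2011-01-03T00:00:00Z"],
    "granularity": "all",
    "dimension": "page",          # group by page
    "metric": "total_added",      # rank pages by this aggregate
    "threshold": 5,               # keep the top 5 pages
    "filter": {"type": "selector", "dimension": "country", "value": "CA"},
    "aggregations": [
        {"type": "longSum", "name": "total_added", "fieldName": "sum_added"},
    ],
}
payload = json.dumps(query, indent=2)
```

Changing the `dimension` or the `filter` produces a different slice of the same data without any change to how the data was ingested.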
Suitable Use Cases
 Powering Interactive user facing applications
 Arbitrary slicing and dicing of large datasets
 User behavior analysis
 measuring distinct counts
 retention analysis
 funnel analysis
 A/B testing
 Exploratory analytics/root cause analysis
 Not interested in dumping entire dataset
Druid: Segments
 Data in Druid is stored in Segment Files.
 Partitioned by time
 Ideally, segment files are each smaller than 1GB.
 If files are large, smaller time partitions are needed.
Time →
Segment 1: Monday
Segment 2: Tuesday
Segment 3: Wednesday
Segment 4: Thursday
Segment 5_1: Friday
Segment 5_2: Friday
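The segment layout above (one segment per day, with a busy day split into several partitions) can be sketched as follows. This is an illustration only: the row-count limit `max_rows_per_segment` and the `day_partition` naming are assumptions made for the example, not Druid's actual sizing rule or segment identifiers.

```python
from collections import defaultdict


def build_segments(rows, max_rows_per_segment):
    """Partition rows by day, splitting a day that exceeds the row limit."""
    by_day = defaultdict(list)
    for row in rows:
        by_day[row["timestamp"][:10]].append(row)  # date part of ISO-8601

    segments = {}
    for day in sorted(by_day):
        day_rows = by_day[day]
        for start in range(0, len(day_rows), max_rows_per_segment):
            partition = start // max_rows_per_segment + 1
            segments[f"{day}_{partition}"] = day_rows[start:start + max_rows_per_segment]
    return segments


rows = [
    {"timestamp": "2011-01-01T00:01:35Z", "page": "Justin Bieber"},
    {"timestamp": "2011-01-01T00:04:51Z", "page": "Justin Bieber"},
    {"timestamp": "2011-01-01T00:05:35Z", "page": "Ke$ha"},
    {"timestamp": "2011-01-02T00:08:35Z", "page": "Selena Gomes"},
]
segments = build_segments(rows, max_rows_per_segment=2)
```

Because every segment covers a known time range, a query over a time interval only needs to touch the segments whose range overlaps it.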
Example Wikipedia Edit Dataset
timestamp             page           language  city     country  …  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA         10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA         15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA         32     45
2011-01-01T00:05:35Z  Ke$ha          en        Calgary  CA          17     87
2011-01-01T00:06:41Z  Ke$ha          en        Calgary  CA          43     99
2011-01-02T00:08:35Z  Selena Gomes   en        Calgary  CA          12     53
Column roles: timestamp = Timestamp; page, language, city, country = Dimensions; added, deleted = Metrics
Data Rollup
timestamp             page           language  city     country  …  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA         10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA         15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA         32     45
2011-01-01T00:05:35Z  Ke$ha          en        Calgary  CA          17     87
2011-01-01T00:06:41Z  Ke$ha          en        Calgary  CA          43     99
2011-01-02T00:08:35Z  Selena Gomes   en        Calgary  CA          12     53
Rollup by hour:
timestamp             page           language  city     country  count  sum_added  sum_deleted  min_added  max_added  …
2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA      3      57         172          10         32
2011-01-01T00:00:00Z  Ke$ha          en        Calgary  CA       2      60         186          17         43
2011-01-02T00:00:00Z  Selena Gomes   en        Calgary  CA       1      12         53           12         12
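The rollup can be reproduced in a few lines: truncate each timestamp to the hour, group by the truncated timestamp plus all dimension values, and aggregate the metrics. This is a sketch of the idea and not Druid's ingestion code; the function name `rollup_by_hour` and the tuple layout are choices made for the example. Timestamps are truncated as strings, so the values are used exactly as they appear in the slide's table.

```python
from collections import defaultdict

rows = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "en", "SF", "USA", 10, 65),
    ("2011-01-01T00:03:63Z", "Justin Bieber", "en", "SF", "USA", 15, 62),
    ("2011-01-01T00:04:51Z", "Justin Bieber", "en", "SF", "USA", 32, 45),
    ("2011-01-01T00:05:35Z", "Ke$ha", "en", "Calgary", "CA", 17, 87),
    ("2011-01-01T00:06:41Z", "Ke$ha", "en", "Calgary", "CA", 43, 99),
    ("2011-01-02T00:08:35Z", "Selena Gomes", "en", "Calgary", "CA", 12, 53),
]


def rollup_by_hour(rows):
    """Aggregate rows that share the same hour and dimension values."""
    groups = defaultdict(lambda: {
        "count": 0, "sum_added": 0, "sum_deleted": 0,
        "min_added": None, "max_added": None,
    })
    for ts, page, language, city, country, added, deleted in rows:
        hour = ts[:13] + ":00:00Z"  # truncate the timestamp to the hour
        agg = groups[(hour, page, language, city, country)]
        agg["count"] += 1
        agg["sum_added"] += added
        agg["sum_deleted"] += deleted
        agg["min_added"] = added if agg["min_added"] is None else min(agg["min_added"], added)
        agg["max_added"] = added if agg["max_added"] is None else max(agg["max_added"], added)
    return dict(groups)


rolled = rollup_by_hour(rows)
```

Six input rows collapse to three stored rows here; the saving grows with the number of events that share an hour and a set of dimension values.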
Dictionary Encoding
 Create and store Ids for each value
 e.g. page column
 Values - Justin Bieber, Ke$ha, Selena Gomes
 Encoding - Justin Bieber : 0, Ke$ha: 1, Selena Gomes: 2
 Column Data - [0 0 0 1 1 2]
 city column - [0 0 0 1 1 1]
timestamp             page           language  city     country  …  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA         10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA         15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA         32     45
2011-01-01T00:05:35Z  Ke$ha          en        Calgary  CA          17     87
2011-01-01T00:06:41Z  Ke$ha          en        Calgary  CA          43     99
2011-01-02T00:08:35Z  Selena Gomes   en        Calgary  CA          12     53
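Dictionary encoding replaces each string with a small integer id and stores the string once. The sketch below assigns ids in order of first appearance, which is an assumption made for simplicity; it happens to reproduce the ids shown on the slide, but a real column store may order its dictionary differently.

```python
def dictionary_encode(values):
    """Assign ids in order of first appearance; return (dictionary, ids)."""
    dictionary = {}
    encoded = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # next unused id
        encoded.append(dictionary[value])
    return dictionary, encoded


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
city = ["SF"] * 3 + ["Calgary"] * 3

page_dict, page_ids = dictionary_encode(page)
city_dict, city_ids = dictionary_encode(city)
```

The integer column is far smaller than the repeated strings and compresses well, and the ids double as positions into per-value indexes.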
Bitmap Indices
 Store Bitmap Indices for each value
 Justin Bieber -> [0, 1, 2] -> [1 1 1 0 0 0]
 Ke$ha -> [3, 4] -> [0 0 0 1 1 0]
 Selena Gomes -> [5] -> [0 0 0 0 0 1]
 Queries
 Justin Bieber or Ke$ha -> [1 1 1 0 0 0] OR [0 0 0 1 1 0] -> [1 1 1 1 1 0]
 language = en and country = CA -> [1 1 1 1 1 1] AND [0 0 0 1 1 1] -> [0 0 0 1 1 1]
 Indexes compressed with Concise or Roaring encoding
timestamp             page           language  city     country  …  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  en        SF       USA         10     65
2011-01-01T00:03:63Z  Justin Bieber  en        SF       USA         15     62
2011-01-01T00:04:51Z  Justin Bieber  en        SF       USA         32     45
2011-01-01T00:05:35Z  Ke$ha          en        Calgary  CA          17     87
2011-01-01T00:06:41Z  Ke$ha          en        Calgary  CA          43     99
2011-01-02T00:08:35Z  Selena Gomes   en        Calgary  CA          12     53
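The bitmap queries above can be checked with a small sketch. Bitmaps are held here as plain Python lists of 0/1, one entry per row; this is uncompressed and only illustrates the boolean logic, not Concise or Roaring encoding. The helper names are hypothetical.

```python
def build_bitmaps(column):
    """Map each distinct value to a 0/1 list marking the rows that contain it."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps.setdefault(value, [0] * len(column))[row] = 1
    return bitmaps


def bitmap_or(a, b):
    return [x | y for x, y in zip(a, b)]


def bitmap_and(a, b):
    return [x & y for x, y in zip(a, b)]


page = ["Justin Bieber"] * 3 + ["Ke$ha"] * 2 + ["Selena Gomes"]
language = ["en"] * 6
country = ["USA"] * 3 + ["CA"] * 3

page_bm = build_bitmaps(page)
bieber_or_kesha = bitmap_or(page_bm["Justin Bieber"], page_bm["Ke$ha"])
en_and_ca = bitmap_and(build_bitmaps(language)["en"], build_bitmaps(country)["CA"])
```

The resulting bitmap identifies the matching rows, so only those rows of the metric columns need to be read.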
Approximate Sketch Columns
timestamp             page           userid       language  city     country  …  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  user1111111  en        SF       USA         10     65
2011-01-01T00:03:63Z  Justin Bieber  user1111111  en        SF       USA         15     62
2011-01-01T00:04:51Z  Justin Bieber  user2222222  en        SF       USA         32     45
2011-01-01T00:05:35Z  Ke$ha          user3333333  en        Calgary  CA          17     87
2011-01-01T00:06:41Z  Ke$ha          user4444444  en        Calgary  CA          43     99
2011-01-02T00:08:35Z  Selena Gomes   user1111111  en        Calgary  CA          12     53
Rollup by hour:
timestamp             page           language  city     country  count  sum_added  sum_deleted  min_added  userid_sketch  …
2011-01-01T00:00:00Z  Justin Bieber  en        SF       USA      3      57         172          10         {sketch}
2011-01-01T00:00:00Z  Ke$ha          en        Calgary  CA       2      60         186          17         {sketch}
2011-01-02T00:00:00Z  Selena Gomes   en        Calgary  CA       1      12         53           12         {sketch}
Approximate Algorithms
 Store Sketch objects, instead of raw column values
 Better rollup for high cardinality columns e.g userid
 Reduced storage size
 Use Cases
 Fast approximate distinct counts
 Approximate histograms
 Funnel/retention analysis
 Limitations
 Not possible to do exact counts
 Not possible to filter on individual row values
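To make the trade-off concrete, here is a simplified K-minimum-values (KMV) sketch. It is neither HyperLogLog nor Druid's theta sketch implementation; it is a small relative of those techniques, and the class name, the use of SHA-1, and the default `k` are all choices made for this illustration. It shows the two properties that matter for rollup: sketches from different rows can be merged, and only hashes are kept, so the original user ids cannot be recovered or filtered on.

```python
import hashlib


class KMVSketch:
    """K-minimum-values distinct-count sketch (simplified illustration)."""

    def __init__(self, k=1024):
        self.k = k
        self.hashes = set()  # at most k smallest normalized hash values

    @staticmethod
    def _hash(value):
        digest = hashlib.sha1(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)

    def _trim(self):
        if len(self.hashes) > self.k:
            self.hashes = set(sorted(self.hashes)[: self.k])

    def add(self, value):
        self.hashes.add(self._hash(value))
        self._trim()

    def merge(self, other):
        merged = KMVSketch(min(self.k, other.k))
        merged.hashes = self.hashes | other.hashes
        merged._trim()
        return merged

    def estimate(self):
        if len(self.hashes) < self.k:
            return len(self.hashes)  # fewer than k distinct hashes seen
        return (self.k - 1) / max(self.hashes)


def sketch_of(user_ids):
    sketch = KMVSketch()
    for user_id in user_ids:
        sketch.add(user_id)
    return sketch


bieber = sketch_of(["user1111111", "user1111111", "user2222222"])
kesha = sketch_of(["user3333333", "user4444444"])
gomes = sketch_of(["user1111111"])
all_pages = bieber.merge(kesha).merge(gomes)
```

Merging the three per-row sketches counts `user1111111` once even though that user appears under two pages, which a plain sum of per-row distinct counts would get wrong. Once more than `k` distinct values have been seen the result becomes an estimate rather than an exact count.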
Druid Architecture
[Architecture diagram. Labels: Streaming Data, Batch Data, Realtime Nodes, Realtime Index Tasks, Handoff, Historical Nodes, Broker Nodes]
Performance and Scalability : Fast Facts
Most Events per Day: 300 Billion Events / Day (Metamarkets)
Most Computed Metrics: 1 Billion Metrics / Min (Jolata)
Largest Cluster: 200 Nodes (Metamarkets)
Largest Hourly Ingestion: 2TB per Hour (Netflix)
Companies Using Druid
Visualization Layer
Visualization Layer : Requirements
 Rich dashboarding capabilities
 Work with multiple data sources
 Security/Access control
 Allow for extension
 Add custom visualizations
[Diagram labels: Data Store, Visualization Layer, Dashboards, User]
Superset
 Python backend
 Flask-AppBuilder
 Authentication
 Pandas for rich analytics
 SQLAlchemy as the SQL toolkit
 JavaScript frontend
 React, NVD3
 Deep integration with Druid
Superset Rich Dashboarding Capabilities: Treemaps
Superset Rich Dashboarding Capabilities: Sunburst
Superset UI Provides Powerful Visualizations
Rich library of dashboard visualizations:
Basic:
• Bar Charts
• Pie Charts
• Line Charts
Advanced:
• Sankey Diagrams
• Treemaps
• Sunburst
• Heatmaps
And More!
Wikipedia Real-Time Dashboard
[Pipeline diagram: Kafka Connect → wikipedia-raw topic → IP-to-Geolocation Processor → wikipedia-enriched topic]
Project Websites
 Kafka - http://kafka.apache.org
 Druid - http://druid.io
 Superset - http://superset.incubator.apache.org
Thank you! Questions?
 Twitter - @NishantBangarwa
 Email - nbangarwa@hortonworks.com
 Linkedin - https://www.linkedin.com/in/nishant-bangarwa
Off The Record (OTR) session
Experiences and challenges in working with Druid
at 03:25 PM - 04:10 PM on 28 July, 2017
in Room 1 MLR Convention Centre, Whitefield



Editor's Notes

  • #7: Druid Architecture
  • #18: Retention analysis
  • #29: Druid Architecture