© Hortonworks Inc. 2011 – 2017. All Rights Reserved
Future of Data Boston
Agenda
 Networking, Food and drink
 Announcements
 Main Presentation
– Unlocking Insights in Streaming Data with Open Source
 Question and Answer
 Networking and Wrap up
Announcements
 Thanks to our sponsors
– Hortonworks
– Pivotal Labs
 What topics would you like to hear about?
About Carolyn Duby
 Big Data Solutions Architect
 High performance data intensive systems
 Data science
 ScB ScM Computer Science, Brown University
 LinkedIn: https://www.linkedin.com/in/carolynduby/
 Twitter: @carolynduby GitHub: carolynduby
 Hortonworks
– Innovation through data
– Enterprise ready, 100% open source, modern data platforms
– Engineering, Technical Support, Professional Services, Training
Streaming Analytics
 Streaming data is valuable
– Make decisions in real time
– Gain new understanding of business
– Detect/resolve/predict/warn of conditions
– Recommend at the right moment
Not very easy
 Streams flow in at variable rates
– High and low points
– Data streams arrive with different latency
– Often can’t control input rate
 Lots of different choices of libraries
– Storm, Spark Streaming, Samza, Flink…. Oh my!
 Complex time series analytics
– Windowing
– Joining streams
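To make the windowing and joining bullets concrete, here is a minimal sketch in plain Python. It is not Storm, Spark Streaming, Samza or Flink code. It assumes events are `(timestamp_seconds, key, value)` tuples, and the function names (`window_start`, `tumbling_window_avg`, `join_by_window`) and sample data are hypothetical.

```python
from collections import defaultdict

def window_start(ts, size):
    """Start (in seconds) of the tumbling window that contains timestamp ts."""
    return ts - (ts % size)

def tumbling_window_avg(events, size):
    """Average value per (window start, key) over fixed, non-overlapping windows."""
    sums = defaultdict(lambda: [0.0, 0])  # (window, key) -> [running total, count]
    for ts, key, value in events:
        bucket = sums[(window_start(ts, size), key)]
        bucket[0] += value
        bucket[1] += 1
    return {k: total / count for k, (total, count) in sums.items()}

def join_by_window(left, right, size):
    """Inner-join two streams on (window start, key)."""
    right_index = defaultdict(list)
    for ts, key, value in right:
        right_index[(window_start(ts, size), key)].append(value)
    joined = []
    for ts, key, value in left:
        w = window_start(ts, size)
        for other in right_index.get((w, key), []):
            joined.append((w, key, value, other))
    return joined

# Hypothetical sample streams: speed readings and weather observations per truck.
speeds = [(0, "truck-1", 60.0), (30, "truck-1", 70.0), (65, "truck-1", 80.0)]
weather = [(10, "truck-1", "rain"), (70, "truck-1", "clear")]

avg_speed = tumbling_window_avg(speeds, size=60)
speed_with_weather = join_by_window(speeds, weather, size=60)
```

A real engine also has to deal with late and out-of-order events, which this sketch ignores by bucketing purely on the event timestamp.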
Components
 Schema registry
– AVRO schemas for streaming data
 Model registry
– Register machine learning models
 Streaming Analytics Manager
– Build and monitor streaming applications
 Superset
– Visualize streaming time series data
 Druid
– Store and aggregate time series data
 Streaming Substrate – Kafka, Storm, HDFS
Reference Architecture: Real-time Streaming Analytics
Trucking company w/ large fleet of international trucks
A truck generates millions of events for a given route; an event could be:
 ‘Normal’ events: starting / stopping of the vehicle
 ‘Violation’ events: speeding, excessive acceleration and braking, unsafe tail distance
 ‘Speed’ events: the speed of a driver, reported every minute
The company uses an application that monitors truck locations and violations from the truck/driver in real time.
Analysts query a broad history to understand if today’s violations are part of a larger problem with specific routes, trucks, or drivers (Route? Truck? Driver?).
Real-time Analytics Architecture with HDF
[Architecture diagram. Labels: Sensor Sources (Truck Sensors); Cloud Instances in Different Geo Locations: China (IBM), Germany (Azure), US (Amazon); Flow Management Clusters; Ingress Gateway; NiFi Site-to-Site Protocol; Egress Gateway; Event Broker Cluster; Stream Analytics Cluster; Ingest Streams; Generate Insights; Real-Time Apps; Real-time Apps & Exploration Platform; Centralized Schema Repository]
Enterprise Services:
Schema Registry
#1: Schema Registry: The Problem & Solution Defined
Problem Statement:
 There is no centralized store to manage schemas for event data. A schema has to be hardcoded, passed with the data, or inferred, so producers and consumers cannot evolve at different rates.
Solution:
 A new component in the HDF platform: Hortonworks Schema Registry
 A shared repository of schemas that applications use to save or retrieve schemas for the data they need to access, so they can interact with each other flexibly
Why does this matter?
 Meets governance and operations requirements with a centralized registry to manage event schemas.
 Provides a reusable schema and avoids attaching a schema to every payload, which allows consumers and producers to evolve at different rates.
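The idea of sending a schema reference instead of the schema itself can be sketched in a few lines. This is a toy in-memory stand-in, not the Hortonworks Schema Registry client API: the class `InMemorySchemaRegistry`, the `serialize`/`deserialize` helpers, the dict-based schema format (field name mapped to default value) and the `truck_speed_event` schema are all assumptions made for illustration, and JSON is used in place of Avro.

```python
import json

class InMemorySchemaRegistry:
    """Toy stand-in for a shared schema repository (hypothetical API)."""

    def __init__(self):
        self._versions = {}  # schema name -> list of schemas, index = version - 1

    def register(self, name, schema):
        """Store a new schema version and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(schema)
        return len(versions)

    def get(self, name, version):
        return self._versions[name][version - 1]

def serialize(name, version, record):
    """The message carries only a schema reference, not the schema itself."""
    return json.dumps({"schema": name, "version": version, "data": record})

def deserialize(registry, message, reader_schema):
    """Decode with the writer's schema, then fill reader defaults for missing fields."""
    envelope = json.loads(message)
    writer_schema = registry.get(envelope["schema"], envelope["version"])
    record = {f: envelope["data"][f] for f in writer_schema["fields"]}
    for field, default in reader_schema["fields"].items():
        record.setdefault(field, default)
    # Keep only the fields the reader knows about.
    return {f: record[f] for f in reader_schema["fields"]}

registry = InMemorySchemaRegistry()
v1 = registry.register("truck_speed_event", {"fields": {"driverId": None, "speed": None}})
v2 = registry.register(
    "truck_speed_event",
    {"fields": {"driverId": None, "speed": None, "route": "unknown"}},
)

# An old producer (v1) and a newer consumer (v2) still interoperate.
old_message = serialize("truck_speed_event", v1, {"driverId": 11, "speed": 72})
read_by_new = deserialize(registry, old_message, registry.get("truck_speed_event", v2))

# A newer producer (v2) and an old consumer (v1) also interoperate.
new_message = serialize(
    "truck_speed_event", v2, {"driverId": 11, "speed": 72, "route": "Route 17"}
)
read_by_old = deserialize(registry, new_message, registry.get("truck_speed_event", v1))
```

Because each message only names a schema and version, producers and consumers can upgrade independently as long as new fields come with defaults.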
How Schema Registry Works with the Rest of the Platform
 NiFi Processors for Schema Registry
– Schema aware flow management (record readers/writers)
– Live schema reference / automated schema conversion
 Streaming Analytics Manager processors for Schema Registry
– For example: look up the schema of a Kafka topic
– Context/schema aware user experience shortens time-to-market for building stream apps
 Atlas integration with Schema Registry (in a future release)
– Just as Atlas pulls schema info from the Hive Metastore, Atlas will be able to capture schema, format and semantic metadata from events in HDF via the registry.
Schema Registry Demo
Stream Processing
Why SAM?
Make it a delightful experience to build streaming analytics applications.
Provide the same experience for streaming analytics that developers have today with Apache NiFi/MiNiFi.
Stream Processing – Introducing Streaming Analytics Manager (SAM)
Streaming Analytics Manager
 Design, develop, deploy and manage streaming analytics apps with drag-and-drop ease
– Build streaming analytics applications that do event correlation, context enrichment, complex pattern matching and analytical aggregations, and that create alerts/notifications when insights are discovered.
– Supports multiple streaming substrates/engines (e.g., Storm, Spark Streaming)
– Extensibility is a first-class citizen (add custom sinks, processors, spouts, etc.)
SAM’s 3 Modules for 3 Different Personas in the Enterprise
Stream Ops Module for IT Operations
 Service Pool Abstraction
 Create and manage different environments in which individual streaming applications will be built
 Environments consist of services such as HDFS, Kafka and Storm from different service pools
 Save time and reduce operational overhead with the same drag-and-drop paradigm as the Stream Builder module
 SAM takes away the complexity of deploying secure streaming analytics on a Kerberized cluster
Stream Builder Module for App Developers
 Builder components, shown on the canvas palette, are the building blocks used by the app developer to build streaming apps.
 Drag and drop to build a working streaming application without writing a single line of code.
 4 Types of Components: Sources, Processors, Sinks and Custom
SAM is All about Doing Real-Time Analytics on the Stream
 Real-Time Prescriptive Analytics: What should we do right now?
 Real-Time Predictive Analytics: What could happen now/soon?
 Real-Time Descriptive Analytics: What is happening right now?
Real-Time Prescriptive Analytics
 Question: What should we do right now?
 Context: It is rainy, the driver has been on the road for 12 hours, and he has had 30 high-speed alerts over a 3-minute window in the last 2 hours.
 Answer: Dispatch a radio call to the driver to slow down
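The prescriptive rule above can be expressed as a small predicate. The sketch below is plain Python rather than a SAM rule definition; the thresholds come from the slide, while the function names (`has_alert_burst`, `recommend_action`), the use of epoch-style seconds for timestamps, and the sample alert times are assumptions.

```python
from collections import deque

WINDOW_SECONDS = 3 * 60         # 3-minute window from the slide
LOOKBACK_SECONDS = 2 * 60 * 60  # only consider the last 2 hours
ALERT_THRESHOLD = 30            # speeding alerts needed inside one window
MAX_HOURS_ON_ROAD = 12

def has_alert_burst(alert_times, now):
    """True if any 3-minute window in the last 2 hours holds >= 30 speeding alerts."""
    window = deque()
    recent = sorted(t for t in alert_times if now - LOOKBACK_SECONDS <= t <= now)
    for ts in recent:
        window.append(ts)
        # Drop alerts that fall outside the window ending at ts.
        while ts - window[0] >= WINDOW_SECONDS:
            window.popleft()
        if len(window) >= ALERT_THRESHOLD:
            return True
    return False

def recommend_action(is_raining, hours_on_road, alert_times, now):
    """Return the prescribed action, or None when the rule does not fire."""
    if (is_raining
            and hours_on_road >= MAX_HOURS_ON_ROAD
            and has_alert_burst(alert_times, now)):
        return "Dispatch a radio call to the driver to slow down"
    return None

# Hypothetical alert timestamps (seconds).
now = 10_000
burst = [now - 600 + 5 * i for i in range(30)]     # 30 alerts within 145 seconds
sparse = [now - 7000 + 200 * i for i in range(30)]  # 30 alerts, 200 seconds apart

action_for_burst = recommend_action(True, 12, burst, now)
action_for_sparse = recommend_action(True, 12, sparse, now)
```

Keeping the window in a deque means each alert is appended and removed at most once, so the check stays linear in the number of alerts.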
Real-Time Predictive Analytics
 Question: No violation events but what might happen that I need to be worried about?
 My data science team has a model that can predict that based on
– Weather
– Roads
– Driver HR info like driver certification status, wagePlan
– Driver timesheet info like hours, and miles logged over the last week
Real-Time Predictive Analytics
1. Model Registry: Export the Spark MLlib model and import it into HDF’s Model Registry
2. Enrich with Features: Use SAM’s enrich/custom processors to enrich the event with the features required for the model
3. Transform/Normalize: Use SAM’s projection/custom processors to transform/normalize the streaming event and the features required for the model
4. Score Model: Use SAM’s PMML processor to score the model for each stream event with its required features
5. Alert / Notify / Action: Use SAM’s rule and notification processors to alert, notify and take action using the results of the model
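Steps 2 through 5 form a simple per-event pipeline, which can be sketched as chained functions. This is an illustration only: it does not use SAM processors or PMML, the lookup tables and field names are hypothetical, and the logistic weights are placeholders rather than a trained model (a real model would be loaded from the model registry in step 1).

```python
import math

# Hypothetical lookup tables standing in for the enrichment sources.
DRIVER_HR = {11: {"certified": True, "wage_plan": "miles"}}
DRIVER_TIMESHEET = {11: {"hours_last_week": 70, "miles_last_week": 3300}}

# Placeholder weights, not a trained model.
WEIGHTS = {"bias": -3.0, "foggy": 1.5, "uncertified": 1.0, "hours_norm": 2.5}

def enrich(event):
    """Step 2: add HR and timesheet features to the raw event."""
    driver = event["driver_id"]
    return {**event, **DRIVER_HR[driver], **DRIVER_TIMESHEET[driver]}

def normalize(event):
    """Step 3: turn the enriched event into numeric model features."""
    return {
        "foggy": 1.0 if event["weather"] == "fog" else 0.0,
        "uncertified": 0.0 if event["certified"] else 1.0,
        "hours_norm": min(event["hours_last_week"] / 100.0, 1.0),
    }

def score(features):
    """Step 4: logistic score between 0 and 1 (stand-in for model scoring)."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def alert(event, probability, threshold=0.5):
    """Step 5: emit a notification when the predicted risk crosses the threshold."""
    if probability >= threshold:
        return f"Driver {event['driver_id']}: predicted violation risk {probability:.2f}"
    return None

def process(event):
    """Run one stream event through enrich -> normalize -> score -> alert."""
    enriched = enrich(event)
    return alert(enriched, score(normalize(enriched)))

foggy_alert = process({"driver_id": 11, "weather": "fog"})
clear_alert = process({"driver_id": 11, "weather": "clear"})
```

Keeping each step a separate function mirrors the way the slide assigns each step to its own processor, so any one stage can be swapped without touching the others.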
Real-Time Prescriptive Analytics for Business Analysts
 A tool to create time-series and real-time analytics dashboards, charts and graphs
 30+ visualization charts out of the box with customization capability
 Druid is the Analytics Engine that powers the Stream Insight Module.
Druid
What Is Druid?
Druid is a distributed, real-time, column-oriented datastore
designed to quickly ingest and index large amounts of data
and make it available for real-time query.
Features:
• Streaming Data Ingestion
• Sub-Second Queries
• Merge Historical and Real-Time Data
• Approximate Computation
TECHNICAL PREVIEW: 2.6.2
GA: 2.6.3
Cool stuff you can do with Druid (and pretty much nothing else)
 Real Time Ingest and Query At Scale
– Scale: 100m+ events per second with highly concurrent queries.
– Stream data from Kafka to Druid and query it as it arrives.
 Use cases:
– Real-time bidding / market making.
– Real-time analytics on clickstream data.
– IoT monitoring applications.
– Real-time dashboards and KPI tracking.
 Learn how PayPal hypercharged self-service analytics:
– https://www.slideshare.net/anilmadan902/paypal-business-intelligence-and-real-time-analytics
Cool stuff you can do with Druid (and pretty much nothing else)
 Datasketches: Fast, multi-dimensional approximate set intersections at high scale.
– Question: Is a CPM of $7.00 a good deal for my iOS app, popular among middle-aged users in the US?
– Decision Support: How many iOS users, in the US, age range 30-45 visited in the last week?
 Use cases:
– Targeted advertising / offer management.
– Personalized recommendations.
– Anything aimed at a segment or an individual.
 Learn how Nielsen Marketing Cloud takes Micro Targeting to the next level with Druid:
– https://www.slideshare.net/ItaiYaffe/using-druid-for-interactive-count-distinct-queries-at-scale
[Venn diagram: Age Range, Country, Mobile Platforms. Small intersection: not worth it!]
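To show how an approximate set intersection can answer the decision-support question, here is a simplified theta-style (k-minimum-values) sketch in plain Python. It is not the DataSketches library and not Druid's implementation; the `ThetaSketch` class, the `hash01` helper and the user ID ranges are assumptions for illustration.

```python
import hashlib

def hash01(item):
    """Map an item to a reproducible pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") >> 11) / float(1 << 53)

class ThetaSketch:
    """Keeps at most k small hash values; theta is the sampling threshold."""

    def __init__(self, k, theta=1.0, retained=()):
        self.k = k
        self.theta = theta
        self.retained = set(retained)

    @classmethod
    def from_items(cls, items, k):
        hashes = sorted({hash01(item) for item in items})
        if len(hashes) <= k:
            return cls(k, 1.0, hashes)         # small set: kept exactly
        return cls(k, hashes[k], hashes[:k])   # theta = (k+1)-th smallest hash

    def estimate(self):
        """Estimated number of distinct items represented by this sketch."""
        return len(self.retained) / self.theta

    def intersect(self, other):
        """Sketch of the intersection, sampled at the smaller threshold."""
        theta = min(self.theta, other.theta)
        common = {h for h in self.retained & other.retained if h < theta}
        return ThetaSketch(min(self.k, other.k), theta, common)

# Hypothetical user IDs per dimension; the true overlap is IDs 500..599 (100 users).
ios_users = range(0, 600)
us_users = range(300, 900)
age_30_45 = range(500, 1200)

def segment_size(k):
    """Estimate how many users are iOS AND in the US AND aged 30-45."""
    sketches = [ThetaSketch.from_items(s, k) for s in (ios_users, us_users, age_30_45)]
    return sketches[0].intersect(sketches[1]).intersect(sketches[2]).estimate()

exact_mode = segment_size(k=4096)   # every set fits in the sketch
approx_mode = segment_size(k=256)   # sets are sampled, so the answer is an estimate
```

With a large enough `k` the sketch holds every hash and the answer is exact; with a smaller `k` it trades accuracy for a fixed memory footprint, which is the property that makes segment counts cheap at high scale.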
Streaming Analytics Manager Demo
Extensibility:
SAM Software Development Kit
Extensibility with SAM SDK
 Custom Processor - allows users to write their own business logic
Extensibility with SAM SDK
 Multi-lang support (upcoming)
Extensibility with SAM SDK
 UDAFs - compute aggregates within a window
Built in functions
 STDDEV
 STDDEVP
 VARIANCE
 VARIANCEP
 MEAN
 MIN
 MAX
 SUM
 COUNT
 UPPER
 LOWER
 INITCAP
 SUBSTRING
 CHAR_LENGTH
 CONCAT
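A windowed aggregate is usually written as three steps: create an empty state, fold each value into it, and produce the result when the window closes. The sketch below shows that shape in plain Python for a standard deviation. It does not reproduce the SAM SDK's Java interfaces; the class and function names are hypothetical, and it assumes STDDEV is the sample form and STDDEVP the population form, as the names suggest.

```python
import math

class StddevAggregate:
    """init / add / result steps for a windowed standard deviation (Welford's method)."""

    def init(self):
        # Aggregate state: (count, running mean, sum of squared deviations)
        return (0, 0.0, 0.0)

    def add(self, state, value):
        """Fold one value into the running state without storing the window."""
        count, mean, m2 = state
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
        return (count, mean, m2)

    def result(self, state, population=False):
        """Sample standard deviation by default, population form when requested."""
        count, _, m2 = state
        divisor = count if population else count - 1
        if divisor <= 0:
            return None  # not enough values in the window
        return math.sqrt(m2 / divisor)

def aggregate_window(udaf, values, **kwargs):
    """Apply an aggregate to all values that fell into one window."""
    state = udaf.init()
    for value in values:
        state = udaf.add(state, value)
    return udaf.result(state, **kwargs)

# Hypothetical speed readings from one window.
window_values = [2, 4, 4, 4, 5, 5, 7, 9]
sample_stddev = aggregate_window(StddevAggregate(), window_values)
population_stddev = aggregate_window(StddevAggregate(), window_values, population=True)
```

Carrying only a small state tuple, rather than the raw values, is what lets an aggregate like this run over large windows.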
Extensibility with SAM SDK
 UDFs - perform simple transformations
Built in functions
 STDDEV
 STDDEVP
 VARIANCE
 VARIANCEP
 MEAN
 MIN
 MAX
 SUM
 COUNT
 UPPER
 LOWER
 INITCAP
 SUBSTRING
 CHAR_LENGTH
 CONCAT
Extensibility with SAM SDK
 Notifier - sends notifications such as Email, SMS or more complex ones that can invoke external APIs
Built in notifiers
 Email
 More in future…
Questions?
Learn More about Streaming Analytics Manager (SAM)
 Download the tutorial – https://hortonworks.com/tutorial
– Real-Time Event Processing in NiFi, SAM, Schema Registry and Superset
 Blogs – https://hortonworks.com/blog
– Hortonworks Thoughts on Building A Successful Streaming Analytics Platform
 Hortonworks Community – https://community.hortonworks.com
 GitHub - https://github.com/hortonworks/streamline



Editor's Notes

  • #2: TALK TRACK Hello, my name is [NAME] and I want to thank you for taking time to speak with me today. Hortonworks Powers the Future of Data: data-in-motion, data-at-rest, and Modern Data Applications. Today, I’ll tell you how we do that and how you can transform your business by managing your data with Hortonworks Connected Data platforms. [NEXT SLIDE]