1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache NiFi and Stream Processing
Dhruv Kumar
Sr. Solutions Architect
Simplistic View of Enterprise Data Flow
Store Data
Process and Analyze Data
Acquire Data
Dataflow
Realistic View of Enterprise Data Flow
[Diagram with seven elements marked "?"]
Basics of Connecting Systems
For every connection, these must agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
[Diagram: P1 (Producer) connected to C1 (Consumer)]
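The agreement check on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not NiFi code: the `Endpoint` class, its three fields (a subset of the eight items listed), and the sample values are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical description of one side of a connection."""
    protocol: str
    data_format: str
    schema: str


def mismatches(producer: Endpoint, consumer: Endpoint) -> list[str]:
    """Return the names of the properties on which the two sides disagree."""
    return [
        f.name
        for f in fields(Endpoint)
        if getattr(producer, f.name) != getattr(consumer, f.name)
    ]


# Made-up producer P1 and consumer C1 that agree only on the protocol.
p1 = Endpoint(protocol="http", data_format="json", schema="order-v2")
c1 = Endpoint(protocol="http", data_format="avro", schema="order-v1")
print(mismatches(p1, c1))  # ['data_format', 'schema']
```

Every name returned by `mismatches` is a point where something between producer and consumer has to translate, which is the role the rest of the deck assigns to the dataflow layer.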
Apache NiFi: The three key concepts
• Manage the flow of information
• Data Provenance
• Secure the control plane and data plane
Visual Command & Control
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors and connections
Apache NiFi – Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
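Two of the features above, backpressure and prioritized queuing, can be illustrated with a small bounded queue. This is a rough sketch of the general technique only; the class name, the `max_items` limit, and the "lower number means more urgent" convention are assumptions made for the example and do not describe how NiFi implements its connections.

```python
import heapq
import itertools


class BoundedPriorityQueue:
    """Illustrative queue: bounded in size, served in priority order."""

    def __init__(self, max_items: int):
        self.max_items = max_items
        self._heap = []
        # Tie-breaker so items with equal priority leave in arrival order.
        self._counter = itertools.count()

    def offer(self, priority: int, item) -> bool:
        """Accept the item, or return False to signal backpressure."""
        if len(self._heap) >= self.max_items:
            return False
        heapq.heappush(self._heap, (priority, next(self._counter), item))
        return True

    def poll(self):
        """Remove and return the most urgent item (lowest priority number).

        Raises IndexError if the queue is empty.
        """
        _, _, item = heapq.heappop(self._heap)
        return item


queue = BoundedPriorityQueue(max_items=2)
accepted = [
    queue.offer(5, "bulk log"),
    queue.offer(1, "alert"),
    queue.offer(3, "metric"),  # queue is full, so this one is refused
]
first = queue.poll()
print(accepted, first)  # [True, True, False] alert
```

A refused `offer` is the backpressure signal: the upstream side has to slow down or hold the item until the queue drains.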
Brief history of the Apache NiFi Community
Matured at NSA 2006-2014
• 2006: Code developed at NSA
• December 2014: Code available open source (ASL v2)
• July 2015: Achieved TLP status in just 7 months
• Today
In 11 months…
• 6 releases, targeting a 6-8 week release cycle
• 55 code contributors
• 125 pull requests via GitHub
• 1170 JIRAs filed
• 153 new in last two months, with more in pipeline
• Dev mailing list: 182 subscribers producing ~100 emails/week
• Users mailing list*: 165 subscribers producing ~40 emails/week
• 13 committers / PMC members
• Affiliations: Hortonworks, Twitter, Cloudera, US Government, Defense Contractors, etc.
*Only 5 months old
Flow Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
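To make the vocabulary concrete, here is a toy model of the terms in Python. It is a sketch for intuition, not the NiFi API: `FlowFile`, `Connection`, `UppercaseProcessor`, and `run_flow` are hypothetical names that mirror the table, and the real Flow Controller also manages threads and scheduling, which this loop does not attempt.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Information packet: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Buffer between processors, modelled as a simple FIFO queue."""

    def __init__(self):
        self._queue = deque()

    def put(self, flow_file: FlowFile) -> None:
        self._queue.append(flow_file)

    def take(self):
        """Return the next FlowFile, or None if the queue is empty."""
        return self._queue.popleft() if self._queue else None


class UppercaseProcessor:
    """Black box that transforms content and passes attributes along."""

    def __init__(self, inbound: Connection, outbound: Connection):
        self.inbound = inbound
        self.outbound = outbound

    def on_trigger(self) -> bool:
        """Process one FlowFile; return False when there was no work."""
        flow_file = self.inbound.take()
        if flow_file is None:
            return False
        self.outbound.put(
            FlowFile(flow_file.content.upper(), dict(flow_file.attributes))
        )
        return True


def run_flow(processors) -> None:
    """Stand-in for the scheduler: trigger processors until all are idle."""
    progressed = True
    while progressed:
        progressed = False
        for processor in processors:
            if processor.on_trigger():
                progressed = True


source, sink = Connection(), Connection()
source.put(FlowFile(b"hello", {"filename": "a.txt"}))
run_flow([UppercaseProcessor(source, sink)])
result = sink.take()
print(result.content, result.attributes)  # b'HELLO' {'filename': 'a.txt'}
```

Because processors only talk to connections, they can run at different rates and be recombined freely, which is the point of the Process Group row in the table.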
Architecture
Standalone NiFi node (one JVM on an OS/Host):
• Web Server
• Flow Controller hosting Processor 1 … Extension N
• FlowFile Repository, Content Repository, and Provenance Repository on local storage
NiFi cluster:
• Master: NiFi Cluster Manager (NCM), a JVM on an OS/Host with a Web Server, acting as request replicator
• Slaves: NiFi nodes, each with the same layout as the standalone node
Apache NiFi’s uses are many…
What is Apache NiFi used for?
• Reliable and secure transfer of data between systems
• Delivery of data from sources to analytic platforms
• Enrichment and preparation of data:
– Conversion between formats
– Extraction/Parsing
– Routing decisions
What is Apache NiFi NOT used for?
• Distributed Computation
• Complex Event Processing
• Joins / Complex Rolling Window Operations
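The three preparation tasks listed above (conversion between formats, extraction/parsing, routing decisions) can be sketched with the Python standard library. This is an illustration only, assuming a made-up CSV layout with `level` and `message` columns and made-up destination names `alerts` and `archive`; in NiFi these steps would be configured as processors rather than written by hand.

```python
import csv
import io
import json


def csv_to_records(text: str) -> list[dict]:
    """Extraction/parsing: turn CSV text into one dict per row."""
    return list(csv.DictReader(io.StringIO(text)))


def route(record: dict) -> str:
    """Routing decision: pick a destination from a field value."""
    return "alerts" if record["level"] == "ERROR" else "archive"


raw = "level,message\nINFO,started\nERROR,disk full\n"
routed = {"alerts": [], "archive": []}
for record in csv_to_records(raw):
    # Conversion between formats: each CSV row is re-emitted as JSON.
    routed[route(record)].append(json.dumps(record))

print(routed["alerts"])
```

Note that each record is handled on its own. Nothing here joins records or keeps windowed state, which matches the "NOT used for" list on the slide.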
HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges
Collect: Bring Together
Aggregate all IoAT data from sensors, geo-location devices, machines, logs, files, and feeds via a highly secure lightweight agent
• Logs
• Files
• Feeds
• Sensors
Conduct: Mediate the Data Flow
Mediate point-to-point and bi-directional data flows, delivering data reliably to real-time applications and storage platforms such as HDP
• Deliver
• Secure
• Govern
• Audit
Curate: Gain Insights
Parse, filter, join, transform, fork, and clone data in motion to empower analytics and perishable insights
• Parse
• Filter
• Transform
• Fork
• Clone
HDP + HDF Create Modern Data Apps
[Diagram labels: DATA AT REST; HDF DATA IN MOTION; ACTIONABLE INTELLIGENCE]
MODERN DATA APPS
• Real-Time Cyber Security protects systems with superior threat detection
• Smart Manufacturing dramatically improves yields by managing more variables in greater detail
• Connected, Autonomous Cars drive themselves and improve road safety
• Future Farming optimizes soil, seeds and equipment to measured conditions on each square foot
• Automatic Recommendation Engines match products to preferences in milliseconds
Streaming Architectures
Drive Data to Core for Analysis
• Drive data from sources to central data center for analysis
• Tiered collection approach at various locations, think regional data centers
[Diagram labels: Edge (MiNiFi) ×2; Core (NiFi); Stream Processing; Batch Analytics]
Dynamically Adjusting Data Flows
• Push contents back to core NiFi
• Push results back to edge locations/devices to change behavior
[Diagram labels: Edge (MiNiFi) ×2; Core (NiFi); Stream Processing; Batch Analytics]
Hortonworks DataFlow Reference Architecture
• Tiered processing framework
• Bi-directional communication
• Data prioritization
• Interactive command & control in the center, design & deploy on the edge
[Diagram labels]
Retail Store: Gateway Server (MiNiFi); Mobile (Client Libraries); Freezer (Client Libraries); Server Cluster (NiFi); Register (MiNiFi)
Regional Center: NiFi ×2; Kafka
Core Data Center: Server Cluster (NiFi ×3); Others: Storm, Kafka, Spark/Flink/etc.; AWS; Azure; Google Cloud; DB; Data WH
Hortonworks DataFlow Reference Architecture
• Campaign management: coupons/promotions/etc.
• Location based services
[Diagram labels]
Retail Store: Gateway Server (MiNiFi); Mobile (Client Libraries); Freezer (Client Libraries); Server Cluster (NiFi); Register (MiNiFi)
Regional Center: NiFi ×2; Kafka; Storm
Core Data Center: Server Cluster (NiFi ×3); Others: Kafka, Spark/Flink/etc.; AWS; Azure; Google Cloud; DB; Data WH
Hortonworks DataFlow Reference Architecture
• Transaction processing
• Fraud detection
[Diagram labels]
Retail Store: Gateway Server (MiNiFi); Mobile (Client Libraries); Freezer (Client Libraries); Server Cluster (NiFi); Register (MiNiFi)
Regional Center: NiFi ×2; Kafka; Storm
Core Data Center: Server Cluster (NiFi ×3); Others: Kafka, Spark/Flink/etc.; AWS; Azure; Google Cloud; DB; Data WH
Hortonworks DataFlow Reference Architecture
• Complex processing and cloud computing
• Historical data analytics based on nightly updates
[Diagram labels]
Retail Store: Gateway Server (MiNiFi); Mobile (Client Libraries); Freezer (Client Libraries); Server Cluster (NiFi); Register (MiNiFi)
Regional Center: NiFi ×2; Kafka; Storm
Core Data Center: Server Cluster (NiFi ×3); Others: Kafka, Spark/Flink/etc.; AWS; Azure; Google Cloud; DB; Data WH
Apache NiFi vs Kafka
NiFi
Good for data traceability and flow management
• Interactive command and control – real-time operational visibility
• Data provenance – real-time visual chain of custody
• Low scripting maintenance
⚠ Requires adding/removing processors according to consumer-side updates
Kafka
Good for a large number of consumers and dynamic consumer-side updates
• Low latency
• Great data durability
• Supports a large number of producers/consumers
⚠ Not optimized to manage dataflows (prioritization, enrichment, protocols, formats, event-level authorizations, objects of various sizes, etc.)
Apache NiFi vs Storm
NiFi
Good for data traceability, flow management, and enrichment
• Data provenance – real-time visual chain of custody
• Security – end-to-end secure routing with event-level authorization
• Simple event processing
⚠ Scaling model only allows processor-level workload to be distributed evenly across worker nodes
Storm
Good for streaming analytics
• Complex event processing
• Flexible scaling model: workload distribution can be specified on demand at the bolt level
⚠ Not designed to manage data flows
In a nutshell…
[Diagram labels: NiFi; Hadoop; HDFS; HBase; Hive; SOLR; YARN; Storm; Spark; Service Management / Workflow; SIEM; Raw Network Stream; Network Metadata Stream; Data Stores; Syslog; Raw Application Logs; Other Streaming Telemetry]
Fundamental Principles of Streaming Architectures
Key Tenets of Lambda Architecture
• Batch Layer
  - Manages master data
  - Immutable, append-only set of raw data
  - Cleanse, Normalize & Pre-Compute Batch Views
  - Advanced Statistical Calculations
• Speed Layer
  - Real Time Event Stream Processing
  - Computes Real-Time Views
• Serving Layer
  - Low-latency, ad-hoc query
  - Reporting, BI & Dashboard
[Diagram labels: New Data Stream; Store; Pre-Compute Views; Process Streams; Incremental Views; Business View ×2; Query; SPEED LAYER; BATCH LAYER; SERVING LAYER; HDP and HDF]
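The three layers can be illustrated with a tiny in-memory example. This is a sketch of the Lambda pattern under simplifying assumptions, not a description of how HDP or HDF implement it: the event shape (`{"sensor": ...}`), the function names, and the use of a per-sensor count as the "view" are all made up for illustration.

```python
from collections import Counter


def precompute_batch_view(master_data) -> Counter:
    """Batch layer: recompute a view from the full, append-only raw data."""
    return Counter(event["sensor"] for event in master_data)


def compute_realtime_view(recent_events) -> Counter:
    """Speed layer: view over events that arrived since the last batch run."""
    return Counter(event["sensor"] for event in recent_events)


def query(batch_view: Counter, realtime_view: Counter, sensor: str) -> int:
    """Serving layer: answer a query by merging the two views."""
    return batch_view[sensor] + realtime_view[sensor]


# Hypothetical data: three events already in the master data set,
# one event that arrived after the last batch computation.
master_data = [{"sensor": "a"}, {"sensor": "a"}, {"sensor": "b"}]
recent_events = [{"sensor": "a"}]

batch_view = precompute_batch_view(master_data)
realtime_view = compute_realtime_view(recent_events)
print(query(batch_view, realtime_view, "a"))  # 3
```

The batch view is complete but stale, and the real-time view is fresh but partial; merging them at query time is what gives the serving layer low-latency answers over all data.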
Detailed Reference Architecture for IoT Applications
[Diagram labels]
SOURCE DATA: Server logs; Application Logs; Firewall Logs; CRM/ERP; Sensor
Other labels: HDF; Flume; SQOOP; Kafka; Stream to HDF; Forward to Storm; Sink to HDFS; Bolt to HDFS; Transform; Storm; Storm/Spark Streaming; Event Enrichment; HBase; HBase/Phoenix; Real Time Storage; HDFS; Hive; HiveServer; Pig; Spark; Spark-ML; Spark-Thrift; Iterative ML; Machine Learning Models; Alerts; JMS; Silk; Dashboard; Interactive UI Framework; Reporting; BI Tools; High Speed Ingest; Real-Time; Batch; Interactive
Demo!

Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flows and IoT apps using Apache NiFi - Dhruv Kumar, Senior Solutions Architect - Hortonworks

  • 1. 1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi and Stream Processing Dhruv Kumar Sr. Solutions Architect
  • 2. Page2 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Simplistic View of Enterprise Data Flow Store Data Process and Analyze Data Acquire Data Dataflow
  • 3. Page3 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Realistic View of Enterprise Data Flow ? ? ? ? ? ? ?
  • 4. Page4 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Basics of Connecting Systems For every connection, these must agree: 1. Protocol 2. Format 3. Schema 4. Priority 5. Size of event 6. Frequency of event 7. Authorization access 8. Relevance P1 Producer C1 Consumer
  • 5. Page5 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Apache NiFi: The three key concepts • Manage the flow of information • Data Provenance • Secure the control plane and data plane
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Visual Command & Control • Drag and drop processors to build a flow • Start, stop, and configure components in real time • View errors and corresponding error messages • View statistics and health of data flow • Create templates of common processor & connections
  • 7. Page7 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Apache NiFi – Key Features • Guaranteed delivery • Data buffering - Backpressure - Pressure release • Prioritized queuing • Flow specific QoS - Latency vs. throughput - Loss tolerance • Data provenance • Recovery/recording a rolling log of fine- grained history • Visual command and control • Flow templates • Pluggable/multi-role security • Designed for extension • Clustering
  • 8. Page8 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Matured at NSA 2006-2014 Brief history of the Apache NiFi Community Code developed at NSA 2006 Today Achieved TLP status in just 7 months July 2015 Dev mailing list Users mailing list* 182 subscribers producing ~100 emails/week 165 subscribers producing ~40 emails/week 55 125 1170 Code contributors Pull requests via Github JIRAs Filed. Code available open source ASL v2 December 2014 *Only 5 months old In 11 months… 6Targeting a 6-8 week release cycle Releases 153 new in last two months With more in pipeline Committers 13 PMC Members Affiliations Hortonworks, Twitter, Cloudera, US Government, Defense Contractors, etc.
  • 9. Page9 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Flow Based Programming (FBP) FBP Term NiFi Term Description Information Packet FlowFile Each object moving through the system. Black Box FlowFile Processor Performs the work, doing some combination of data routing, transformation, or mediation between systems. Bounded Buffer Connection The linkage between processors, acting as queues and allowing various processes to interact at differing rates. Scheduler Flow Controller Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use. Subnet Process Group A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new component simply by composition of its components.
  • 10. Page10 © Hortonworks Inc. 2011 – 2015. All Rights Reserved OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage Architecture OS/Host JVM NiFi Cluster Manger – Request Replicator Web Server Master NiFi Cluster Manager (NCM) OS/Host JVM Flow Controller Web Server Processor 1 Extension N FlowFile Repository Content Repository Provenance Repository Local Storage Slaves NiFi Nodes
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache NiFi’s uses are many… What is Apache NiFi used for? • Reliable and secure transfer of data between systems • Delivery of data from sources to analytic platforms • Enrichment and preparation of data: – Conversion between formats – Extraction/Parsing – Routing decisions What is Apache NiFi NOT used for? • Distributed Computation • Complex Event Processing • Joins / Complex Rolling Window Operations
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges Aggregate all IoAT data from sensors, geo-location devices, machines, logs, files, and feeds via a highly secure lightweight agent Collect: Bring Together• Logs • Files • Feeds • Sensors Mediate point-to-point and bi-directional data flows, delivering data reliably to real-time applications and storage platforms such as HDP Conduct: Mediate the Data Flow• Deliver • Secure • Govern • Audit Parse, filter, join, transform, fork, and clone data in motion to empower analytics and perishable insights Curate: Gain Insights• Parse • Filter • Transform • Fork • Clone
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDP + HDF Create Modern Data Apps DATA AT REST HDF DATA IN MOTION ACTIONABLE INTELLIGENCE MODERN DATA APPS Real-Time Cyber Security protects systems with superior threat detection Smart Manufacturing dramatically improves yields by managing more variables in greater detail Connected, Autonomous Cars drive themselves and improve road safety Future Farming optimizing soil, seeds and equipment to measured conditions on each square foot Automatic Recommendation Engines match products to preferences in milliseconds
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Streaming Architectures
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Drive Data to Core for Analysis NiFi Stream Processing MiNiFi MiNiFi • Drive data from sources to central data center for analysis • Tiered collection approach at various locations, think regional data centers Edge Edge Core Batch Analytics
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Dynamically Adjusting Data Flows • Push contents back to core NiFi • Push results back to edge locations/devices to change behavior NiFi MiNiFi MiNiFi Edge Edge Core Batch Analytics Stream Processing
  • 17. Hortonworks DataFlow Reference Architecture • Tiered processing framework • Bi-directional communication • Data prioritization • Interactive command & control in the center, design & deploy on the edge [Diagram labels: Retail Store, Gateway Server, MiNiFi, Mobile Client Libraries, Freezer Client Libraries, Register, Regional Center, Server Cluster, NiFi, Kafka, Core Data Center, Storm, Spark/Flink/etc., Others, AWS, Azure, Google Cloud, DB, Data WH]
  • 18. Hortonworks DataFlow Reference Architecture • Campaign management: coupons/promotions/etc. • Location based services [Diagram labels: Retail Store, Gateway Server, MiNiFi, Mobile Client Libraries, Freezer Client Libraries, Register, Regional Center, Server Cluster, NiFi, Kafka, Storm, Core Data Center, Spark/Flink/etc., Others, AWS, Azure, Google Cloud, DB, Data WH]
  • 19. Hortonworks DataFlow Reference Architecture • Transaction processing • Fraud detection [Diagram labels: Retail Store, Gateway Server, MiNiFi, Mobile Client Libraries, Freezer Client Libraries, Register, Regional Center, Server Cluster, NiFi, Kafka, Storm, Core Data Center, Spark/Flink/etc., Others, AWS, Azure, Google Cloud, DB, Data WH]
  • 20. Hortonworks DataFlow Reference Architecture • Complex processing and cloud computing • Historical data analytics based on nightly updates [Diagram labels: Retail Store, Gateway Server, MiNiFi, Mobile Client Libraries, Freezer Client Libraries, Register, Regional Center, Server Cluster, NiFi, Kafka, Storm, Core Data Center, Spark/Flink/etc., Others, AWS, Azure, Google Cloud, DB, Data WH]
  • 21. Apache NiFi vs Kafka NiFi: Good for data traceability and flow management • Interactive command and control – real time operational visibility • Data provenance – real time visual chain of custody • Low scripting maintenance ⚠ Requires adding/removing processors according to consumer-side updates Kafka: Good for a large number of consumers and dynamic consumer-side updates • Low latency • Great data durability • Supports a large number of producers/consumers ⚠ Not optimized to manage dataflows (prioritization, enrichment, protocols, formats, event level authorizations, objects with various sizes, etc.)
  • 22. Apache NiFi vs Storm NiFi: Good for data traceability, flow management, and enrichment • Data provenance – real time visual chain of custody • Security – end-to-end secure routing with event level authorization • Simple event processing ⚠ Scaling model only allows processor-level workload to be distributed evenly across worker nodes Storm: Good for streaming analytics • Complex event processing • Flexible scaling model, allowing workload distribution to be specified on demand at the bolt level ⚠ Not designed to manage data flows
  • 23. In a nutshell… [Diagram labels: Raw Network Stream, Network Metadata Stream, Data Stores, Syslog, Raw Application Logs, Other Streaming Telemetry, NiFi, Hadoop, HDFS, HBase, Hive, SOLR, YARN, Storm, Spark, Service Management / Workflow, SIEM]
  • 24. Fundamental Principles of Streaming Architectures: Key Tenets of Lambda Architecture • Batch Layer – Manages master data – Immutable, append-only set of raw data – Cleanse, Normalize & Pre-Compute Batch Views – Advanced Statistical Calculations • Speed Layer – Real Time Event Stream Processing – Computes Real-Time Views • Serving Layer – Low-latency, ad-hoc query – Reporting, BI & Dashboard [Diagram labels: New Data Stream, Store, Pre-Compute Views, Process Streams, Incremental Views, Business View, Query, SPEED LAYER, BATCH LAYER, SERVING LAYER, HDP and HDF]
  • 25. Detailed Reference Architecture for IoT Applications [Diagram labels: SOURCE DATA (Server logs, Application Logs, Firewall Logs, CRM/ERP, Sensor), HDF, Kafka, Flume, Sqoop, Stream to HDF, Forward to Storm, Storm/Spark Streaming, Storm, Transform, Event Enrichment, Sink to HDFS, Bolt to HDFS, HDFS, Hive, HBase/Phoenix, Real Time Storage, Spark, Spark-ML, Spark-Thrift, Pig, Iterative ML, Machine Learning Models, Alerts, JMS Alerts, Dashboard, Silk, Interactive UI Framework, HiveServer, Reporting, BI Tools, High Speed Ingest, Real-Time, Batch, Interactive]
  • 26. Demo!
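
The three Lambda Architecture layers on slide 24 can be illustrated with a toy example. This is only a minimal sketch under stated assumptions, not how HDP or HDF implement them: the class name `LambdaSketch`, its methods, and the choice of per-key event counts as the "view" are all hypothetical.

```python
from collections import Counter


class LambdaSketch:
    """Toy Lambda Architecture that counts events per key (hypothetical example)."""

    def __init__(self):
        self.master = []             # batch layer: immutable, append-only raw data
        self.batch_view = Counter()  # batch layer: view pre-computed from master data
        self.speed_view = Counter()  # speed layer: incremental view of recent events

    def ingest(self, key):
        """A new event is stored in the master data and also hits the speed layer."""
        self.master.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from all raw data, then reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        """Serving layer: merge the batch view with the real-time view."""
        return self.batch_view[key] + self.speed_view[key]


sketch = LambdaSketch()
for event in ["sensor-a", "sensor-a", "sensor-b"]:
    sketch.ingest(event)
count_before_batch = sketch.query("sensor-a")  # answered by the speed view only
sketch.run_batch()
sketch.ingest("sensor-a")
count_after_batch = sketch.query("sensor-a")   # batch view plus one recent event
```

The point of the split is that queries stay correct between batch runs: the speed view covers only the events the batch view has not yet absorbed.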

Editor's Notes

  • #10: Introduce Flow Based Programming fundamentals, why they matter, and how NiFi adopts them
  • #11: Introduce the architecture of NiFi, describe major system components, and describe the single node and clustering models. For each component, describe its available (and potential) deployment models (relate it to Hadoop).
  • #13: HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges. HDF provides 3 key capabilities: the ability to collect data from different types of data sources via a highly secure lightweight agent, the ability to mediate the data flow between the data source and the “collector”, and the ability to trace, parse, and transform data in motion to enable analytics and derive insights within an operationally relevant time window. Systems fail: networks fail, disks fail, software crashes, people make mistakes. Data access exceeds capacity to consume: sometimes a given data source can outpace some part of the processing or delivery chain, and it only takes one weak link to have an issue. Boundary conditions are mere suggestions: you will invariably get data that is too big, too small, too fast, too slow, corrupt, wrong, or in the wrong format. What is noise one day becomes signal the next: priorities of an organization change rapidly, so enabling new flows and changing existing ones must be fast. Systems evolve at different rates: the protocols and formats used by a given system can change anytime and often irrespective of the systems around them. Dataflow exists to connect what is essentially a massively distributed system of components that are loosely or not at all designed to work together. Compliance and security: laws, regulations, and policies change. Business to business agreements change. System to system and system to user interactions must be secure, trusted, and accountable. Continuous improvement occurs in production: it is often not possible to come even close to replicating production environments in the lab.
  • #14: TALK TRACK Here are just a few of the modern data apps that convert yesterday’s impossible challenges into today’s new products, cures, conveniences and life-saving innovations. These apps are either custom-built by our customers or they come off the shelf, created by Hortonworks or one of our ecosystem partners to solve a particular problem. Symantec and other cyber security leaders have built powerful apps to detect threats to digital information. Leading pharma, automotive, consumer electronics and packaged goods companies are building their factories of the future that use actionable intelligence to improve manufacturing yields. And age-old industries like automotive, agriculture and retail are taking connected data platforms on the road, through the field or to the cash register to do things that have never before been possible. [NEXT SLIDE]
  • #18: Tiered processing framework: it is often not necessary to centralize everything back to the data center. Processing can happen in regional offices as well as on the edge devices, for efficiency (fraud detection logic defined in branch offices, etc.). Bi-directional communication: real-time analytical results can be pushed back to the edge to adjust flow behavior accordingly. Examples: prioritize data collection based on real-time bandwidth (calculated in the DC with Flink jobs); for fraud detection, send triggering events back to the edge to block transactions in real time. Data prioritization: prioritize the data flow. Example: higher priority data can be sent back via LTE, while lower priority data can wait until wifi becomes available. Interactive vs design/deploy: in the data center, flows are complex and interactive command and control lets users fix pipes without shutting down the water; design a data flow with a visual interface in the DC and push it to multiple MiNiFi agents with one click (also providing a centralized place to version control flows on all the agents).
  • #24: CapOne: ingesting from everywhere (email, Syslog, Applog, NetFlow…). Moving to a “Cloud Only” model, even looking to use Docker containers in Amazon…
  • #25: Roll forward a few years, Hadoop today provides a complete platform to address the batch, serving and speed layers of the Lambda Architecture.
  • #26: The team puts together a detailed architecture of the proposed solution using HDP and HDF. The architecture considers data from numerous sources, including server logs, application logs, XML, and sensor data. This data is easily accepted into the flexible schema of HDP using HDF and Sqoop. The data is processed using Pig and analyzed using Spark. Then the data is made available in a real-time dashboard as well as to visualization and reporting tools. [NEXT SLIDE]
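
The data prioritization example in note #18 (higher priority data goes out over LTE, lower priority data waits for wifi) can be sketched as a small priority queue. This is an illustrative sketch only, not NiFi or MiNiFi code: the `EdgeQueue` class, the `HIGH`/`LOW` levels, and the rule that everything uses wifi whenever it is available are assumptions made for the example.

```python
import heapq
from itertools import count

HIGH, LOW = 0, 1  # hypothetical priority levels; a smaller number is sent first


class EdgeQueue:
    """Toy edge-side queue that holds low priority data until wifi is available."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that keeps FIFO order within a priority

    def put(self, priority, payload):
        heapq.heappush(self._heap, (priority, next(self._seq), payload))

    def drain(self, wifi_available):
        """Send what the current link allows and return (link, payload) pairs."""
        sent = []
        while self._heap:
            priority, _, payload = self._heap[0]
            if priority != HIGH and not wifi_available:
                break  # only low priority data remains; it waits for wifi
            heapq.heappop(self._heap)
            link = "wifi" if wifi_available else "lte"  # assumption: prefer wifi
            sent.append((link, payload))
        return sent


queue = EdgeQueue()
queue.put(LOW, "freezer-telemetry")
queue.put(HIGH, "fraud-alert")
queue.put(LOW, "debug-log")
sent_on_lte = queue.drain(wifi_available=False)
sent_on_wifi = queue.drain(wifi_available=True)
```

Because the heap is ordered by priority, the loop can stop at the first low priority item when only LTE is available: everything behind it is low priority too.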