1 © Hortonworks Inc. 2011–2018. All rights reserved
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Andrew Psaltis
Regional CTO APAC
Traditional ETL
[Diagram labels: 1011010, File, DB; seconds, hours]
Streaming ETL
[Diagram labels: 1011010, DB; seconds]
Reference Streaming Architecture
Apache NiFi
Simplistic View of Dataflows: Easy, Definitive
[Diagram labels: Acquire Data, Store Data, Data Flow, Process/Analyze Data]
Standards: http://xkcd.com/927/
Realistic View of Dataflows: Complex, Convoluted
[Diagram labels: Acquire Data (x4), Store Data (x5), Data Flow, Process and Analyze Data]
The National Security Agency Years
• Created in 2006
• Improved over eight years
• Simple initial vision – Visio for real-time dataflow management
• National Security Agency donated the codebase to the ASF in late 2014
Apache NiFi
Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
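The data buffering and backpressure items can be illustrated with a small Python sketch. This is not NiFi code: `Connection`, `offer`, `poll`, and the threshold value are hypothetical names chosen for illustration. The idea shown is that a queue between two components refuses new items once it reaches a configured threshold, so the upstream side has to slow down until the downstream side drains it.

```python
from collections import deque


class Connection:
    """Toy model of a queue between two processors (hypothetical, not the NiFi API)."""

    def __init__(self, backpressure_threshold):
        self.backpressure_threshold = backpressure_threshold
        self.queue = deque()

    def offer(self, item):
        """Accept an item unless the queue is full; False signals backpressure."""
        if len(self.queue) >= self.backpressure_threshold:
            return False
        self.queue.append(item)
        return True

    def poll(self):
        """Hand the oldest queued item to the downstream processor, or None if empty."""
        return self.queue.popleft() if self.queue else None


conn = Connection(backpressure_threshold=2)
accepted = [conn.offer(x) for x in ("a", "b", "c")]  # the third offer is refused
first = conn.poll()                                  # downstream drains one item
retry = conn.offer("c")                              # there is room again
```

A real flow would also decide what the upstream component does while it is being refused (wait, buffer locally, or drop), which is where the loss-tolerance setting comes in.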
Visual Command and Control
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors and connections
Provenance/Lineage
Prioritization
• Configure a prioritizer per connection
• Determine what is important for your data – time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
• Develop your own prioritizer if needed
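A per-connection prioritizer can be pictured as a queue ordered by a pluggable key function. The Python sketch below is illustrative only: `PrioritizedQueue`, the `importance` field, and the sample items are assumptions and do not come from NiFi.

```python
import heapq
import itertools


class PrioritizedQueue:
    """Toy connection queue ordered by a pluggable prioritizer (hypothetical names)."""

    def __init__(self, prioritizer):
        self.prioritizer = prioritizer      # function: item -> sortable key, lowest first
        self._heap = []
        self._arrival = itertools.count()   # tie-breaker that preserves arrival order

    def put(self, item):
        heapq.heappush(self._heap, (self.prioritizer(item), next(self._arrival), item))

    def get(self):
        return heapq.heappop(self._heap)[2]


# Prioritize by an assumed "importance" field; a lower number is more important.
queue = PrioritizedQueue(prioritizer=lambda flowfile: flowfile["importance"])
for flowfile in (
    {"name": "bulk-logs", "importance": 5},
    {"name": "alerts", "importance": 1},
    {"name": "metrics", "importance": 3},
):
    queue.put(flowfile)

order = [queue.get()["name"] for _ in range(3)]
```

Swapping the key function (for example, to an arrival timestamp) changes the policy without touching the queue, which mirrors the "develop your own prioritizer" point.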
Latency vs. Throughput
• Choose between lower latency and higher throughput on each processor
NiFi Positioning
Apache NiFi / MiNiFi
• ETL (Informatica, etc.)
• Enterprise Service Bus (Fuse, Mule, etc.)
• Messaging Bus (Kafka, MQ, etc.)
• Processing Framework (Storm, Spark, etc.)
Apache NiFi / Processing Frameworks
NiFi
Simple event processing
• Primarily feeds data into processing frameworks; can process data, with a focus on simple event processing
• Operates on a single piece of data, or in correlation with an enrichment dataset (enrichment, parsing, splitting, and transformations)
• Can scale out, but scales up better to take full advantage of hardware resources by running concurrent processing tasks/threads (processing terabytes of data per day on a single node)
Not another distributed processing framework; it feeds data into those
Processing Frameworks (Storm, Spark, etc.)
Complex and distributed processing
• Complex processing from multiple streams (JOIN operations)
• Analyzing data across time windows (rolling window aggregation, standard deviation, etc.)
• Scale out to thousands of nodes if needed
Not designed to collect data or manage data flow
Apache NiFi / Messaging Bus Services
NiFi
Provide dataflow solution
• Centralized management, from edge to core
• Great traceability, event level data provenance starting when data is born
• Interactive command and control – real time operational visibility
• Dataflow management, including prioritization, back pressure, and edge intelligence
• Visual representation of global dataflow
Not a messaging bus, flow maintenance needed when you have frequent consumer side updates
Messaging Bus (Kafka, JMS, etc.)
Provide messaging bus service
• Low latency
• Great data durability
• Decentralized management (producers & consumers)
• Low broker maintenance for dynamic consumer side updates
Not designed to solve dataflow problems (prioritization, edge intelligence, etc.)
Traceability limited to in/out of topics, no lineage
Lack of global view of components/connectivities
Apache NiFi / Integration, or ingestion, Frameworks
NiFi
End user facing dataflow management tool
• Out of the box solution for dataflow management
• Interactive command and control in the core, design and deploy on the edge
• Flexible failure handling at each point of the flow
• Visual representation of global dataflow and connectivities
• Native cross data center communication
• Data provenance for traceability
Not a library to be embedded in other applications
Integration framework (Spring Integration, Camel, etc.), ingestion framework (Flume, etc.)
Developer facing integration tool with a focus on data ingestion
• A set of tools to orchestrate workflow
• A fixed design and deploy pattern
• Leverage messaging bus across disconnected networks
Developer facing, custom coding needed to optimize
Pre-built failure handling, lack of flexibility
No holistic view of global dataflow
No built-in data traceability
Apache NiFi / ETL Tools
NiFi
NOT schema dependent
• Dataflow management for both structured and unstructured data, powered by separation of metadata and payload
• Schema is not required, but you can have a schema
• Minimum modeling effort, just enough to manage dataflows
• Do the plumbing job, maximize developers’ brainpower for creative work
Not designed to do heavy lifting transformation work for DB tables (JOIN datasets, etc.). You can create custom processors to do that, but there is a long way to go to catch up with existing ETL tools from a user experience perspective (GUI for data wrangling, cleansing, etc.)
ETL (Informatica, etc.)
Schema dependent
• Tailored for Databases/WH
• ETL operations based on schema/data modeling
• Highly efficient, optimized performance
Must pre-prepare your data; time consuming to build data models and maintain schemas
Not geared towards handling unstructured data, PDF, Audio, Video, etc.
Not designed to solve dataflow problems
NiFi Big Picture Pattern: Diverse Flows from One Tool
“Swiss Army Knife of Data Movement”
Apache Kafka
What is Apache Kafka?
• Distributed streaming platform that allows publishing and subscribing to streams of records
• Streams of records are organized into categories called topics
• Topics can be partitioned and/or replicated
• Records consist of a key, value, and timestamp
http://kafka.apache.org/intro
[Diagram: three producers, Kafka Cluster, three consumers]
APACHE KAFKA
High throughput, distributed system. Designed to operate at large scale.
Why Kafka
[Diagram: Producers (four Source Systems), Brokers (Kafka), Consumers (Hadoop, Security Systems, Real-Time Monitoring, Data Warehouse)]
Kafka decouples data pipelines
Overview of Topics
• Topics are partitioned; each partition is an ordered, immutable sequence of messages
• Messages are retained for a configurable amount of time (24 hours, 7 days, etc.)
• Each consumer retains its own offset in the partition
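Retention and per-consumer offsets can be sketched in a few lines of Python. This is a toy model, not Kafka's implementation: `PartitionLog`, the timestamps, and the consumer names are assumptions made for illustration. It shows that expiring old messages does not renumber the remaining ones, and that each consumer reads from its own position.

```python
class PartitionLog:
    """Toy append-only log with time-based retention (not the Kafka implementation)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.base_offset = 0     # offset of the oldest retained message
        self.entries = []        # list of (timestamp, message)

    def append(self, timestamp, message):
        """Append at the end and return the offset assigned to the message."""
        self.entries.append((timestamp, message))
        return self.base_offset + len(self.entries) - 1

    def expire(self, now):
        """Drop messages older than the retention period; surviving offsets are unchanged."""
        while self.entries and now - self.entries[0][0] > self.retention_seconds:
            self.entries.pop(0)
            self.base_offset += 1

    def read(self, offset):
        """Return the retained messages from the given offset onward."""
        start = max(offset - self.base_offset, 0)
        return [message for _, message in self.entries[start:]]


log = PartitionLog(retention_seconds=60)
for ts, msg in ((0, "m0"), (30, "m1"), (50, "m2")):
    log.append(ts, msg)

# Each consumer tracks its own position independently of the others.
offsets = {"consumer_a": 0, "consumer_b": 2}
seen_by_a = log.read(offsets["consumer_a"])
seen_by_b = log.read(offsets["consumer_b"])

log.expire(now=80)                    # m0 is older than 60 seconds and is dropped
after_expiry = log.read(offsets["consumer_a"])
```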
Understanding Partitions
• Partitions are an ordered, immutable sequence of messages
• Each partition is replicated for fault tolerance
• A replicated partition has one broker that acts as the leader and the rest as followers
Publishing Messages
1. A producer publishes messages to a topic (message_a through message_f)
2. The producer decides which partition to send each message to
3. New messages are written to the end of the partition
4. A consumer fetches messages from a partition by specifying an offset
Partition layout (offsets 0 to 4, old to new):
Partition 0: message_b, message_f
Partition 1: message_a, message_c, message_e
Partition 2: message_d
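The four publishing steps can be sketched as a toy Python model. Everything here is illustrative: `Topic`, `publish`, `fetch`, and the keys are hypothetical, and `zlib.crc32` is only a stand-in for whatever hash a real partitioner uses. The sketch shows a producer-side partition choice based on the message key, appends at the end of the chosen partition, and fetches by offset.

```python
import zlib


class Topic:
    """Toy topic: a fixed number of append-only partitions (illustrative only)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Pick a partition from the key, append at the end, return (partition, offset)."""
        # crc32 is a stand-in hash; a real Kafka partitioner uses a different function.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        return partition, len(self.partitions[partition]) - 1

    def fetch(self, partition, offset):
        """Return every message in the partition starting at the given offset."""
        return self.partitions[partition][offset:]


topic = Topic(num_partitions=3)
p1, o1 = topic.publish("sensor-1", "message_a")
p2, o2 = topic.publish("sensor-1", "message_b")   # same key, so same partition
topic.publish("sensor-2", "message_c")

from_start = topic.fetch(p1, 0)
```

Because order is only defined within a partition, messages that must stay in order need to share a key (or otherwise land in the same partition); there is no ordering across partitions in this model.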
Leader and Followers
Broker 1: my_topic Partition-1 (follower)
Broker 2: my_topic Partition-1 (leader)
Broker 3: my_topic Partition-1 (follower)
The leader handles all read and write requests
Consuming Messages
• Messages are consumed in Kafka by a consumer group
• Each individual consumer is labeled with a group name
• Each message in a topic is sent to one consumer in the group
Consumer Groups
Broker 1: my_topic Partition-0, my_topic Partition-3
Broker 2: my_topic Partition-1, my_topic Partition-2
Consumer Group A: four consumers
Consumer Group B: five consumers
message_1: each message is consumed by one consumer per group
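One way to picture "one consumer per group" is as a partition assignment: within a group, every partition is owned by exactly one consumer, and each group gets its own full assignment. The Python sketch below uses a simplified round-robin rule; `assign_partitions` and the consumer names are hypothetical, and Kafka's real assignment strategies are more involved.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: every partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


partitions = ["my_topic-0", "my_topic-1", "my_topic-2", "my_topic-3"]

# Two groups read the same topic independently of each other.
group_a = assign_partitions(partitions, ["a1", "a2"])
group_b = assign_partitions(partitions, ["b1", "b2", "b3", "b4", "b5"])
```

With more consumers than partitions, as in `group_b`, the extra consumer owns nothing and sits idle, so the partition count caps the useful parallelism of a group.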
Spark Structured Streaming
What is Real-Time?
Thinking about time
Time Skew
Windows
Tumbling Windows
Tumbling Time Windowing
Tumbling Count Windowing
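Tumbling windows are fixed-size and non-overlapping, so each event belongs to exactly one window. A minimal Python sketch of the time-based and count-based variants follows; the function names and sample events are made up for illustration.

```python
from collections import defaultdict


def tumbling_time_windows(events, size):
    """Group (timestamp, value) events into fixed, non-overlapping windows of `size` seconds."""
    windows = defaultdict(list)
    for timestamp, value in events:
        start = (timestamp // size) * size          # window that contains the timestamp
        windows[(start, start + size)].append(value)
    return dict(windows)


def tumbling_count_windows(values, count):
    """Split values into consecutive windows of `count` items (the last may be shorter)."""
    return [values[i:i + count] for i in range(0, len(values), count)]


events = [(1, "a"), (4, "b"), (5, "c"), (11, "d")]
by_time = tumbling_time_windows(events, size=5)
by_count = tumbling_count_windows(["a", "b", "c", "d", "e"], count=2)
```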
Sliding Windows
Sliding Time Window
Sliding Count Window
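Sliding windows have a size and a slide interval. When the slide is smaller than the size, windows overlap and one event can land in several of them. The Python sketch below covers both the time-based and count-based variants; the function names and sample data are assumptions for illustration.

```python
def sliding_time_windows(events, size, slide):
    """Assign each (timestamp, value) event to every window [start, start + size) covering it."""
    windows = {}
    for timestamp, value in events:
        # Latest window start at or before the timestamp, aligned to the slide interval.
        start = (timestamp // slide) * slide
        while start > timestamp - size:
            if start >= 0:
                windows.setdefault((start, start + size), []).append(value)
            start -= slide
    return windows


def sliding_count_windows(values, size, slide):
    """Windows of `size` items that advance by `slide` items."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, slide)]


events = [(1, "a"), (6, "b"), (12, "c")]
by_time = sliding_time_windows(events, size=10, slide=5)
by_count = sliding_count_windows([1, 2, 3, 4, 5], size=3, slide=1)
```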
Watermarking
• The threshold of how late data is expected to be and when to drop old state
• Trails behind max event time seen by the Spark engine
• Easiest to think of the Watermark delay as the trailing gap
Watermarking
• Late data that is newer than the watermark is still allowed to aggregate
• Data that is older than the watermark is dropped as it’s "too late"
• Any windows older than the watermark are automatically deleted to limit state
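The watermark rules can be traced with a simplified Python model. This is not Spark's implementation: `WatermarkedCounter` and its fields are hypothetical, event times are plain integers, and the watermark is only advanced at the end of each batch. It shows late-but-allowed data being counted, too-late data being dropped, and old windows being moved out of the live state.

```python
class WatermarkedCounter:
    """Toy event-time window counter with a trailing watermark (not Spark's implementation)."""

    def __init__(self, window_size, delay):
        self.window_size = window_size
        self.delay = delay            # the trailing gap behind the max event time seen
        self.max_event_time = 0
        self.open_windows = {}        # window start -> count (live state)
        self.closed_windows = {}      # windows finalized and removed from live state
        self.dropped = []             # events that arrived too late

    def process_batch(self, event_times):
        watermark = self.max_event_time - self.delay   # fixed for the whole batch
        for event_time in event_times:
            if event_time < watermark:
                self.dropped.append(event_time)        # older than the watermark
                continue
            start = (event_time // self.window_size) * self.window_size
            self.open_windows[start] = self.open_windows.get(start, 0) + 1
        # Advance the watermark, then evict windows that ended at or before it.
        self.max_event_time = max([self.max_event_time, *event_times])
        new_watermark = self.max_event_time - self.delay
        for start in sorted(self.open_windows):
            if start + self.window_size <= new_watermark:
                self.closed_windows[start] = self.open_windows.pop(start)


counter = WatermarkedCounter(window_size=10, delay=10)
counter.process_batch([12, 14, 21])   # watermark moves to 11 afterwards
counter.process_batch([13, 25, 31])   # 13 is late but newer than the watermark (11)
counter.process_batch([15, 33])       # 15 is older than the watermark (21) and is dropped
```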
Structured Streaming
Turns stream processing into SQL
fast, scalable, fault-tolerant
Unifies high level APIs with Spark
deal with complex data and complex workloads
Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
The Do’s and Don’ts
The Good, The Bad, The Ugly
• Use NiFi for all ingestion and data preparation
[Diagram: the simplistic and realistic dataflow views repeated]
• Be very cautious of forcing order in Kafka
• Choose your data store to match your query pattern
• Monitor, Monitor, Monitor
• Think about using schemas
• Plan for spikes in traffic – this impacts Kafka partitioning and consumers
Questions?
Thank you
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Ad

Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved Using Spark Streaming and NiFi for the next generation of ETL in the enterprise Andrew Psaltis Regional CTO APAC
  • 2. Traditional ETL
      [Diagram labels: File, 1011010, DB; seconds, hours]
  • 3. Streaming ETL
      [Diagram labels: 1011010, DB; seconds]
  • 4. Reference Streaming Architecture
  • 5. Apache NiFi
  • 6. (no slide text)
  • 7. Simplistic View of Dataflows: Easy, Definitive
      [Diagram labels: Acquire Data, Store Data, Data Flow, Process, Analyze Data]
  • 8. Standards: http://xkcd.com/927/
  • 9. Realistic View of Dataflows: Complex, Convoluted
      [Diagram labels: Acquire Data (×4), Store Data (×5), Process and Analyze Data, Data Flow]
  • 10. The National Security Agency Years
      • Created in 2006
      • Improved over eight years
      • Simple initial vision: Visio for real-time dataflow management
      • National Security Agency donated the codebase to the ASF in late 2014
  • 11. Apache NiFi: Key Features
      • Guaranteed delivery
      • Data buffering
        - Backpressure
        - Pressure release
      • Prioritized queuing
      • Flow specific QoS
        - Latency vs. throughput
        - Loss tolerance
      • Data provenance
      • Supports push and pull models
      • Recovery/recording a rolling log of fine-grained history
      • Visual command and control
      • Flow templates
      • Pluggable/multi-role security
      • Designed for extension
      • Clustering
  • 12. Visual Command and Control
      • Drag and drop processors to build a flow
      • Start, stop, and configure components in real time
      • View errors and corresponding error messages
      • View statistics and health of data flow
      • Create templates of common processors & connections
  • 13. Provenance/Lineage
  • 14. Prioritization
      • Configure a prioritizer per connection
      • Determine what is important for your data: time-based, arrival order, importance of a data set
      • Funnel many connections down to a single connection to prioritize across data sets
      • Develop your own prioritizer if needed
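
The prioritization ideas on slide 14 can be sketched in a few lines of plain Python. This is only an illustration of the concept, not NiFi's API: the `FlowFile` fields, the `PrioritizedConnection` class, and the two prioritizer functions are hypothetical names invented here. A "connection" is modeled as a queue whose release order is decided by a pluggable prioritizer, and a "funnel" is simply one such queue fed from several sources.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class FlowFile:
    # Hypothetical attributes, chosen only for this sketch.
    name: str
    priority: int   # lower value = more important
    arrival: int    # arrival sequence number


class PrioritizedConnection:
    """Queue that releases items in the order given by a prioritizer function."""

    def __init__(self, prioritizer):
        self._prioritizer = prioritizer
        self._heap = []
        self._tie = itertools.count()  # keeps ordering stable for equal keys

    def put(self, item):
        heapq.heappush(self._heap, (self._prioritizer(item), next(self._tie), item))

    def get(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def by_importance(flow_file):
    return flow_file.priority


def by_arrival(flow_file):
    return flow_file.arrival


def drain(connection):
    """Pull everything out of a connection and return the names in release order."""
    released = []
    while len(connection):
        released.append(connection.get().name)
    return released


# Three data sets funneled into one connection.
files = [
    FlowFile("bulk-log", priority=5, arrival=0),
    FlowFile("alert", priority=1, arrival=1),
    FlowFile("metrics", priority=3, arrival=2),
]

funnel = PrioritizedConnection(by_importance)
for f in files:
    funnel.put(f)
importance_order = drain(funnel)

fifo = PrioritizedConnection(by_arrival)
for f in files:
    fifo.put(f)
arrival_order = drain(fifo)
```

Swapping the prioritizer function changes the release order without touching the queue itself, which is the design point the slide makes about configuring a prioritizer per connection.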
  • 15. (no slide text)
  • 16. Latency vs. Throughput
      • Choose between lower latency or higher throughput on each processor
  • 17. NiFi Positioning
      [Diagram labels: Apache NiFi / MiNiFi; ETL (Informatica, etc.); Enterprise Service Bus (Fuse, Mule, etc.); Messaging Bus (Kafka, MQ, etc.); Processing Framework (Storm, Spark, etc.)]
  • 18. Apache NiFi / Processing Frameworks
      NiFi: simple event processing
      • Primarily feeds data into processing frameworks; can process data, with a focus on simple event processing
      • Operates on a single piece of data, or in correlation with an enrichment dataset (enrichment, parsing, splitting, and transformations)
      • Can scale out, but scales up better to take full advantage of hardware resources and run concurrent processing tasks/threads (processing terabytes of data per day on a single node)
      Limitation: not another distributed processing framework; it feeds data into those
      Processing Frameworks (Storm, Spark, etc.): complex and distributed processing
      • Complex processing from multiple streams (JOIN operations)
      • Analyzing data across time windows (rolling window aggregation, standard deviation, etc.)
      • Scale out to thousands of nodes if needed
      Limitation: not designed to collect data or manage data flow
  • 19. Apache NiFi / Messaging Bus Services
      NiFi: provides a dataflow solution
      • Centralized management, from edge to core
      • Great traceability, event-level data provenance starting when data is born
      • Interactive command and control: real-time operational visibility
      • Dataflow management, including prioritization, back pressure, and edge intelligence
      • Visual representation of global dataflow
      Limitation: not a messaging bus; flow maintenance is needed when you have frequent consumer-side updates
      Messaging Bus (Kafka, JMS, etc.): provides a messaging bus service
      • Low latency
      • Great data durability
      • Decentralized management (producers & consumers)
      • Low broker maintenance for dynamic consumer-side updates
      Limitations:
      • Not designed to solve dataflow problems (prioritization, edge intelligence, etc.)
      • Traceability limited to in/out of topics, no lineage
      • Lack of global view of components/connectivities
  • 20. Apache NiFi / Integration, or Ingestion, Frameworks
      NiFi: end-user-facing dataflow management tool
      • Out-of-the-box solution for dataflow management
      • Interactive command and control in the core, design and deploy on the edge
      • Flexible failure handling at each point of the flow
      • Visual representation of global dataflow and connectivities
      • Native cross data center communication
      • Data provenance for traceability
      Limitation: not a library to be embedded in other applications
      Integration framework (Spring Integration, Camel, etc.), ingestion framework (Flume, etc.): developer-facing integration tool with a focus on data ingestion
      • A set of tools to orchestrate workflow
      • A fixed design and deploy pattern
      • Leverage messaging bus across disconnected networks
      Limitations:
      • Developer facing, custom coding needed to optimize
      • Pre-built failure handling, lack of flexibility
      • No holistic view of global dataflow
      • No built-in data traceability
  • 21. Apache NiFi / ETL Tools
      NiFi: NOT schema dependent
      • Dataflow management for both structured and unstructured data, powered by separation of metadata and payload
      • Schema is not required, but you can have a schema
      • Minimum modeling effort, just enough to manage dataflows
      • Do the plumbing job, maximize developers’ brainpower for creative work
      Limitation: not designed to do heavy-lifting transformation work for DB tables (JOIN datasets, etc.). You can create custom processors to do that, but there is a long way to go to catch up with existing ETL tools from a user experience perspective (GUI for data wrangling, cleansing, etc.)
      ETL (Informatica, etc.): schema dependent
      • Tailored for databases/WH
      • ETL operations based on schema/data modeling
      • Highly efficient, optimized performance
      Limitations:
      • Must pre-prepare your data; time consuming to build data models and maintain schemas
      • Not geared towards handling unstructured data, PDF, audio, video, etc.
      • Not designed to solve dataflow problems
  • 22. NiFi Big Picture Pattern: Diverse Flows from One Tool
      “Swiss Army Knife of Data Movement”
  • 23. Apache Kafka
  • 24. (no slide text)
  • 25. What is Apache Kafka?
      • Distributed streaming platform that allows publishing and subscribing to streams of records
      • Streams of records are organized into categories called topics
      • Topics can be partitioned and/or replicated
      • Records consist of a key, value, and timestamp
      http://kafka.apache.org/intro
      [Diagram labels: producers (×3), Kafka Cluster, consumers (×3)]
      APACHE KAFKA: High throughput, distributed system. Designed to operate at large scale.
  • 26. Why Kafka
      [Diagram labels: Producers: Source System (×4); Brokers: Kafka; Consumers: Hadoop, Security Systems, Real-Time Monitoring, Data Warehouse]
      Kafka decouples data pipelines
  • 27. Overview of Topics
      • Topics are a partitioned, ordered, immutable sequence of messages
      • Messages are retained for a configurable amount of time (24 hours, 7 days, etc.)
      • Each consumer retains its own offset in the partition
  • 28. Understanding Partitions
      • Partitions are an ordered, immutable sequence of messages
      • Each partition is replicated for fault tolerance
      • A replicated partition has one broker that acts as the leader and the rest as followers
  • 29. Publishing Messages
      1. A producer publishes messages to a topic
      2. The producer decides which partition to send each message to
      3. New messages are written to the end of the partition
      4. A consumer fetches messages from a partition by specifying an offset
      [Diagram: messages message_a through message_f; offsets 0 1 2 3 4, old to new; Partition 0: message_b, message_f; Partition 1: message_a, message_c, message_e; Partition 2: message_d]
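
The four publishing steps on slide 29 can be mimicked with an in-memory model. The sketch below is not a Kafka client: the `Topic` class and its `publish`/`fetch` methods are hypothetical, and the CRC32-based partition choice is an assumption standing in for whatever partitioning rule a real producer uses. It only shows append-only partitions, per-partition offsets, and offset-based fetching.

```python
import zlib


class Topic:
    """In-memory stand-in for a partitioned, append-only topic (illustration only)."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        # Step 2: the producer picks the partition. A stable hash of the key
        # is used here as an assumption, so equal keys share a partition.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        # Step 3: new messages are appended to the end of the partition.
        self.partitions[partition].append(value)
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def fetch(self, partition, offset, max_messages=10):
        # Step 4: a consumer reads from a partition starting at a given offset.
        return self.partitions[partition][offset:offset + max_messages]


topic = Topic("my_topic", num_partitions=3)

# Step 1: a producer publishes messages to the topic.
p1, o1 = topic.publish("sensor-1", "message_a")
p2, o2 = topic.publish("sensor-1", "message_b")
p3, o3 = topic.publish("sensor-1", "message_c")

# A consumer that already processed offset 0 resumes from offset 1.
replayed = topic.fetch(p1, offset=1)
```

Because the log is never rewritten, a consumer can re-read from any earlier offset, which is why each consumer tracking its own offset (slide 27) is sufficient.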
  • 30. Leader and Followers
      [Diagram: Broker 1: my_topic Partition-1 (follower); Broker 2: my_topic Partition-1 (leader); Broker 3: my_topic Partition-1 (follower)]
      The leader handles all read and write requests
  • 31. Consuming Messages
      • Messages are consumed in Kafka by a consumer group
      • Each individual consumer is labeled with a group name
      • Each message in a topic is sent to one consumer in the group
  • 32. Consumer Groups
      [Diagram: Broker 1: my_topic Partition-0, Partition-3; Broker 2: my_topic Partition-1, Partition-2; Consumer Group A: 4 consumers; Consumer Group B: 5 consumers; message_1]
      Each message is consumed by one consumer per group
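
The rule "each message is consumed by one consumer per group" follows from each partition being owned by exactly one consumer within a group. The sketch below illustrates that with a simple round-robin assignment; the `assign_partitions` function and the consumer names are hypothetical, and real Kafka assignment strategies differ in detail. The partition and group sizes mirror the slide 32 diagram (4 partitions, groups of 4 and 5 consumers).

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer in the group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


def owners(assignment, partition):
    """Return the consumers in one group that own the given partition."""
    return [c for c, owned in assignment.items() if partition in owned]


partitions = [0, 1, 2, 3]

# Group A has as many consumers as partitions: one partition each.
group_a = assign_partitions(partitions, ["a1", "a2", "a3", "a4"])

# Group B has more consumers than partitions: one consumer stays idle.
group_b = assign_partitions(partitions, ["b1", "b2", "b3", "b4", "b5"])

# A message written to partition 2 is read once by each group.
readers_of_partition_2 = owners(group_a, 2) + owners(group_b, 2)
```

Under this model, adding consumers beyond the partition count adds no parallelism, which is one reason the deck later advises planning partitioning around traffic spikes.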
  • 33. Spark Structured Streaming
  • 34. What is Real-Time?
  • 35. Thinking about time
  • 36. (no slide text)
  • 37. Time Skew
  • 38. Windows
  • 39. Tumbling Windows
  • 40. Tumbling Time Windowing
  • 41. Tumbling Count Windowing
  • 42. Sliding Windows
  • 43. Sliding Time Window
  • 44. Sliding Count Window
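
The windowing slides carry only titles, so the following is a small sketch of the usual meaning of tumbling and sliding time windows, written in plain Python rather than Spark. The function names, the integer timestamps, and the choice to ignore windows that would start before time 0 are all assumptions made for this example.

```python
from collections import defaultdict


def tumbling_windows(event_times, size):
    """Count events per window; each event falls in exactly one [start, start + size)."""
    counts = defaultdict(int)
    for t in event_times:
        start = (t // size) * size
        counts[(start, start + size)] += 1
    return dict(counts)


def sliding_windows(event_times, size, slide):
    """Count events per window; windows start every `slide` units and overlap."""
    counts = defaultdict(int)
    for t in event_times:
        # Earliest window start (a multiple of `slide`) whose window still contains t.
        first = ((t - size) // slide + 1) * slide
        # Assumption for this sketch: ignore windows starting before time 0.
        for start in range(max(first, 0), t + 1, slide):
            counts[(start, start + size)] += 1
    return dict(counts)


event_times = [1, 4, 7, 12]  # event timestamps in arbitrary units

tumbling = tumbling_windows(event_times, size=10)
sliding = sliding_windows(event_times, size=10, slide=5)
```

With a window size of 10, the tumbling version puts every event in one bucket, while the sliding version (advancing by 5) counts the events at 7 and 12 in two overlapping windows each. Count-based windows follow the same idea but close after a fixed number of events rather than a fixed span of time.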
  • 45. Watermarking
      • The threshold of how late data is expected to be and when to drop old state
      • Trails behind max event time seen by the Spark engine
      • Easiest to think of the Watermark delay as the trailing gap
  • 46. Watermarking
      • Data that is newer than the watermark is late, but allowed to aggregate
      • Data that is older than the watermark is dropped as it’s "too late"
      • Any windows older than the watermark are automatically deleted to limit state
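
The watermark rules on slides 45 and 46 can be traced with a small simulation. This is a simplified model in plain Python, not Spark code: the `WatermarkedCounter` class is hypothetical, and it advances the watermark after every event, whereas a real engine may only advance it between batches. The watermark is taken as the maximum event time seen minus a fixed delay, matching the "trailing gap" description.

```python
class WatermarkedCounter:
    """Counts events per tumbling window, dropping events older than the watermark."""

    def __init__(self, window_size, delay):
        self.window_size = window_size
        self.delay = delay
        self.max_event_time = None
        self.open_windows = {}   # window start -> running count (state kept in memory)
        self.finalized = {}      # window start -> final count (state already released)
        self.dropped = []        # event times rejected as "too late"

    @property
    def watermark(self):
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process(self, event_time):
        wm = self.watermark
        if wm is not None and event_time < wm:
            self.dropped.append(event_time)   # older than the watermark
            return
        start = (event_time // self.window_size) * self.window_size
        self.open_windows[start] = self.open_windows.get(start, 0) + 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Windows that end at or before the watermark can no longer change.
        wm = self.watermark
        for s in [s for s in self.open_windows if s + self.window_size <= wm]:
            self.finalized[s] = self.open_windows.pop(s)


counter = WatermarkedCounter(window_size=10, delay=5)
for t in [12, 14, 9, 21, 8, 27, 13]:
    counter.process(t)
```

In this run the event at time 9 arrives late but is still counted, because the watermark is 9 at that moment; the events at 8 and 13 arrive after the watermark has passed them and are dropped. The windows starting at 0 and 10 are finalized and removed from live state, leaving only the window starting at 20 open.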
  • 47. Structured Streaming
      • Turns stream processing into SQL: fast, scalable, fault-tolerant
      • Unifies high-level APIs with Spark: deal with complex data and complex workloads
  • 48. Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 49. Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 50. Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 51. Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 52. Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 53. Source: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  • 54. (no slide text)
  • 55. The Do’s and Don’ts
  • 56. The Good, The Bad, The Ugly
      • Use NiFi for all ingestion and data preparation
      [Diagram labels repeat the simplistic and realistic dataflow views from slides 7 and 9]
  • 57. The Good, The Bad, The Ugly
      • Be very cautious of forcing order in Kafka
  • 58. The Good, The Bad, The Ugly
      • Choose your data store to match your query pattern
  • 59. The Good, The Bad, The Ugly
      • Choose your data store to match your query pattern
  • 60. The Good, The Bad, The Ugly
      • Monitor, Monitor, Monitor
      • Think about using schemas
      • Plan for spikes in traffic: this impacts Kafka partitioning and consumers
  • 61. Questions?
  • 62. Thank you