Using Spark Streaming and NiFi for the Next Generation of ETL in the Enterprise

© Hortonworks Inc. 2011–2018. All rights reserved
Andrew Psaltis
Regional CTO APAC

Traditional ETL
[Diagram: File → DB; latency ranges from seconds to hours]

Streaming ETL
[Diagram: data stream → DB; latency in seconds]

Reference Streaming Architecture

Apache NiFi

Simplistic View of Dataflows: Easy, Definitive
[Diagram labels: Acquire Data; Data Flow; Store Data; Process/Analyze Data]

Standards: http://xkcd.com/927/

Realistic View of Dataflows: Complex, Convoluted
[Diagram labels: many Acquire Data and Store Data endpoints, plus Process and Analyze Data, all connected through the Data Flow]

The National Security Agency Years
• Created in 2006
• Improved over eight years
• Simple initial vision – Visio for real-time dataflow management
• National Security Agency donated the codebase to the ASF in late 2014

Apache NiFi
Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Supports push and pull models
• Recovery/recording: a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
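The data buffering features above (backpressure and pressure release) can be illustrated with a small simulation. This is a sketch, not NiFi code: the `BufferedConnection` class, its `max_items` and `max_age` parameters, and the integer clock are all hypothetical names chosen for illustration. It assumes backpressure means "refuse new items once the queue hits a limit" and pressure release means "age off items that have waited too long".

```python
from collections import deque
from typing import Deque, Optional, Tuple


class BufferedConnection:
    """Hypothetical queue between two processing steps (not the NiFi API)."""

    def __init__(self, max_items: int, max_age: int) -> None:
        self.max_items = max_items  # backpressure threshold (item count)
        self.max_age = max_age      # pressure release: maximum time an item may wait
        self._queue: Deque[Tuple[int, str]] = deque()  # (enqueue_time, item)

    def _expire(self, now: int) -> None:
        # Pressure release: drop items that have been queued longer than max_age.
        while self._queue and now - self._queue[0][0] > self.max_age:
            self._queue.popleft()

    def offer(self, item: str, now: int) -> bool:
        """Try to enqueue; False signals backpressure to the upstream step."""
        self._expire(now)
        if len(self._queue) >= self.max_items:
            return False
        self._queue.append((now, item))
        return True

    def poll(self, now: int) -> Optional[str]:
        """Hand the oldest unexpired item to the downstream step, if any."""
        self._expire(now)
        return self._queue.popleft()[1] if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)


conn = BufferedConnection(max_items=2, max_age=10)
accepted = [conn.offer("a", now=0), conn.offer("b", now=1), conn.offer("c", now=2)]
print(accepted)  # the third offer is refused because the queue is full
```

When an offer is refused, the upstream step is expected to hold its data and retry later rather than lose it.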

Visual Command and Control
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processor & connections

Provenance/Lineage

Prioritization
• Configure a prioritizer per connection
• Determine what is important for your data: time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
• Develop your own prioritizer if needed
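A minimal sketch of per-connection prioritization, assuming a prioritizer is simply a function that maps an item to a sort key. The `FlowFile` and `PrioritizedConnection` classes and the `priority` attribute name are hypothetical stand-ins for illustration, not the NiFi extension API.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FlowFile:
    """Hypothetical unit of data: a name, an arrival order, and attributes."""
    name: str
    arrival: int
    attributes: Dict[str, str] = field(default_factory=dict)


class PrioritizedConnection:
    """Releases queued items in the order chosen by the configured prioritizer."""

    def __init__(self, prioritizer: Callable[[FlowFile], Tuple[int, ...]]) -> None:
        self._prioritizer = prioritizer
        self._heap: List[tuple] = []
        self._tie = itertools.count()  # unique tie-breaker so items are never compared

    def offer(self, flow_file: FlowFile) -> None:
        key = self._prioritizer(flow_file)
        heapq.heappush(self._heap, (key, next(self._tie), flow_file))

    def poll(self) -> FlowFile:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


def oldest_first(flow_file: FlowFile) -> Tuple[int, ...]:
    """Arrival-order prioritizer."""
    return (flow_file.arrival,)


def by_priority_attribute(flow_file: FlowFile) -> Tuple[int, ...]:
    """Lower 'priority' value goes first; items without one default to 9."""
    return (int(flow_file.attributes.get("priority", "9")), flow_file.arrival)


def drain(connection: PrioritizedConnection) -> List[str]:
    names = []
    while len(connection):
        names.append(connection.poll().name)
    return names


# Funnel items from several sources into one connection, then prioritize across them.
funnel = PrioritizedConnection(by_priority_attribute)
funnel.offer(FlowFile("a", arrival=1, attributes={"priority": "5"}))
funnel.offer(FlowFile("b", arrival=2, attributes={"priority": "1"}))
funnel.offer(FlowFile("c", arrival=3))
order = drain(funnel)
```

Swapping `by_priority_attribute` for `oldest_first` changes the release order without touching the queue itself, which is the point of a pluggable prioritizer.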

Latency vs. Throughput
• Choose between lower latency and higher throughput on each processor

NiFi Positioning
Apache NiFi / MiNiFi, positioned against:
• ETL (Informatica, etc.)
• Enterprise Service Bus (Fuse, Mule, etc.)
• Messaging Bus (Kafka, MQ, etc.)
• Processing Framework (Storm, Spark, etc.)

Apache NiFi / Processing Frameworks
NiFi: Simple event processing
• Primarily feeds data into processing frameworks; can process data, with a focus on simple event processing
• Operates on a single piece of data, or in correlation with an enrichment dataset (enrichment, parsing, splitting, and transformations)
• Can scale out, but scales up better to take full advantage of hardware resources and run concurrent processing tasks/threads (processing terabytes of data per day on a single node)
Not another distributed processing framework; its role is to feed data into those frameworks
Processing Frameworks (Storm, Spark, etc.): Complex and distributed processing
• Complex processing from multiple streams (JOIN operations)
• Analyzing data across time windows (rolling window aggregation, standard deviation, etc.)
• Scale out to thousands of nodes if needed
Not designed to collect data or manage data flow

Apache NiFi / Messaging Bus Services
NiFi: Provides a dataflow solution
• Centralized management, from edge to core
• Great traceability, with event-level data provenance starting when data is born
• Interactive command and control, with real-time operational visibility
• Dataflow management, including prioritization, back pressure, and edge intelligence
• Visual representation of global dataflow
Not a messaging bus; flow maintenance is needed when you have frequent consumer-side updates
Messaging Bus (Kafka, JMS, etc.): Provides a messaging bus service
• Low latency
• Great data durability
• Decentralized management (producers & consumers)
• Low broker maintenance for dynamic consumer-side updates
Not designed to solve dataflow problems (prioritization, edge intelligence, etc.)
Traceability limited to in/out of topics, no lineage
Lacks a global view of components/connectivities

Apache NiFi / Integration, or ingestion, Frameworks
NiFi: End-user-facing dataflow management tool
• Out-of-the-box solution for dataflow management
• Interactive command and control in the core, design and deploy on the edge
• Flexible failure handling at each point of the flow
• Visual representation of global dataflow and connectivities
• Native cross data center communication
• Data provenance for traceability
Not a library to be embedded in other applications
Integration framework (Spring Integration, Camel, etc.), ingestion framework (Flume, etc.): Developer-facing integration tool with a focus on data ingestion
• A set of tools to orchestrate workflow
• A fixed design and deploy pattern
• Leverages a messaging bus across disconnected networks
Developer facing, custom coding needed to optimize
Pre-built failure handling, lack of flexibility
No holistic view of global dataflow
No built-in data traceability

Apache NiFi / ETL Tools
NiFi: NOT schema dependent
• Dataflow management for both structured and unstructured data, powered by separation of metadata and payload
• Schema is not required, but you can have a schema
• Minimum modeling effort, just enough to manage dataflows
• Does the plumbing job, freeing developers’ brainpower for creative work
Not designed to do heavy-lifting transformation work for DB tables (JOIN datasets, etc.). You can create custom processors to do that, but there is a long way to go to catch up with existing ETL tools from a user experience perspective (GUI for data wrangling, cleansing, etc.)
ETL (Informatica, etc.): Schema dependent
• Tailored for Databases/WH
• ETL operations based on schema/data modeling
• Highly efficient, optimized performance
Must pre-prepare your data; building data models and maintaining schemas is time consuming
Not geared towards handling unstructured data (PDF, audio, video, etc.)
Not designed to solve dataflow problems

NiFi Big Picture Pattern: Diverse Flows from One Tool
“Swiss Army Knife of Data Movement”

Apache Kafka

What is Apache Kafka?
• Distributed streaming platform that allows publishing and subscribing to streams of records
• Streams of records are organized into categories called topics
• Topics can be partitioned and/or replicated
• Records consist of a key, value, and timestamp
http://kafka.apache.org/intro
[Diagram: three producers → Kafka Cluster → three consumers]
Apache Kafka: High throughput, distributed system. Designed to operate at large scale.

Why Kafka
[Diagram: Producers (four source systems) → Brokers (Kafka) → Consumers (Hadoop, Security Systems, Real-Time Monitoring, Data Warehouse)]
Kafka decouples data pipelines

Overview of Topics
• Topics are partitioned; each partition is an ordered, immutable sequence of messages
• Messages are retained for a configurable amount of time (24 hours, 7 days, etc.)
• Each consumer retains its own offset in the partition

Understanding Partitions
• Partitions are an ordered, immutable sequence of messages
• Each partition is replicated for fault tolerance
• A replicated partition has one broker that acts as the leader and the rest as followers

Publishing Messages
Messages to publish: message_a, message_b, message_c, message_d, message_e, message_f
1. A producer publishes messages to a topic
2. The producer decides which partition to send each message to
3. New messages are written to the end of the partition
4. A consumer fetches messages from a partition by specifying an offset
Partition layout (offsets start at 0, ordered old to new):
Partition 0: message_b, message_f
Partition 1: message_a, message_c, message_e
Partition 2: message_d
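The four publishing steps can be sketched as an in-memory simulation. This is not the Kafka client API: `Topic`, `Producer`, and `choose_partition` are hypothetical names, and the CRC32-based key hash is an assumption standing in for whatever partitioning strategy a real producer is configured with.

```python
import zlib
from typing import List, Tuple


class Topic:
    """Hypothetical in-memory topic: one append-only list per partition."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[str]] = [[] for _ in range(num_partitions)]

    def append(self, partition: int, message: str) -> int:
        """Step 3: write to the end of the partition and return the new offset."""
        log = self.partitions[partition]
        log.append(message)
        return len(log) - 1

    def fetch(self, partition: int, offset: int, max_messages: int = 10) -> List[str]:
        """Step 4: read from a partition starting at the given offset."""
        return self.partitions[partition][offset:offset + max_messages]


def choose_partition(key: str, num_partitions: int) -> int:
    """Step 2: the producer picks a partition (here, a stable hash of the key)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class Producer:
    def __init__(self, topic: Topic) -> None:
        self.topic = topic

    def send(self, key: str, message: str) -> Tuple[int, int]:
        """Step 1: publish a message; returns (partition, offset)."""
        partition = choose_partition(key, len(self.topic.partitions))
        offset = self.topic.append(partition, message)
        return partition, offset


topic = Topic("my_topic", num_partitions=3)
producer = Producer(topic)
first = producer.send("sensor-1", "message_a")
second = producer.send("sensor-1", "message_b")
# Messages with the same key land in the same partition, in publish order.
print(first, second, topic.fetch(first[0], offset=0))
```

Because ordering is only defined within a partition, messages with different keys may be read in a different relative order than they were published.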

Leader and Followers
Broker 1: my_topic, Partition-1 (follower)
Broker 2: my_topic, Partition-1 (leader)
Broker 3: my_topic, Partition-1 (follower)
The leader handles all read and write requests

Consuming Messages
• Messages are consumed in Kafka by a consumer group
• Each individual consumer is labeled with a group name
• Each message in a topic is sent to one consumer in the group

Consumer Groups
[Diagram: message_1 from my_topic: Partition-0 on Broker 1 is delivered to one consumer in Consumer Group A and one consumer in Consumer Group B]
Each message is consumed by one consumer per group
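A sketch of the "one consumer per group" rule, assuming each partition is owned by exactly one consumer within a group and that groups are independent of each other. The round-robin assignment and the function names are illustrative assumptions; Kafka's real assignment strategies are configurable and more involved.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Give each partition to exactly one consumer in the group (round-robin)."""
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment: Dict[str, List[int]] = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


def recipients(message_partition: int,
               all_partitions: List[int],
               groups: Dict[str, List[str]]) -> Dict[str, str]:
    """For a message in one partition, find the single receiving consumer per group."""
    result: Dict[str, str] = {}
    for group, consumers in groups.items():
        for consumer, owned in assign_partitions(all_partitions, consumers).items():
            if message_partition in owned:
                result[group] = consumer
    return result


groups = {
    "Consumer Group A": ["a1", "a2"],
    "Consumer Group B": ["b1", "b2", "b3"],
}
# message_1 sits in partition 0 of a single-partition topic.
delivered_to = recipients(0, all_partitions=[0], groups=groups)
print(delivered_to)
```

With a single partition, the extra consumers in each group sit idle, which is why partition count bounds the useful parallelism of a group.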

Spark Structured Streaming

What is Real-Time?

Thinking about time

Time Skew

Windows

Tumbling Windows

Tumbling Time Windowing

Tumbling Count Windowing

Sliding Windows

Sliding Time Window

Sliding Count Window
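The four window types named on the preceding slides can be sketched with plain functions. This is an illustration under stated assumptions: timestamps are non-negative integers, time windows are half-open intervals [start, end) aligned to multiples of the window size or slide, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def tumbling_time_window(ts: int, size: int) -> Tuple[int, int]:
    """Each timestamp belongs to exactly one fixed, non-overlapping window."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_time_windows(ts: int, size: int, slide: int) -> List[Tuple[int, int]]:
    """All windows of length `size`, starting every `slide`, that contain ts."""
    windows = []
    start = ts - (ts % slide)  # latest window start that is <= ts
    while start > ts - size:   # earlier windows still cover ts while start + size > ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def tumbling_count_windows(events: Sequence[int], count: int) -> List[List[int]]:
    """Non-overlapping groups of `count` events; the last group may be partial."""
    return [list(events[i:i + count]) for i in range(0, len(events), count)]


def sliding_count_windows(events: Sequence[int], count: int, slide: int) -> List[List[int]]:
    """Full groups of `count` events, advancing by `slide` events each time."""
    return [list(events[i:i + count]) for i in range(0, len(events) - count + 1, slide)]


print(tumbling_time_window(12, size=10))
print(sliding_time_windows(12, size=10, slide=5))
print(tumbling_count_windows([1, 2, 3, 4, 5], count=2))
print(sliding_count_windows([1, 2, 3, 4, 5], count=3, slide=1))
```

A tumbling window is the special case of a sliding window where the slide equals the size, so each event falls into exactly one window.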

Watermarking
• A threshold on how late data is expected to be, which also determines when old state is dropped
• Trails behind the maximum event time seen by the Spark engine
• Easiest to think of the watermark delay as the trailing gap
Watermarking
• Data that arrives late but is still newer than the watermark is allowed to aggregate
• Data that is older than the watermark is dropped as it’s "too late"
• Any windows older than the watermark are automatically deleted to limit state
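The watermark rules above can be sketched as a small simulation of a windowed count. This is a simplification, not Spark's implementation: the watermark here is updated after every event rather than per micro-batch, and `WatermarkedCounter` and its fields are hypothetical names. In Spark itself the delay is declared with `withWatermark` on a streaming DataFrame.

```python
from typing import Dict, List, Optional, Tuple


class WatermarkedCounter:
    """Counts events per tumbling window, dropping data older than the watermark."""

    def __init__(self, window_size: int, delay: int) -> None:
        self.window_size = window_size
        self.delay = delay                          # the "trailing gap"
        self.max_event_time: Optional[int] = None
        self.state: Dict[int, int] = {}             # window start -> running count
        self.finalized: List[Tuple[int, int]] = []  # (window start, final count)
        self.dropped: List[int] = []                # event times that were too late

    @property
    def watermark(self) -> Optional[int]:
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.delay

    def process(self, event_time: int) -> None:
        watermark = self.watermark
        if watermark is not None and event_time < watermark:
            self.dropped.append(event_time)         # older than the watermark: too late
            return
        start = event_time - (event_time % self.window_size)
        self.state[start] = self.state.get(start, 0) + 1
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        self._evict()

    def _evict(self) -> None:
        # Windows that end at or before the watermark can no longer change.
        watermark = self.watermark
        for start in sorted(self.state):
            if start + self.window_size <= watermark:
                self.finalized.append((start, self.state.pop(start)))


counter = WatermarkedCounter(window_size=10, delay=5)
for event_time in [1, 12, 8, 21, 9]:
    counter.process(event_time)
print(counter.finalized, counter.dropped, counter.state)
```

In this run the event at time 8 arrives out of order but is still newer than the watermark, so it is counted; the event at time 9 arrives after the watermark has passed it, so it is dropped.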

Structured Streaming
• Turns stream processing into SQL: fast, scalable, fault-tolerant
• Unifies high-level APIs with Spark: deals with complex data and complex workloads

The Do’s and Don’ts

The Good, The Bad, The Ugly
• Use NiFi for all ingestion and data preparation

• Be very cautious of forcing order in Kafka

• Choose your data store to match your query pattern

• Monitor, Monitor, Monitor
• Think about using schemas
• Plan for spikes in traffic – this impacts Kafka partitioning and consumers

Questions?

Thank you
