Webinar Series Part 5 New Features of HDF 5

Guide to New Features of
Hortonworks DataFlow 2.0
Haimo Liu
Product Manager
Bryan Bende
Sr. Software Engineer

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Connected Data Platforms

Stream Processing
Flow Management
Enterprise Services
At the edge
Security
Visualization
On premises In the cloud
Registries/Catalogs Governance (Security/Compliance) Operations
HDF 2.0 – Data in Motion Platform

Flow Management Flow management + Stream Processing
DATA IN MOTION / DATA AT REST
IoT Data Sources AWS
Azure
Google Cloud
Hadoop
NiFi
Kafka
Storm
Others…
[Diagram: multiple MiNiFi agents and NiFi instances]
HDF 2.0 – Data in Motion Platform
Enterprise Services
Ambari Ranger Other services

Dataflow Management

Problems Today: Timely Access to Data and Decisions
http://diginomica.com/2016/04/22/royal-mail-starts-to-deliver-on-hortonworks-data-in-motion-promise
“HDF helps us to streamline the flow of data and build models and visualisations quickly, so that my team can work iteratively with business colleagues on building solutions that work for the business.”
Royal Mail

HDP
HORTONWORKS
DATA PLATFORM
Powered by Apache Hadoop
HDF Makes Big Data Ingest Easy
Complicated, messy, and takes weeks to
months to move the right data into Hadoop
Streamlined, Efficient, Easy

Create a live dataflow in minutes
How would that change your business?

Add processor for data intake. Time: 1 minute
1 Drag and drop processor from top menu

Choose the specific processor
2 Choose one of the processors – currently 170+ available

Example: Pick Twitter Processor

Configure the processor. Time: 2 minutes
3 Select processor and choose option to Configure
4 Adjust parameters as required

Another processor for data output. Time: 1 minute
5 Drag and drop processor from top menu
6 Filter for and select a “Put” processor

Configure second processor. Time: 1 minute
7 Configure 2nd processor

Connect processors, configure connection. 2 minutes
8 Configure connection
Note: Sample Flow is different from previous example of PutHDFS. This dataflow is PutFile. Same concepts apply.

Click Start to Begin Processing. Time total: 7 minutes
9 Click start (“play”) to begin processing
(will run continuously until you select stop)

HDF 2.0: what’s new?

Challenges
Different devices
Globally distributed organization
Intelligence on the edge
Time to delivery
Getting the right data to the right place at the right time is not trivial!

Challenges & NiFi
Different devices: different standards/protocols/formats
• Out of the box processors
• Intuitive GUI to combine processors and build ingestion pipeline
• Extensible framework, extremely easy to add a new source/protocol
Globally distributed organizations
Time to delivery
Support disparate, distributed systems with easy drag & drop

Challenges & NiFi & HDF 2.0
• Deeper ecosystem integration, 170+ processors in total
Time to delivery
Expanded ecosystem

HDF 2.0 has 170+ Processors, 30% Increase from HDF 1.2
Hash
Extract
Merge
Duplicate
Scan
GeoEnrich
Replace
Convert
Split
Translate
Route Content
Route Context
Route Text
Control Rate
Distribute Load
Generate Table Fetch
Jolt Transform JSON
Prioritized Delivery
Encrypt
Tail
Evaluate
Execute
HL7
FTP
UDP
XML
SFTP
HTTP
Syslog
Email
HTML
Image
AMQP
MQTT
All Apache project logos are trademarks of the ASF and the respective projects.
Fetch

Deeper Ecosystem Integration – New Processors
Processor             Description
Publish/ConsumeKafka  Two NARs, with Kafka 0.9 and 0.10 client libraries, respectively
JoltTransformJson     Manipulate JSON data on the fly, with a preview function
GenerateTableFetch    Incremental fetch + parallel fetch against source table partitions
PutHiveQL             Ingest to Hive tables
SelectHiveQL          Select from Hive tables
PutHiveStreaming      Ingest streaming data to Hive, leveraging the Hive Streaming API
ConvertAvroToORC      Format conversion, Avro to ORC
Publish/ConsumeMQTT   MQTT is a popular protocol in the IoT world
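
The "incremental fetch + parallel fetch" idea behind GenerateTableFetch can be sketched in a few lines. This is only an illustration of the concept, not the processor's actual implementation: the function name, its parameters, and the generated SQL dialect (LIMIT/OFFSET) are all assumptions made for the example.

```python
def generate_page_queries(table, order_column, last_max, new_max, row_count, page_size):
    """Build one SELECT per page for rows added since the previous run.

    last_max / new_max bound the incremental window on order_column;
    row_count is the number of rows inside that window.
    All names here are hypothetical, chosen for illustration only.
    """
    queries = []
    for offset in range(0, row_count, page_size):
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {order_column} > {last_max} AND {order_column} <= {new_max} "
            f"ORDER BY {order_column} LIMIT {page_size} OFFSET {offset}"
        )
    return queries


# 250 new rows, fetched 100 at a time, yields three independent queries.
pages = generate_page_queries("orders", "id", 100, 350, 250, 100)
for query in pages:
    print(query)
```

Because each generated statement is self-contained, the pages could be handed to different workers and fetched in parallel, while the saved maximum value keeps the next run incremental.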

• Deeper ecosystem integration, 170+ processors in total
• Redesigned UI, refreshed user experience
Time to delivery
More intuitive user interface

Modernized UI – Complete Interface Redesign

Challenges & NiFi
Different devices
Globally distributed organizations: dataflow across multiple data centers
• Internal Site to Site communication, secured by 2-way SSL
• Environment neutral
Time to delivery
Secure communications across disparate, distributed systems

Different devices
• Variable registry
Time to delivery
Simplifies flow provisioning

Variable Registry
• Variable registry
– Automatically resolves environment-specific values
• Example: connection string
• The same key referenced in a template can be mapped to different values in DEV vs. PROD
– In-memory variable registry
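
A minimal sketch of how an in-memory variable registry might resolve the same key to different values per environment. The class, the `${key}` reference syntax as parsed here, the key name, and the JDBC URLs are all hypothetical and only illustrate the idea; this is not NiFi's implementation.

```python
import re


class VariableRegistry:
    """Hypothetical in-memory registry mapping keys to environment-specific values."""

    def __init__(self, variables):
        self._variables = dict(variables)

    def resolve(self, text):
        """Replace every ${key} reference with its registered value."""
        def lookup(match):
            key = match.group(1)
            # Unknown keys are left untouched so the gap stays visible.
            return self._variables.get(key, match.group(0))

        return re.sub(r"\$\{([A-Za-z0-9_.\-]+)\}", lookup, text)


# The template carries only the key; each environment supplies its own value.
template_value = "${db.connection.url}"
dev = VariableRegistry({"db.connection.url": "jdbc:postgresql://dev-db:5432/app"})
prod = VariableRegistry({"db.connection.url": "jdbc:postgresql://prod-db:5432/app"})

print(dev.resolve(template_value))
print(prod.resolve(template_value))
```

The template itself never changes between DEV and PROD; only the registry contents differ, which is what makes the flow portable.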

Different devices
• Internal Site-to-Site communication, secured by 2-way SSL
• Better deployment management, Apache Ambari integration
Time to delivery
Simplified operations in distributed environments

Ambari Integration
• NiFi cluster management
– Start/stop NiFi service
– Centralized place for managing config files
• Ambari displays NiFi metrics
• Ambari manages Kerberos authentication
Ambari-NiFi Integration

HDF 2.0 Deployment Model
• Automated deployment by Ambari
• Manual RPM deployment
• Tar.gz/zip deployment (NiFi/MiNiFi Java)
• Tar.gz for most Linux/Mac; compile your own for other OS (MiNiFi C++)

Different devices
• Better deployment management, Apache Ambari integration
• Enhanced Site to Site communication
Time to delivery
Modularized Site-to-Site (S2S) to support pluggable protocols

Challenges & NiFi
Different devices, Globally distributed organizations
Intelligence on the edge: analytics on resource constrained devices
• Run single node on the edge, communicating back via S2S
• Bi-directional communication
Time to delivery
Analytics at the Edge

Different devices, Globally distributed organizations
Intelligence on the edge: analytics on resource constrained devices
• Run single node on the edge, communicating back via Site to Site protocol
• Bi-directional communication
• Apache MiNiFi, bi-directional command and control on the edge
Time to delivery
Edge Intelligence for the first mile

Edge Intelligence with Apache MiNiFi
Key Features
• Guaranteed delivery
• Data buffering
– Backpressure
– Pressure release
• Prioritized queuing
• Flow specific QoS
– Latency vs. throughput
– Loss tolerance
• Data provenance
• Recovery / recording a rolling log of fine-grained history
• Designed for extension
Different from Apache NiFi
• Design and Deploy
• Warm re-deploys
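
The buffering features listed above (backpressure plus prioritized queuing) can be illustrated with a small bounded priority queue. This is a sketch under stated assumptions, not MiNiFi code: the class name, the item-count threshold, and the "lower number = higher priority" convention are all chosen for the example.

```python
import heapq
import itertools


class BoundedPriorityQueue:
    """Hypothetical connection queue: bounded size, highest priority served first."""

    def __init__(self, max_items):
        self.max_items = max_items
        self._heap = []
        # Tie-breaker so items with equal priority keep arrival order.
        self._counter = itertools.count()

    def is_full(self):
        return len(self._heap) >= self.max_items

    def offer(self, priority, item):
        """Queue an item; return False to signal backpressure when full."""
        if self.is_full():
            return False
        heapq.heappush(self._heap, (priority, next(self._counter), item))
        return True

    def poll(self):
        """Remove and return the highest-priority item, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = BoundedPriorityQueue(max_items=2)
queue.offer(5, "routine reading")
queue.offer(1, "alarm")
accepted = queue.offer(3, "status update")  # queue is full, so this is refused
print(accepted, queue.poll())
```

A refused `offer` is the signal for the upstream component to slow down or hold data until the queue drains; the consumer still receives the most urgent item first.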

NiFi vs. MiNiFi Java Agent
• MiNiFi: NiFi Framework, Components
• NiFi: NiFi Framework, User Interface, Components

Challenges & NiFi
Different devices, Globally distributed organizations, Intelligence on the edge
Time to delivery: need an application, out of the box solution
• Data provenance, traceability and compliance issues
• Flow visibility, big picture of the enterprise dataflow
• Automatic failure handling
FAST AND EASY
To get results, tune and change dataflows

• Control plane high availability, zero-master clustering
High availability

Zero-master Clustering
• New clustering paradigm
• Zero-master clustering
– Multiple entry points, no master node, no single point of failure
– Auto-elected cluster coordinator for cluster maintenance
– Automatic failover handling
HDF 2.0 (NiFi 1.0.0)

Heartbeat messages (every 5s by default)
Node status: connecting/connected/disconnecting/disconnected
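
A rough sketch of how a cluster coordinator might use heartbeats to decide node status. The 5-second interval comes from the slide; the class, the number of missed heartbeats tolerated, and the reduction to two states (connected/disconnected, without the transitional connecting/disconnecting states) are assumptions made for this example.

```python
class HeartbeatMonitor:
    """Hypothetical coordinator-side tracker of node heartbeats."""

    def __init__(self, interval_s=5.0, missed_allowed=8):
        self.interval_s = interval_s          # heartbeat period (5s per the slide)
        self.missed_allowed = missed_allowed  # assumed tolerance, not from the slide
        self._last_seen = {}

    def record(self, node_id, now):
        """Store the time (in seconds) at which a heartbeat arrived."""
        self._last_seen[node_id] = now

    def status(self, node_id, now):
        """Report 'connected' while heartbeats are recent enough."""
        last = self._last_seen.get(node_id)
        if last is None:
            return "disconnected"
        if now - last <= self.interval_s * self.missed_allowed:
            return "connected"
        return "disconnected"


monitor = HeartbeatMonitor()
monitor.record("node-1", now=0.0)
print(monitor.status("node-1", now=10.0))
print(monitor.status("node-1", now=41.0))
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and easy to test.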

• Control plane high availability, zero-master clustering
• Multi-tenancy flow editing, and authorization
Secured enterprise-wide collaboration

Multi-tenant Flow Editing
• Multi-tenant flow editing
– Self-service collaborative model, Google Docs-style user experience
– Multiple teams making edits to different processors at the same time
– Only the component being modified is locked, not the entire flow
– Scalable model to speed up flow editing
HDF 2.0 (NiFi 1.0.0)
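
One way to get "only the component being modified is locked" behaviour is to track a revision number per component rather than one for the whole flow. The sketch below illustrates that idea only; the class and method names are hypothetical and the slides do not say this is how NiFi implements it.

```python
class FlowEditor:
    """Hypothetical per-component revision tracking for concurrent edits."""

    def __init__(self, component_ids):
        # Each component carries its own revision; there is no flow-wide lock.
        self._revisions = {component_id: 0 for component_id in component_ids}

    def revision(self, component_id):
        return self._revisions[component_id]

    def update(self, component_id, expected_revision):
        """Apply an edit only if the caller saw the latest revision."""
        if self._revisions[component_id] != expected_revision:
            return False  # someone else changed this component first
        self._revisions[component_id] += 1
        return True


editor = FlowEditor(["GetTwitter", "PutFile"])
team_a_ok = editor.update("GetTwitter", expected_revision=0)
team_b_ok = editor.update("PutFile", expected_revision=0)   # different component, no conflict
stale_ok = editor.update("GetTwitter", expected_revision=0)  # stale view of the same component
print(team_a_ok, team_b_ok, stale_ok)
```

Edits to different components never contend with each other; only two edits against the same component can conflict.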

Multi-tenant Authorization
• Component level authorization
– New authorizer API
– “Read” and “Write” permissions
– Protection against unauthorized usage without losing context
• Authorization management
– Internal management (NiFi)
– External management (Ranger, etc.)
HDF 2.0 (NiFi 1.0.0)

Read Permission
• Processor name visible
• Processor configuration visible

NO Read Permission
• Processor name & configuration invisible (content)
• Statistics visible (context)
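
The read-permission behaviour shown on these two slides can be sketched as a function that builds the view a given user is allowed to see. The data class, field names, and user names are hypothetical; this only illustrates the "hide content, keep context" rule and is not the NiFi authorizer API.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical processor record with a set of users granted read access."""
    name: str
    config: dict
    stats: dict
    readers: set = field(default_factory=set)


def view_for(processor, user):
    """Return what the user may see: statistics always, details only with read access."""
    view = {"stats": dict(processor.stats)}  # context stays visible to everyone
    if user in processor.readers:
        view["name"] = processor.name        # content requires read permission
        view["config"] = dict(processor.config)
    return view


twitter = Processor(
    name="GetTwitter",
    config={"Twitter Endpoint": "Sample Endpoint"},
    stats={"flowfiles_out": 10},
    readers={"alice"},
)
print(view_for(twitter, "alice"))
print(view_for(twitter, "bob"))
```

A user without read permission still sees that a component exists and how much data moves through it, so the overall flow stays understandable without exposing its configuration.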

Questions?
Hortonworks Community Connection:
Data Ingestion and Streaming
https://community.hortonworks.com/
