
Hortonworks DataFlow

Accelerating Big Data Collection and DataFlow Management


A Hortonworks White Paper
DECEMBER 2015


Contents

What is Hortonworks DataFlow?

Benefits of Hortonworks DataFlow
• Leverage Operational Efficiency
• Make Better Business Decisions
• Increase Data Security

Features of Hortonworks DataFlow
• Data Collection
• Real Time Decisions
• Operational Efficiency
• Security and Provenance
• Bi-directional DataFlow
• Command and Control

Common Applications of Hortonworks DataFlow
• Accelerated Data Collection and Operational Effectiveness
• Increased Security and Unprecedented Chain of Custody
• The Internet of Any Thing with Hortonworks DataFlow
• Adaptive to Resource Constraints
• Secure Data Collection
• Prioritized Data Transfer and Bi-directional Feedback Loop

Why Hortonworks for Apache™ Hadoop®?

About Hortonworks

What is Hortonworks DataFlow?

Hortonworks DataFlow (HDF), powered by Apache™ NiFi, is the first integrated platform that solves the complexity and challenges of collecting and transporting data from a multitude of sources, be they big or small, fast or slow, always connected or intermittently available.

Hortonworks DataFlow is a single combined platform for data acquisition, simple event processing, transport and delivery, designed to accommodate the highly diverse and complicated dataflows generated by a world of connected people, systems and things.

An ideal solution for the Internet of Any Thing (IoAT), HDF enables simple, fast data acquisition, secure data transport, prioritized data flow and clear traceability of data from the very edge of your network all the way to the core data center. Through a combination of an intuitive visual interface, a high-fidelity access and authorization mechanism and an "always on" chain of custody (data provenance) framework, HDF is the perfect complement to HDP to bring together historical and perishable insights for your business.

Hortonworks DataFlow is based on Apache NiFi, technology originally created by the NSA (National Security Agency) in 2006 to address the challenge of automating the flow of data between systems of all types—the very same problem that enterprises are encountering today. After eight years of development and use at scale, the NSA Technology Transfer Program released NiFi to the Apache Software Foundation in the fall of 2014.

A single integrated platform for data acquisition, simple event processing, transport and delivery, from source to storage.

Figure 1: Hortonworks DataFlow. Hortonworks DataFlow enables the real-time collection and processing of perishable insights; Hortonworks Data Platform stores data and metadata, and can be used to enrich content and support changes to real-time dataflows. Hortonworks DataFlow is designed to securely collect and transport data from highly diverse data sources, be they big or small, fast or slow, always connected or intermittently available.

Benefits of Hortonworks DataFlow


DataFlow was designed from the outset to meet the practical challenges of collecting data from a wide range of disparate data sources securely, efficiently, and over a geographically dispersed and possibly fragmented network. Because the NSA encountered many of the issues enterprises are facing now, this technology has been field-proven, with built-in capabilities for security, scalability, integration, reliability and extensibility, and has a proven track record of operational usability and deployment.

HORTONWORKS DATAFLOW ENABLES ENTERPRISES TO:

Leverage Operational Efficiency
• Accelerate big data ROI via simplified data collection and a visually intuitive dataflow management interface
• Significantly reduce the cost and complexity of managing, maintaining and evolving dataflows
• Trace and verify the value of data sources for future investments
• Quickly adapt to new data sources through an extremely scalable, extensible platform

Make Better Business Decisions
• Make better business decisions with highly granular data sharing policies
• Focus on innovation by automating dataflow routing, management and troubleshooting without the need for coding
• Enable on-time, immediate decision making by leveraging real-time, bi-directional dataflows
• Increase business agility with prioritized data collection policies

Increase Data Security
• Support unprecedented yet simple-to-implement data security from source to storage
• Improve compliance and reduce risk through highly granular data access, data sharing and data usage policies
• Create a secure dataflow ecosystem with the ability to run the same security and encryption on small-scale, JVM-capable data sources as well as enterprise-class datacenters

In short, Hortonworks DataFlow lets enterprises:
• Accelerate big data ROI through a single, data-source agnostic collection platform
• Reduce cost and complexity through an intuitive, real-time visual user interface
• Implement unprecedented yet simple data security from source to storage
• Make better business decisions with highly granular data sharing policies
• React in real time by leveraging bi-directional data flows and prioritized data feeds
• Adapt to new data sources through an extremely scalable, extensible platform

Features of Hortonworks DataFlow

DATA COLLECTION
Integrated collection from dynamic, disparate and distributed sources of differing formats, schemas, protocols, speeds and sizes, such as machines, geo-location devices, click streams, files, social feeds, log files and videos.

REAL TIME DECISIONS
Real-time evaluation of perishable insights at the edge, determining whether they are pertinent or not and executing the consequent decision to send, drop or locally store data as needed.

OPERATIONAL EFFICIENCY
Fast, effective drag-and-drop interface for the creation, management, tuning and troubleshooting of dataflows, enabling coding-free creation and adjustment of dataflows in five minutes or less.

SECURITY AND PROVENANCE
Secure end-to-end routing from source to destination, with discrete user authorization and a detailed, real-time visual chain of custody and metadata (data provenance).

BI-DIRECTIONAL DATAFLOW
Reliable prioritization and transport of data in real time, leveraging bi-directional dataflows to dynamically adapt to fluctuations in data volume, network connectivity, and source and endpoint capacity.

COMMAND AND CONTROL
Immediate ability to create, change, tune, view, start, stop, trace, parse, filter, join, merge, transform, fork, clone or replay dataflows through a visual user interface with real-time operational visibility and feedback.

Figure 2: Apache NiFi Real-Time Visual User Interface
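
Dataflows are assembled in the visual interface from reusable processors, and the platform is extensible where a needed source or transformation does not exist: developers can add processors in Java against the public Apache NiFi processor API. Below is a minimal sketch of such a processor; the package, class name and attribute values are illustrative placeholders, not part of HDF itself.

```java
// A minimal custom processor, sketched against the public Apache NiFi
// processor API. Package, class name and attribute values are
// illustrative placeholders.
package com.example.nifi;

import java.util.Collections;
import java.util.Set;

import org.apache.nifi.annotation.documentation.CapabilityDescription;
import org.apache.nifi.annotation.documentation.Tags;
import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;

@Tags({"example", "tagging"})
@CapabilityDescription("Tags each FlowFile with a source-system attribute.")
public class TagSourceProcessor extends AbstractProcessor {

    // Processors route FlowFiles to named relationships; connections
    // drawn in the visual UI attach to these.
    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles tagged successfully")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get(); // take one FlowFile off the queue
        if (flowFile == null) {
            return; // nothing queued for this trigger
        }
        // Attributes travel with the data and appear in its provenance trail.
        flowFile = session.putAttribute(flowFile, "source.system", "edge-sensor");
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```

Once packaged and deployed into a NiFi installation, such a processor appears in the same drag-and-drop palette as the built-in ones.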



Common Applications of Hortonworks DataFlow

Hortonworks DataFlow accelerates time to insight by securely enabling off-the-shelf, flow-based programming for big data infrastructure and by simplifying the current complexity of secure data acquisition, ingestion and real-time analysis of distributed, disparate data sources.

An ideal framework for the collection of data and the management of dataflows, Hortonworks DataFlow is most popularly used for simplified, streamlined big data ingest; for increased security in the collection and sharing of data, with high-fidelity chain-of-custody metadata; and as the underlying infrastructure for the Internet of Things.

Case 1: Accelerated Data Collection and Operational Effectiveness

Streamlined Big Data Ingestion

Hortonworks DataFlow accelerates big data pipeline ingest through a single, integrated and easily extensible visual interface for acquiring and ingesting data from different, disparate, distributed data sources in real time. The simplified and integrated creation, control and analysis of dataflows results in faster ROI on big data projects and increased operational effectiveness.

WHAT IS A COMMAND AND CONTROL INTERFACE?

A command and control interface provides the ability to manipulate the dataflow in real time, such that current contextual data can be fed back to the system to immediately change its output. This is in contrast to a design-and-deploy approach, which involves statically programming a dataflow system before data is flowed through it, then returning to a static programming phase to make adjustments and restarting the dataflow again. An analogy would be the difference between 3D printing, which requires pre-planning before execution, and molding clay, which provides immediate feedback and adjustment of the end product in real time.

To learn more about Hortonworks DataFlow, go to http://hortonworks.com/hdf
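
The same command and control the visual interface provides is also scriptable: Apache NiFi exposes a REST API that the UI itself is built on. As a hedged sketch only—endpoint paths, payloads and revision handling vary across NiFi versions, and the host, processor id and revision version below are placeholders—stopping a running processor through a NiFi 1.x-style API looks roughly like this:

```java
// Stopping a processor through the NiFi REST API - the programmatic
// counterpart of pressing "stop" in the visual UI. Assumes a NiFi
// 1.x-style run-status endpoint; host, processor id and revision
// version are placeholders.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class StopProcessor {
    public static void main(String[] args) throws Exception {
        String processorId = "00000000-0000-0000-0000-000000000000"; // placeholder
        // NiFi uses optimistic locking: mutations must echo back the
        // component's current revision (fetched beforehand with a GET).
        String body = "{\"revision\":{\"version\":3},\"state\":\"STOPPED\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/nifi-api/processors/"
                        + processorId + "/run-status"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```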
Figure 3: An integrated, data-source agnostic collection platform. (Diagram labels: easily access and trace changes to the dataflow; dynamically adjust the dataflow; make real-time changes to processors.)

Why aren't current data collection systems ideal?


Current big data collection and ingest tools are purpose-built and over-engineered, simply because they were not originally designed with universally applicable, operationally efficient design principles in mind. This creates a complex architecture of disparate acquisition, messaging and often customized transformation tools that makes big data ingest complex, time consuming and expensive from both a deployment and a maintenance perspective. Further, the time lag associated with command-line- and coding-dependent tools fetters access to data and prevents the on-time, operational decision making required of today's business environment.

Figure 4: Current big data ingest solutions are complex and operationally inefficient

Case 2: Increased Security and Unprecedented Chain of Custody

Increased security and provenance with Hortonworks DataFlow

Data security is growing ever more important in a world of ever-connected devices, and the need to adhere to compliance and data security regulations is currently difficult, complex and costly. Verification of data access and usage is difficult and time consuming, and often involves a manual process of piecing together different systems and reports to verify where data is sourced from, how it is used, who has used it and how often.

Current tools for transporting electronic data are not designed for the security requirements expected of tomorrow. It is difficult, if not almost impossible, for current tools to share discrete bits of data, much less do so dynamically—a problem that had to be addressed in the environment of Apache NiFi as a dataflow platform used in governmental agencies.

Hortonworks DataFlow addresses the security and data provenance needs of an electronic world of distributed, real-time big data flow management. Hortonworks DataFlow augments existing systems with a secure, reliable, simplified and integrated big data ingestion platform that ensures data security from all sources—be they centrally located, high-volume data centers or remotely distributed data sources over geographically dispersed communication links. As part of its security features, HDF inherently provides end-to-end data provenance—a chain of custody for data. Beyond the ability to meet compliance regulations, provenance provides a method for tracing data from its point of origin, and from any point in the dataflow, in order to determine which data sources are most used and most valuable.

WHAT IS DATA PROVENANCE?

Provenance is defined as the place of origin or earliest known history of some thing. In the context of a dataflow, data provenance is the ability to trace the path of a piece of data within a dataflow, from its place of creation through to its final destination. Within Hortonworks DataFlow, data provenance provides the ability to visually verify where data came from, how it was used, who viewed it, and whether it was sent, copied, transformed or received. Any system or person that came in contact with a specific piece of data is captured in full, in terms of time, date, action, precedents and dependents, giving a complete picture of the chain of custody for that precise piece of data within the dataflow. This provenance metadata is used to support data-sharing compliance requirements as well as dataflow troubleshooting and optimization.

Figure 5: Secure from source to storage with high-fidelity data provenance
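
Provenance is not only visual; the chain-of-custody store can also be searched programmatically. As an illustrative sketch only—the asynchronous query endpoints and payload shapes shown here follow NiFi 1.x conventions and vary by version, and the host is a placeholder—a provenance search might be submitted like this:

```java
// Submitting a provenance query over the NiFi REST API - a sketch of
// searching the chain-of-custody store programmatically. Assumes NiFi
// 1.x-style asynchronous endpoints; payloads are simplified and the
// host is a placeholder.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProvenanceQuery {
    public static void main(String[] args) throws Exception {
        // Request the most recent events; real queries can filter by
        // component id, FlowFile uuid, event type (SEND, RECEIVE, ...).
        String body = "{\"provenance\":{\"request\":{\"maxResults\":100}}}";

        HttpRequest submit = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/nifi-api/provenance"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // The response carries a query id; results are then fetched by
        // polling GET /nifi-api/provenance/{id} until the query reports
        // finished (parsing omitted in this sketch).
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(submit, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```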

Data democratization with unprecedented security


Hortonworks DataFlow opens new doors to business insights by enabling secure access without hindering analysis, since very specific data can be shared or withheld. For example, Mary could be given access to discrete pieces of data tagged with the term "finance" within a dataflow, while Jan could be given access to the same dataflow but only to data tagged with "2015" and "revenue". This removes the disadvantages of role-based data access, which can inadvertently create security risks, while still enabling democratization of data for comprehensive analysis and decision making.
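
HDF's tag-based policies are configured within the platform itself; purely as a conceptual sketch of the model described above (none of these names come from HDF), tag-scoped visibility amounts to checking a viewer's tag clearances against a record's tags:

```java
// A conceptual sketch of tag-scoped data sharing, not HDF's actual
// policy engine: a record is visible only to viewers cleared for
// every tag it carries.
import java.util.Set;

public class TagPolicySketch {
    static boolean visible(Set<String> recordTags, Set<String> viewerClearances) {
        return viewerClearances.containsAll(recordTags);
    }

    public static void main(String[] args) {
        Set<String> record = Set.of("2015", "revenue");
        // Mary is cleared for "finance" only; Jan for "2015" and "revenue".
        System.out.println(visible(record, Set.of("finance")));         // false
        System.out.println(visible(record, Set.of("2015", "revenue"))); // true
    }
}
```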

Hortonworks DataFlow, with its inherent ability to support fine-grained provenance data and metadata throughout the collection, transport and ingest process, provides the comprehensive and detailed information needed for audit and remediation, unmatched by any existing data ingest system in place today.

Figure 6: Fine-grained data access and control



Case 3: The Internet of Any Thing (IoAT)


The Internet of Any Thing with Hortonworks DataFlow
Designed in the field, where resources are scarce (power, connectivity, bandwidth), Hortonworks DataFlow is a scalable, proven platform for the acquisition and ingestion of data from the Internet of Things (IoT), or even more broadly, the Internet of Any Thing (IoAT).

Adaptive to Resource Constraints


There are many challenges in enabling an ever-connected yet physically dispersed Internet of Things. Data sources may often be remote, physical footprints may be limited, and power and bandwidth are likely to be both variable and constrained. Unreliable connectivity disrupts communication and causes data loss, while the lack of security on most of the world's deployed sensors puts businesses and safety at risk.

At the same time, devices are producing more data than ever before. Much of the data being produced is data-in-motion, and unlocking the business value of this data is crucial to the business transformations of the modern economy.

Yet business transformation relies on accurate, secure access to data from the source through to storage. Hortonworks DataFlow was designed with all of these real-world constraints in mind: power limitations, connectivity fluctuations, data security and traceability, and data source diversity and geographical distribution, all in service of accurate, time-sensitive decision making.

Figure 7: A proven platform for the Internet of Things



Secure Data Collection


Hortonworks DataFlow addresses the security needs of IoT with a secure, reliable, simplified and integrated big data collection platform that ensures data security from distributed data sources over geographically dispersed communication links. The security features of HDF include end-to-end data provenance—a chain of custody for data. This enables IoT systems to verify the origins of a dataflow, troubleshoot problems from point of origin through to destination, and determine which data sources are most frequently used and most valuable.

Hortonworks DataFlow is able to run security and encryption on small-scale, JVM-capable data sources as well as in enterprise-class datacenters. This provides the Internet of Things with a reliable, secure, common data collection and transport platform, with a real-time feedback loop to continually and immediately improve algorithms and analysis for accurate, informed, on-time decision making.

Prioritized Data Transfer and Bi-directional Feedback Loop


Because connectivity and available bandwidth may fluctuate, and the volume of data being produced by the source may exceed what can be accepted by the destination, Hortonworks DataFlow supports the prioritization of data within a dataflow. This means that under resource constraints, the data source can be instructed to automatically promote the more important pieces of information to be sent first, while holding less important data for future windows of transmission opportunity, or possibly never sending it at all.
For example, should an outage of a remote device occur, it is critical to send the "most important" data from that device first as the outage is repaired and communication is re-established. Once this critical "most important" data has been sent, it can be followed by the backlog of lower-priority data that is vital to historical analysis but less critical to immediate decision making.
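
In Apache NiFi terms, this reordering is driven by prioritizers configured on a connection's queue; the built-in PriorityAttributePrioritizer, for instance, orders FlowFiles by their "priority" attribute. A hedged sketch of a processor assigning that attribute at the edge follows—the class name and the "event.type" rule are illustrative, not part of HDF:

```java
// Edge-side prioritization sketch: tag each FlowFile with a "priority"
// attribute so that a connection configured with NiFi's built-in
// PriorityAttributePrioritizer sends lower values first. The class
// name and "event.type" rule are illustrative placeholders.
package com.example.nifi;

import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;

public class AssignPriority extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("Prioritized FlowFiles").build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session) {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Alarms jump the queue; routine telemetry waits for spare bandwidth.
        String priority = "alarm".equals(flowFile.getAttribute("event.type"))
                ? "1" : "9"; // lower values sort first, so they send first
        flowFile = session.putAttribute(flowFile, "priority", priority);
        session.transfer(flowFile, REL_SUCCESS);
    }
}
```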

Hortonworks DataFlow enables the decision of whether to send, drop or locally store data to be made at the edge, as needed and as conditions change. Additionally, with a fine-grained command and control interface, data queues can be slowed down or accelerated to balance the demands of the situation at hand against the current availability and cost of resources.

With the ability to seamlessly adapt to resource constraints in real time, collect data securely and prioritize data transfer, Hortonworks DataFlow is a proven platform ideal for the Internet of Things.

Why Hortonworks for Apache™ Hadoop®?


Founded in 2011 by 24 engineers from the original Yahoo! Apache Hadoop development and operations team, Hortonworks has amassed more Apache Hadoop experience under one roof than any other organization. Our team members are active participants and leaders in the Apache Hadoop community, developing, designing, building and testing the core of the Apache Hadoop platform. We have years of experience in Apache Hadoop operations, and are best suited to support your mission-critical Apache Hadoop project.

For an independent analysis of Hortonworks Data Platform and its leadership among Apache
Hadoop vendors, you can download the Forrester Wave™: Big Data Apache Hadoop Solutions,
Q1 2014 report from Forrester Research.

About Hortonworks
Hortonworks develops, distributes and supports the only 100% open source Apache Hadoop
data platform. Our team comprises the largest contingent of builders and architects within the
Apache Hadoop ecosystem who represent and lead the broader enterprise requirements
within these communities. Hortonworks Data Platform deeply integrates with existing IT
investments upon which enterprises can build and deploy Apache Hadoop-based applications.
Hortonworks has deep relationships with the key strategic data center partners that enable our
customers to unlock the broadest opportunities from Apache Hadoop.

For more information, visit www.hortonworks.com.
