Apache Flume: Distributed Log Collection For Hadoop - Second Edition - Sample Chapter
Design and implement a series of Flume agents to send streamed
data into Hadoop
Steve Hoffman
This is the first update to Steve's first book, Apache Flume: Distributed Log Collection
for Hadoop, published by Packt Publishing.
I'd again like to dedicate this updated book to my loving and supportive
wife, Tracy. She puts up with a lot, and that is very much appreciated. I
couldn't ask for a better friend daily by my side.
My terrific children, Rachel and Noah, are a constant reminder that hard
work does pay off and that great things can come from chaos.
I also want to give a big thanks to my parents, Alan and Karen, for
molding me into the somewhat satisfactory human I've become. Their
dedication to family and education above all else guides me daily as I
attempt to help my own children find their happiness in the world.
Chapter 4, Sinks and Sink Processors, goes into great detail on using the HDFS Flume
output, including compression options and options for formatting the data. Failover
options are also covered so that you can create a more robust data pipeline.
Chapter 5, Sources and Channel Selectors, introduces several of the Flume input
mechanisms and their configuration options. Also covered is switching between different
channels based on data content, which allows the creation of complex data flows.
Chapter 6, Interceptors, ETL, and Routing, explains how to transform data in-flight as
well as extract information from the payload to use with Channel Selectors to make
routing decisions. Then this chapter covers tiering Flume agents using Avro serialization,
as well as using the Flume command line as a standalone Avro client for testing and
importing data manually.
Chapter 7, Putting It All Together, walks you through the details of an end-to-end use
case from the web server logs to a searchable UI, backed by Elasticsearch as well as
archival storage in HDFS.
Chapter 8, Monitoring Flume, discusses various options available for monitoring Flume
both internally and externally, including Monit, Nagios, Ganglia, and custom hooks.
Chapter 9, There Is No Spoon - the Realities of Real-time Distributed Data Collection, is
a collection of miscellaneous things to consider that are outside the scope of just
configuring and using Flume.
This works great when you have all your data neatly packaged and ready to upload.
However, your website is creating data all the time. How often should you batch
load data to HDFS? Daily? Hourly? Whatever processing period you choose,
eventually somebody always asks, "Can you get me the data sooner?" What you
really need is a solution that can deal with streaming logs/data.
It turns out you aren't alone in this need. Cloudera, a provider of professional services
for Hadoop as well as its own Hadoop distribution, saw this need over and over
when working with its customers. Flume was created to fill this need and provide a
standard, simple, robust, flexible, and extensible tool for data ingestion into Hadoop.
Flume 0.9
Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted
of a federation of worker daemons (agents) configured from a centralized master
(or masters) via ZooKeeper (a distributed configuration and coordination service).
From the master, you could check the agent status in a web UI as well as push
out configuration centrally from the UI or via a command-line shell (both really
communicating via Zookeeper to the worker agents).
Data could be sent in one of three modes: Best effort (BE), Disk Failover (DFO), and
End-to-End (E2E). The masters were used for the E2E mode acknowledgements, and
multimaster configuration never really matured, so you usually had only one master,
making it a central point of failure for E2E data flows. The BE mode is just what it
sounds like: the agent would try to send the data, but if it couldn't, the data would
be discarded. This mode is good for things such as metrics, where gaps can easily be
tolerated, as new data is just a second away. The DFO mode stores undeliverable data
on the local disk (or sometimes, in a local database) and keeps retrying until the
data can be delivered to the next recipient in your data flow. This is handy for those
planned (or unplanned) outages, as long as you have sufficient local disk space to
buffer the load.
In June 2011, Cloudera moved control of the Flume project to the Apache Foundation.
It came out of incubator status a year later, in 2012. During the incubation year,
work had already begun to refactor Flume under the Star-Trek-themed tag, Flume-NG
(Flume the Next Generation).
Another major difference in Flume 1.X is that the reading of input data and the
writing of output data are now handled by different worker threads (called
Runners). In Flume 0.9, the input thread also did the writing to the output (except
for failover retries). If the output writer was slow (rather than just failing outright),
it would block Flume's ability to ingest data. This new asynchronous design leaves
the input thread blissfully unaware of any downstream problem.
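To make this decoupling concrete, here is a minimal sketch of a single-agent configuration in which the source and sink interact only through a channel (the agent, source, channel, and sink names, as well as the port and capacity values, are placeholders):

    agent.sources = s1
    agent.channels = c1
    agent.sinks = k1
    # A simple netcat source reads lines of text from a TCP port
    agent.sources.s1.type = netcat
    agent.sources.s1.bind = 0.0.0.0
    agent.sources.s1.port = 12345
    agent.sources.s1.channels = c1
    # The memory channel buffers events between the input and output runners
    agent.channels.c1.type = memory
    agent.channels.c1.capacity = 10000
    # The logger sink drains the channel independently of the source
    agent.sinks.k1.type = logger
    agent.sinks.k1.channel = c1

If the sink slows down, events simply accumulate in the channel (up to its capacity) instead of blocking the input thread.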
The first edition of this book covered all versions of Flume up to Version 1.3.1.
This second edition covers up to Version 1.5.2 (the current version at the time
of writing).
These factors need to be weighed when determining the rotation period to use when
writing to HDFS. If the plan is to keep the data around for a short time, then you can
lean toward the smaller file size. However, if you plan on keeping the data for a very
long time, you can either target larger files or do some periodic cleanup to compact
smaller files into fewer, larger files to make them more MapReduce friendly. After
all, you only ingest the data once, but you might run a MapReduce job on that data
hundreds or thousands of times.
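As a rough illustration, the HDFS sink exposes several roll settings that control this rotation. The sketch below (the sink and channel names, path, and values are placeholders) rolls a file every hour or at roughly 128 MB, whichever comes first:

    agent.sinks.k1.type = hdfs
    agent.sinks.k1.channel = c1
    agent.sinks.k1.hdfs.path = /flume/events/%Y/%m/%d
    # Roll the current file after one hour (value is in seconds)...
    agent.sinks.k1.hdfs.rollInterval = 3600
    # ...or after about 128 MB (value is in bytes), whichever comes first
    agent.sinks.k1.hdfs.rollSize = 134217728
    # Disable rolling by event count
    agent.sinks.k1.hdfs.rollCount = 0

Note that the date escapes in the path require a timestamp header on each event (for example, one added by Flume's timestamp interceptor).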
Flume events
The basic payload of data transported by Flume is called an event. An event is
composed of zero or more headers and a body.
The headers are key/value pairs that can be used to make routing decisions or
carry other structured information (such as the timestamp of the event or the
hostname of the server from which the event originated). You can think of them as
serving the same function as HTTP headers: a way to pass additional information
that is distinct from the body.
The body is an array of bytes that contains the actual payload. If your input
consists of tailed log files, the array is most likely a UTF-8-encoded string
containing a line of text.
Flume may add additional headers automatically (for example, when a source adds the
hostname of the machine where the data originated or sets the event's timestamp), but
the body is mostly untouched unless you edit it en route using interceptors.
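For instance, the bundled timestamp and host interceptors add exactly these kinds of headers. A sketch of attaching them to a source (the agent, source, and interceptor names are placeholders) might look like this:

    agent.sources.s1.interceptors = i1 i2
    # Adds a "timestamp" header containing the event's creation time in milliseconds
    agent.sources.s1.interceptors.i1.type = timestamp
    # Adds a "host" header with the agent machine's hostname (set useIP = true for the IP address)
    agent.sources.s1.interceptors.i2.type = host
    agent.sources.s1.interceptors.i2.useIP = false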
Channel selectors are responsible for how data moves from a source to one or more
channels. Flume comes packaged with two channel selectors that cover most use
cases you might have, although you can write your own if need be. A replicating
channel selector (the default) simply puts a copy of the event into each channel,
assuming you have configured more than one. In contrast, a multiplexing channel
selector can write to different channels depending on some header information.
Combined with some interceptor logic, this duo forms the foundation for routing
input to different channels.
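For example, a multiplexing selector keyed off a hypothetical header named logType might be sketched like this (the agent, source, and channel names, the header name, and its values are placeholders):

    agent.sources.s1.channels = c1 c2
    agent.sources.s1.selector.type = multiplexing
    # Route each event based on the value of its "logType" header
    agent.sources.s1.selector.header = logType
    agent.sources.s1.selector.mapping.error = c1
    agent.sources.s1.selector.mapping.access = c2
    # Anything without a matching value falls back to the default channel
    agent.sources.s1.selector.default = c1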
Finally, a sink processor is the mechanism by which you can create failover paths for
your sinks or load balance events across multiple sinks from a channel.
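As a sketch, a failover sink processor with two prioritized sinks (the group and sink names and the values are placeholders) could be configured like this:

    agent.sinkgroups = g1
    agent.sinkgroups.g1.sinks = k1 k2
    agent.sinkgroups.g1.processor.type = failover
    # The higher priority sink is preferred; k2 is used only if k1 fails
    agent.sinkgroups.g1.processor.priority.k1 = 10
    agent.sinkgroups.g1.processor.priority.k2 = 5
    # Maximum backoff (in milliseconds) applied to a failed sink
    agent.sinkgroups.g1.processor.maxpenalty = 10000

Swapping the processor type to load_balance spreads events across the sinks instead of preferring one.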
Many of the custom Java interceptors that I've written in the past existed only to modify
the body (data), and they can easily be replaced with an out-of-the-box Morphline
command chain. You can get familiar with the Morphline commands by checking
out their reference guide at http://kitesdk.org/docs/current/kitemorphlines/morphlinesReferenceGuide.html.
Flume Version 1.4 also includes a Morphline-backed sink used primarily to feed
data into Solr. We'll see more of this in Chapter 4, Sinks and Sink Processors, in the
Morphline Solr Search Sink section.
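As a preview, wiring up that sink is mostly a matter of pointing it at a morphline configuration file. A sketch (the agent, sink, and channel names, the file path, and the morphline id are placeholders) looks like this:

    agent.sinks.k1.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
    agent.sinks.k1.channel = c1
    # The morphline file defines the command chain applied to each event
    agent.sinks.k1.morphlineFile = /etc/flume/conf/morphline.conf
    # Which morphline (by id) within that file to run
    agent.sinks.k1.morphlineId = morphline1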
Morphlines are just one component of the KiteSDK included in Flume. Starting
with Version 1.5, Flume has added experimental support for KiteData, which is an
effort to create a standard library for datasets in Hadoop. It looks very promising,
but it is outside the scope of this book.
Please see the project home page for more information, as it will
certainly become more prominent in the Hadoop ecosystem as
the technology matures. You can read all about the KiteSDK at
http://kitesdk.org.
Summary
In this chapter, we discussed the problem that Flume is attempting to solve: getting
data into your Hadoop cluster for data processing in an easily configured, reliable
way. We also discussed the Flume agent and its logical components, including
events, sources, channel selectors, channels, sink processors, and sinks. Finally, we
briefly discussed Morphlines as a powerful new ETL (Extract, Transform, Load)
library, available starting with Version 1.4 of Flume.
The next chapter will cover these in more detail, specifically, the most commonly
used implementations of each. Like all good open source projects, almost all of these
components are extensible if the bundled ones don't do what you need them to do.
Get more information about Apache Flume: Distributed Log Collection for Hadoop,
Second Edition at www.PacktPub.com.