Apache Flume: Distributed Log Collection For Hadoop - Second Edition - Sample Chapter

Chapter No. 1, Overview and Architecture. Design and implement a series of Flume agents to send streamed data into Hadoop. For more information: http://bit.ly/1DPAphk

Uploaded by Packt Publishing
Copyright © All Rights Reserved

Apache Flume: Distributed Log Collection for Hadoop
Second Edition
Steve Hoffman

Design and implement a series of Flume agents to send streamed data into Hadoop

Apache Flume is a distributed, reliable, and available service used to efficiently collect,
aggregate, and move large amounts of log data. It is used to stream logs from application
servers to HDFS for ad hoc analysis.

This book starts with an architectural overview of Flume and its logical components.
It explores channels, sinks, and sink processors, followed by sources and channel
selectors. By the end of this book, you will be fully equipped to construct a series of
Flume agents to dynamically transport your stream data and logs from your systems
into Hadoop.

A step-by-step book that guides you through the architecture and components of Flume,
covering different approaches, which are then pulled together as a real-world, end-to-end
use case, gradually going from the simplest to the most advanced features.

Who this book is written for

If you are a Hadoop programmer who wants to learn about Flume to be able to move
datasets into Hadoop in a timely and replicable manner, then this book is ideal for you.
No prior knowledge of Apache Flume is necessary, but a basic knowledge of Hadoop
and the Hadoop File System (HDFS) is assumed.

What you will learn from this book

Understand the Flume architecture, and also how to download and install open source Flume from Apache
Follow along a detailed example of transporting weblogs in Near Real Time (NRT) to Kibana/Elasticsearch and archival in HDFS
Learn tips and tricks for transporting logs and data in your production environment
Understand and configure the Hadoop File System (HDFS) Sink
Use a morphline-backed Sink to feed data into Solr
Configure and use various sources to ingest data
Inspect data records and move them between multiple destinations based on payload content
Transform data en-route to Hadoop and monitor your data flows
Create redundant data flows using sink groups

$36.99 US / £22.99 UK
Prices do not include local sales tax or VAT where applicable.

Visit www.PacktPub.com for books, eBooks, code, downloads, and PacktLib.

Packt Publishing
Community Experience Distilled

In this package, you will find:

The author biography
A preview chapter from the book, Chapter 1, 'Overview and Architecture'
A synopsis of the book's content
More information on Apache Flume: Distributed Log Collection for Hadoop, Second Edition

About the Author


Steve Hoffman has 32 years of experience in software development, ranging from
embedded software development to the design and implementation of large-scale,
service-oriented, object-oriented systems. For the last 5 years, he has focused on
infrastructure as code, including automated Hadoop and HBase implementations and
data ingestion using Apache Flume. Steve holds a BS in computer engineering from
the University of Illinois at Urbana-Champaign and an MS in computer science from
DePaul University. He is currently a senior principal engineer at Orbitz Worldwide.
More information on Steve can be found on his website and on Twitter.

This is the first update to Steve's first book, Apache Flume: Distributed Log Collection
for Hadoop, published by Packt Publishing.
I'd again like to dedicate this updated book to my loving and supportive
wife, Tracy. She puts up with a lot, and that is very much appreciated. I
couldn't ask for a better friend daily by my side.
My terrific children, Rachel and Noah, are a constant reminder that hard
work does pay off and that great things can come from chaos.
I also want to give a big thanks to my parents, Alan and Karen, for
molding me into the somewhat satisfactory human I've become. Their
dedication to family and education above all else guides me daily as I
attempt to help my own children find their happiness in the world.

Apache Flume: Distributed Log Collection for Hadoop
Second Edition
Hadoop is a great open source tool for shifting tons of unstructured data into something
manageable so that your business can gain better insight into your customers' needs. It's
cheap (mostly free), scales horizontally as long as you have space and power in your
datacenter, and can handle problems that would crush your traditional data warehouse.
That said, a little-known secret is that your Hadoop cluster requires you to feed it data.
Otherwise, you just have a very expensive heat generator! You will quickly realize (once
you get past the "playing around" phase with Hadoop) that you will need a tool to
automatically feed data into your cluster. In the past, you had to come up with a solution
for this problem, but no more! Flume was started as a project out of Cloudera, when its
integration engineers had to keep writing tools over and over again for their customers to
automatically import data. Today, the project lives with the Apache Foundation, is under
active development, and boasts of users who have been using it in their production
environments for years.
In this book, I hope to get you up and running quickly with an architectural overview of
Flume and a quick-start guide. After that, we'll dive deep into the details of many of the
more useful Flume components, including the very important file channel for the
persistence of in-flight data records and the HDFS Sink for buffering and writing data
into HDFS (the Hadoop File System). Since Flume comes with a wide variety of
modules, chances are that the only tool you'll need to get started is a text editor for the
configuration file.
By the time you reach the end of this book, you should know enough to build a highly
available, fault-tolerant, streaming data pipeline that feeds your Hadoop cluster.

What This Book Covers


Chapter 1, Overview and Architecture, introduces Flume and the problem space that it's
trying to address (specifically with regard to Hadoop). An architectural overview of the
various components to be covered in later chapters is given.
Chapter 2, A Quick Start Guide to Flume, serves to get you up and running quickly. It
includes downloading Flume, creating a "Hello, World!" configuration, and running it.
Chapter 3, Channels, covers the two major channels most people will use and the
configuration options available for each of them.

Chapter 4, Sinks and Sink Processors, goes into great detail on using the HDFS Flume
output, including compression options and options for formatting the data. Failover
options are also covered so that you can create a more robust data pipeline.
Chapter 5, Sources and Channel Selectors, introduces several of the Flume input
mechanisms and their configuration options. Also covered is switching between different
channels based on data content, which allows the creation of complex data flows.
Chapter 6, Interceptors, ETL, and Routing, explains how to transform data in-flight as
well as extract information from the payload to use with Channel Selectors to make
routing decisions. Then this chapter covers tiering Flume agents using Avro serialization,
as well as using the Flume command line as a standalone Avro client for testing and
importing data manually.
Chapter 7, Putting It All Together, walks you through the details of an end-to-end use
case from the web server logs to a searchable UI, backed by Elasticsearch as well as
archival storage in HDFS.
Chapter 8, Monitoring Flume, discusses various options available for monitoring Flume
both internally and externally, including Monit, Nagios, Ganglia, and custom hooks.
Chapter 9, There Is No Spoon - the Realities of Real-time Distributed Data Collection, is
a collection of miscellaneous things to consider that are outside the scope of just
configuring and using Flume.

Overview and Architecture


If you are reading this book, chances are you are swimming in oceans of data. Creating
mountains of data has become very easy, thanks to Facebook, Twitter, Amazon, digital
cameras and camera phones, YouTube, Google, and just about anything else you can
think of being connected to the Internet. If you provided a website 10 years ago, your
application logs were only used to help you troubleshoot your website. Today, this
same data can provide a valuable insight into your business and customers if you
know how to pan gold out of your river of data.
Furthermore, as you are reading this book, you are also aware that Hadoop was
created to solve (partially) the problem of sifting through mountains of data. Of
course, this only works if you can reliably load your Hadoop cluster with data for
your data scientists to pick apart.
Getting data into and out of Hadoop (in this case, the Hadoop File System, or
HDFS) isn't hard; it is just a simple command, such as:
% hadoop fs -put data.csv

This works great when you have all your data neatly packaged and ready to upload.
However, your website is creating data all the time. How often should you batch
load data to HDFS? Daily? Hourly? Whatever processing period you choose,
eventually somebody always asks "can you get me the data sooner?" What you
really need is a solution that can deal with streaming logs/data.
Turns out you aren't alone in this need. Cloudera, a provider of professional services
for Hadoop as well as their own distribution of Hadoop, saw this need over and over
when working with their customers. Flume was created to fill this need and create a
standard, simple, robust, flexible, and extensible tool for data ingestion into Hadoop.


Flume 0.9
Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted
of a federation of worker daemons (agents) configured from a centralized master
(or masters) via Zookeeper (a federated configuration and coordination system).
From the master, you could check the agent status in a web UI as well as push
out configuration centrally from the UI or via a command-line shell (both really
communicating via Zookeeper to the worker agents).
Data could be sent in one of three modes: Best effort (BE), Disk Failover (DFO), and
End-to-End (E2E). The masters were used for the E2E mode acknowledgements, and the
multimaster configuration never really matured, so you usually had only one master,
making it a central point of failure for E2E data flows. The BE mode is just what it
sounds like: the agent would try to send the data, but if it couldn't, the data would
be discarded. This mode is good for things such as metrics, where gaps can easily be
tolerated, as new data is just a second away. The DFO mode stores undeliverable data
on the local disk (or sometimes, in a local database) and keeps retrying until the
data can be delivered to the next recipient in your data flow. This is handy for those
planned (or unplanned) outages, as long as you have sufficient local disk space to
buffer the load.
In June 2011, Cloudera moved control of the Flume project to the Apache Foundation.
It came out of incubator status a year later, in 2012. During the incubation year,
work had already begun to refactor Flume under the Star-Trek-themed tag, Flume-NG
(Flume the Next Generation).

Flume 1.X (Flume-NG)


There were many reasons why Flume was refactored. If you are interested in
the details, you can read about them at https://issues.apache.org/jira/browse/FLUME-728.
What started as a refactoring branch eventually became the
main line of development as Flume 1.X.
The most obvious change in Flume 1.X is that the centralized configuration master(s)
and Zookeeper are gone. The configuration in Flume 0.9 was overly verbose, and
mistakes were easy to make. Furthermore, centralized configuration was really outside
the scope of Flume's goals. Centralized configuration was replaced with a simple
on-disk configuration file (although the configuration provider is pluggable so that it
can be replaced). These configuration files are easily distributed using tools such as
cf-engine, Chef, and Puppet. If you are using a Cloudera distribution, take a look at
Cloudera Manager to manage your configurations. About two years ago, they created
a free version with no node limit, so it may be an attractive option for you. Just be
sure you don't manage these configurations manually, or you'll be editing these files
manually forever.
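
To give you a feel for what such a file looks like, here is a minimal sketch of a single
agent assembled from components that ship with Flume. The agent name agent1, the
component names r1, c1, and k1, and the port number are arbitrary choices for
illustration, not anything Flume prescribes:

agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# A netcat source listening on a local TCP port
agent1.sources.r1.type = netcat
agent1.sources.r1.bind = localhost
agent1.sources.r1.port = 44444
agent1.sources.r1.channels = c1

# An in-memory channel that buffers up to 1000 events
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000

# A logger sink that writes each event to the agent's log
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1

Chapter 2, A Quick Start Guide to Flume, walks through creating and running a similar
"Hello, World!" configuration.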

Another major difference in Flume 1.X is that the reading of input data and the
writing of output data are now handled by different worker threads (called
Runners). In Flume 0.9, the input thread also did the writing to the output (except
for failover retries). If the output writer was slow (rather than just failing outright),
it would block Flume's ability to ingest data. This new asynchronous design leaves
the input thread blissfully unaware of any downstream problem.
The first edition of this book covered all the versions of Flume up to Version 1.3.1.
This second edition will cover up to Version 1.5.2 (the current version at the time of
writing this).

The problem with HDFS and streaming data/logs
HDFS isn't a real filesystem, at least not in the traditional sense, and many of
the things we take for granted with normal filesystems don't apply here, such
as being able to mount it. This makes getting your streaming data into Hadoop
a little more complicated.
In a regular POSIX-style filesystem, if you open a file and write data, it still exists
on the disk before the file is closed. That is, if another program opens the same
file and starts reading, it will get the data already flushed by the writer to the disk.
Furthermore, if this writing process is interrupted, any portion that made it to disk
is usable (it may be incomplete, but it exists).
In HDFS, the file exists only as a directory entry; it shows zero length until the file
is closed. This means that if data is written to a file for an extended period without
closing it, a network disconnect with the client will leave you with nothing but an
empty file for all your efforts. This may lead you to the conclusion that it would be
wise to write small files so that you can close them as soon as possible.
The problem is that Hadoop doesn't like lots of tiny files. As the HDFS filesystem
metadata is kept in memory on the NameNode, the more files you create, the more
RAM you'll need to use. From a MapReduce perspective, tiny files lead to poor
efficiency. Usually, each Mapper is assigned a single block of a file as the input
(unless you have used certain compression codecs). If you have lots of tiny files,
the cost of starting the worker processes can be disproportionately high compared
to the data they are processing. This kind of block fragmentation also results in more
Mapper tasks, increasing the overall job run times.


These factors need to be weighed when determining the rotation period to use when
writing to HDFS. If the plan is to keep the data around for a short time, then you can
lean toward the smaller file size. However, if you plan on keeping the data for a very
long time, you can either target larger files or do some periodic cleanup to compact
smaller files into fewer, larger files to make them more MapReduce friendly. After
all, you only ingest the data once, but you might run a MapReduce job on that data
hundreds or thousands of times.

Sources, channels, and sinks


The Flume agent's architecture can be viewed in this simple diagram. Inputs are
called sources and outputs are called sinks. Channels provide the glue between
sources and sinks. All of these run inside a daemon called an agent.

Keep in mind:

A source writes events to one or more channels.
A channel is the holding area as events are passed from a source to a sink.
A sink receives events from one channel only.
An agent can have many channels.
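
To make these rules concrete, here is a hedged configuration fragment (the names are
placeholders and the component types are omitted for brevity) showing one source
fanning out to two channels, each drained by its own sink:

agent1.sources = r1
agent1.channels = c1 c2
agent1.sinks = k1 k2

# The source writes each event to both channels
agent1.sources.r1.channels = c1 c2

# Each sink reads from exactly one channel
agent1.sinks.k1.channel = c1
agent1.sinks.k2.channel = c2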

Flume events
The basic payload of data transported by Flume is called an event. An event is
composed of zero or more headers and a body.


The headers are key/value pairs that can be used to make routing decisions or
carry other structured information (such as the timestamp of the event or the
hostname of the server from which the event originated). You can think of it as
serving the same function as HTTP headers: a way to pass additional information
that is distinct from the body.
The body is an array of bytes that contains the actual payload. If your input is
comprised of tailed log files, the array is most likely a UTF-8-encoded string
containing a line of text.

Flume may add additional headers automatically (for example, when a source adds the
hostname where the data originated or an event's timestamp), but the
body is mostly untouched unless you edit it en route using interceptors.

Interceptors, channel selectors, and sink processors
An interceptor is a point in your data flow where you can inspect and alter Flume
events. You can chain zero or more interceptors after a source creates an event. If
you are familiar with the AOP Spring Framework, think MethodInterceptor. In
Java Servlets, it's similar to ServletFilter. Here's an example of what using four
chained interceptors on a source might look like:
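
In configuration terms, such a chain might be sketched as follows. The four interceptor
types shown (timestamp, host, static, and regex_extractor) are all bundled with Flume,
but the interceptor names i1 through i4, the static key/value, and the regular expression
are placeholders of my own choosing:

agent1.sources.r1.interceptors = i1 i2 i3 i4

# i1 adds a timestamp header to each event
agent1.sources.r1.interceptors.i1.type = timestamp

# i2 adds the agent machine's hostname as a header
agent1.sources.r1.interceptors.i2.type = host

# i3 stamps every event with a fixed key/value header
agent1.sources.r1.interceptors.i3.type = static
agent1.sources.r1.interceptors.i3.key = environment
agent1.sources.r1.interceptors.i3.value = production

# i4 copies a leading number from the body into a header named recordId
agent1.sources.r1.interceptors.i4.type = regex_extractor
agent1.sources.r1.interceptors.i4.regex = ^(\\d+)
agent1.sources.r1.interceptors.i4.serializers = s1
agent1.sources.r1.interceptors.i4.serializers.s1.name = recordId

The interceptors run in the order they are listed, so each one sees any headers added
by the interceptors before it.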


Channel selectors are responsible for how data moves from a source to one or more
channels. Flume comes packaged with two channel selectors that cover most use
cases you might have, although you can write your own if need be. A replicating
channel selector (the default) simply puts a copy of the event into each channel,
assuming you have configured more than one. In contrast, a multiplexing channel
selector can write to different channels depending on some header information.
Combined with some interceptor logic, this duo forms the foundation for routing
input to different channels.
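
For instance, a multiplexing selector keyed on a hypothetical header named datatype
might be configured along these lines (the header name, its values, and the channel
names are placeholders):

agent1.sources.r1.channels = c1 c2
agent1.sources.r1.selector.type = multiplexing
agent1.sources.r1.selector.header = datatype

# Events whose datatype header is "weblog" go to c1; "metrics" go to c2
agent1.sources.r1.selector.mapping.weblog = c1
agent1.sources.r1.selector.mapping.metrics = c2

# Anything else falls through to c1
agent1.sources.r1.selector.default = c1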
Finally, a sink processor is the mechanism by which you can create failover paths for
your sinks or load balance events across multiple sinks from a channel.
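
For example, a failover setup grouping two sinks might be sketched like this (names
are placeholders, and each sink would still be attached to a channel as usual):

agent1.sinkgroups = g1
agent1.sinkgroups.g1.sinks = k1 k2
agent1.sinkgroups.g1.processor.type = failover

# k1 is preferred; k2 only takes over if k1 fails
agent1.sinkgroups.g1.processor.priority.k1 = 10
agent1.sinkgroups.g1.processor.priority.k2 = 5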

Tiered data collection (multiple flows and/or agents)
You can chain your Flume agents depending on your particular use case. For
example, you may want to insert an agent in a tiered fashion to limit the number
of clients trying to connect directly to your Hadoop cluster. More likely, your
source machines don't have sufficient disk space to deal with a prolonged outage
or maintenance window, so you create a tier with lots of disk space between your
sources and your Hadoop cluster.
In the following diagram, you can see that there are two places where data is
created (on the left-hand side) and two final destinations for the data (the HDFS
and ElasticSearch cloud bubbles on the right-hand side). To make things more
interesting, let's say one of the machines generates two kinds of data (let's call
them square and triangle data). You can see that in the lower-left agent, we use a
multiplexing channel selector to split the two kinds of data into different channels.
The channel carrying the square data is then routed to the agent in the upper-right
corner (along with the data coming from the upper-left agent). The combined volume
of events is written together in HDFS in Datacenter 1. Meanwhile, the triangle data is
sent to the agent that writes to ElasticSearch in Datacenter 2. Keep in mind that data
transformations can occur after any source. How all of these components can be
used to build complicated data workflows will become clear as we proceed.
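
Although the details come later, tiering is typically wired up by pointing an Avro sink
in one agent at an Avro source in the next tier. A rough sketch follows; the agent
names, hostname, and port number are hypothetical, and the two fragments would live
in separate configuration files on separate machines:

# On each upstream agent: an Avro sink forwards events to the collector tier
agent1.sinks.k1.type = avro
agent1.sinks.k1.hostname = collector.example.com
agent1.sinks.k1.port = 4141

# On the collector agent: an Avro source receives events from upstream agents
collector.sources.r1.type = avro
collector.sources.r1.bind = 0.0.0.0
collector.sources.r1.port = 4141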


The Kite SDK


One of the new technologies incorporated in Flume, starting with Version 1.4,
is something called a Morphline. You can think of a Morphline as a series of
commands chained together to form a data transformation pipe.
If you are a fan of pipelining Unix commands, this will be very familiar to you.
The commands themselves are intended to be small, single-purpose functions that
when chained together create powerful logic. In many ways, using a Morphline
command chain can be identical in functionality to the interceptor paradigm just
mentioned. There is a Morphline interceptor we will cover in Chapter 6, Interceptors,
ETL, and Routing, which you can use instead of, or in addition to, the included
Java-based interceptors.
To get an idea of how useful these commands can be, take a look at the handy grok
command and its included extensible regular expression library at
https://github.com/kite-sdk/kite/blob/master/kite-morphlines/kite-morphlines-core/src/test/resources/grok-dictionaries/grok-patterns


Many of the custom Java interceptors that I've written in the past were intended to
modify the body (data) and can easily be replaced with an out-of-the-box Morphline
command chain. You can get familiar with the Morphline commands by checking out
their reference guide at http://kitesdk.org/docs/current/kite-morphlines/morphlinesReferenceGuide.html.

Flume Version 1.4 also includes a Morphline-backed sink, used primarily to feed
data into Solr. We'll see more of this in Chapter 4, Sinks and Sink Processors, in the
Morphline Solr Search Sink section.
Morphlines are just one component of the KiteSDK included in Flume. Starting
with Version 1.5, Flume has added experimental support for KiteData, which is an
effort to create a standard library for datasets in Hadoop. It looks very promising,
but it is outside the scope of this book.
Please see the project home page for more information, as it will
certainly become more prominent in the Hadoop ecosystem as
the technology matures. You can read all about the KiteSDK at
http://kitesdk.org.

Summary
In this chapter, we discussed the problem that Flume is attempting to solve: getting
data into your Hadoop cluster for data processing in an easily configured, reliable
way. We also discussed the Flume agent and its logical components, including
events, sources, channel selectors, channels, sink processors, and sinks. Finally, we
briefly discussed Morphlines as a powerful new ETL (Extract, Transform, Load)
library, starting with Version 1.4 of Flume.
The next chapter will cover these in more detail, specifically, the most commonly
used implementations of each. Like all good open source projects, almost all of these
components are extensible if the bundled ones don't do what you need them to do.


Get more information on Apache Flume: Distributed Log Collection for Hadoop, Second Edition

Where to buy this book


You can buy Apache Flume: Distributed Log Collection for Hadoop, Second Edition from
the Packt Publishing website.
Alternatively, you can buy the book from Amazon, BN.com, Computer Manuals and most internet
book retailers.
Click here for ordering and shipping details.

www.PacktPub.com

Stay Connected:
