Chapter 8 Flume - Massive Log Aggregation
Foreword
1 Huawei Confidential
Objectives
Contents
2. Key Features
3. Applications
What Is Flume?
Flume is a streaming log collection tool. It performs simple processing on data and writes the data to data receivers. Flume can collect data from local files (spooling directory source), real-time logs (taildir and exec sources), REST messages, Thrift, Avro, syslog, Kafka, and other data sources.
What Can Flume Do?
Collect log information from a fixed directory to a destination (HDFS, HBase, or Kafka).
Collect log information to the destination in real time (taildir source).
Support cascading (connecting multiple Flume agents) and data conflation.
Support user-defined data collection tasks.
Flume Agent Architecture
Infrastructure: Flume can collect data directly with a single agent; this mode is mainly used for data collection within a cluster.
Multi-agent architecture: Flume can connect multiple agents in series to collect raw data and store it in the final storage system. This architecture is used to import data from outside the cluster into the cluster.
[Diagram: Log -> Agent 1 (Source, Channel, Sink) -> Agent 2 (Source, Channel, Sink) -> HDFS]
Flume Multi-Agent Consolidation
[Diagram: Logs flow into Agent 1, Agent 2, and Agent 3 (each with Source, Channel, Sink); their sinks feed the source of Agent 4, whose sink writes to HDFS.]
You can configure multiple level-1 agents and point them at the source of a single level-2 agent. The source of the level-2 agent consolidates the received events and sends them into a single channel. The events in the channel are consumed by a sink and pushed to the destination.
Flume Agent Principles
[Diagram: events enter the Source, pass through the Interceptor and the Channel Selector, and are placed into one or more Channels; the Sink Runner drives the Sink Processor, whose Sinks consume events from the channels.]
Basic Concepts - Source (1)
A source receives events, or generates events using a special mechanism, and places them into one or more channels. There are two types of sources: driver-based sources and polling sources.
Driver-based source: An external system proactively sends data to Flume, driving Flume to receive the data.
Polling source: Flume periodically polls for and obtains data.
A source must be associated with at least one channel.
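As an illustrative sketch (the agent name, port, and directory are assumptions, not from the slides), a driver-based Avro source and a polling spooling-directory source might be declared in a Flume properties file like this:

```properties
agent.sources = avroSrc spoolSrc

# Driver-based source: an external client pushes Avro events to this port.
agent.sources.avroSrc.type = avro
agent.sources.avroSrc.bind = 0.0.0.0
agent.sources.avroSrc.port = 4141
agent.sources.avroSrc.channels = ch1

# Polling source: Flume periodically scans the directory for new files.
agent.sources.spoolSrc.type = spooldir
agent.sources.spoolSrc.spoolDir = /var/log/incoming
agent.sources.spoolSrc.channels = ch1
```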
Basic Concepts - Source (2)
Basic Concepts - Channel (1)
A channel sits between a source and a sink. It functions as a queue and temporarily buffers events. When a sink successfully sends events to the channel of the next hop or to the final destination, the events are removed from the channel.
The persistence of a channel varies with its type:
Memory channel: Events are held in memory and are not persistent.
File channel: Implemented based on write-ahead logs (WALs).
JDBC channel: Implemented based on an embedded database.
Channels support transactions and provide weak ordering guarantees. They can connect to any number of sources and sinks.
Basic Concepts - Channel (2)
Memory channel: Events are stored in memory, which gives high throughput but does not ensure reliability; data may be lost.
File channel: Events are persisted to disk. The configuration is more complex: you need to configure a data directory and a checkpoint directory, and each file channel requires its own checkpoint directory.
JDBC channel: A built-in Derby database persists events with high reliability. It can be used in place of the file channel when persistence is required.
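A minimal sketch of the two most common channel types (agent and channel names, capacities, and directories are illustrative assumptions):

```properties
agent.channels = memCh fileCh

# Memory channel: fast, but events are lost if the agent restarts.
agent.channels.memCh.type = memory
agent.channels.memCh.capacity = 10000
agent.channels.memCh.transactionCapacity = 1000

# File channel: WAL-based persistence; each file channel needs its own
# checkpoint directory.
agent.channels.fileCh.type = file
agent.channels.fileCh.checkpointDir = /var/flume/checkpoint
agent.channels.fileCh.dataDirs = /var/flume/data
```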
Basic Concepts - Sink
A sink sends events to the next hop or the final destination and then removes
the events from the channel.
A sink must work with a specific channel.
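As an illustrative sketch (hostname, port, and names are assumptions, not from the slides), an Avro sink that forwards events to a next-hop agent in a cascade might look like:

```properties
agent.sinks = avroSink
agent.sinks.avroSink.type = avro
# Address of the hypothetical next-hop agent's Avro source.
agent.sinks.avroSink.hostname = next-hop.example.com
agent.sinks.avroSink.port = 4141
# A sink must be bound to exactly one channel.
agent.sinks.avroSink.channel = ch1
```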
Contents
2. Key Features
3. Applications
Multi-level Cascading and Multi-channel Replication
Flume supports cascading of multiple Flume agents and data replication within the cascading nodes.
[Diagram: Log -> Agent 1, whose source replicates events into multiple channels, each drained by its own sink toward a downstream agent's source.]
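Multi-channel replication is driven by the source's channel selector. A sketch (illustrative names) of a source replicating every event into two channels, each with its own sink:

```properties
agent.sources.src.channels = ch1 ch2
# Replicating selector: each channel receives its own copy of every event.
agent.sources.src.selector.type = replicating
agent.sinks.sink1.channel = ch1
agent.sinks.sink2.channel = ch2
```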
Cascading Message Compression and Encryption
Data transmission between cascaded Flume agents can be compressed and
encrypted, improving data transmission efficiency and security.
Data Monitoring
Failover
During Flume data transmission, when the next-hop Flume agent is faulty or data receiving becomes abnormal, data can be automatically switched to another path for transmission.
[Diagram: Log -> Agent (Source, Channel) -> two Sinks, each pointing at a different next-hop agent -> HDFS]
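Failover between sinks is configured through a sink group with a failover processor. A sketch with illustrative names:

```properties
agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = sink1 sink2
agent.sinkgroups.g1.processor.type = failover
# The higher-priority sink carries traffic; on failure, the next one takes over.
agent.sinkgroups.g1.processor.priority.sink1 = 10
agent.sinkgroups.g1.processor.priority.sink2 = 5
agent.sinkgroups.g1.processor.maxpenalty = 10000
```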
Data Filter During Data Transmission
During data transmission, Flume performs simple filtering and cleaning to drop unnecessary data. If the data to be filtered is complex, users need to develop filter plug-ins based on their data characteristics; Flume supports third-party filter plug-ins.
[Diagram: Source -> Interceptor -> Channels]
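Simple filtering is typically done with an interceptor attached to the source. A sketch (illustrative names and pattern) using the built-in regex filtering interceptor:

```properties
agent.sources.src.interceptors = i1
agent.sources.src.interceptors.i1.type = regex_filter
# Drop events whose body matches the pattern (e.g., DEBUG log lines).
agent.sources.src.interceptors.i1.regex = ^DEBUG.*
agent.sources.src.interceptors.i1.excludeEvents = true
```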
Contents
2. Key Features
3. Applications
Flume Operation Example 1 (1)
Description
This example shows how Flume ingests logs generated by applications (such as e-
banking systems) in a cluster to HDFS.
Prepare data.
Create a log directory on a node in the cluster: mkdir /tmp/log_test.
Use this directory as the monitoring directory.
Download the Flume client.
Log in to MRS Manager. On the Clusters page, choose Services > Flume > Download
Client.
Flume Operation Example 1 (2)
Install the Flume client.
Decompress the client.
Flume Operation Example 1 (3)
Configure the Flume source.
server.sources = a1
server.channels = ch1
server.sinks = s1
# the source configuration of a1
server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_test
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.deserializer = LINE
server.sources.a1.selector.type = replicating
server.sources.a1.fileHeaderKey = file
server.sources.a1.fileHeader = false
server.sources.a1.channels = ch1
Flume Operation Example 1 (4)
Configure the Flume channel.
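The deck does not show the channel configuration. A plausible sketch for ch1 as a memory channel (the capacity values are assumptions):

```properties
server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
server.channels.ch1.transactionCapacity = 1000
```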
Flume Operation Example 1 (5)
Configure the Flume sink.
server.sinks.s1.type = hdfs
server.sinks.s1.hdfs.path = /tmp/flume_avro
server.sinks.s1.hdfs.filePrefix = over_%{basename}
server.sinks.s1.hdfs.inUseSuffix = .tmp
server.sinks.s1.hdfs.rollInterval = 30
server.sinks.s1.hdfs.rollSize = 1024
server.sinks.s1.hdfs.rollCount = 10
server.sinks.s1.hdfs.batchSize = 1000
server.sinks.s1.hdfs.fileType = DataStream
server.sinks.s1.hdfs.maxOpenFiles = 5000
server.sinks.s1.hdfs.writeFormat = Writable
server.sinks.s1.hdfs.callTimeout = 10000
server.sinks.s1.hdfs.threadsPoolSize = 10
server.sinks.s1.hdfs.failcount = 10
server.sinks.s1.hdfs.fileCloseByEndEvent = true
server.sinks.s1.channel = ch1
Flume Operation Example 1 (6)
Name the configuration file of the Flume agent properties.properties.
Upload the configuration file.
Flume Operation Example 1 (7)
Produce data in the /tmp/log_test directory.
mv /var/log/log.11 /tmp/log_test
Flume Operation Example 2 (1)
Description
This example shows how Flume ingests clickstream logs to Kafka in real time for
subsequent analysis and processing.
Prepare data.
Create a log directory named /tmp/log_click on a node in the cluster.
Ingest data to Kafka topic_1028.
Flume Operation Example 2 (2)
Configure the Flume source.
server.sources = a1
server.channels = ch1
server.sinks = s1
# the source configuration of a1
server.sources.a1.type = spooldir
server.sources.a1.spoolDir = /tmp/log_click
server.sources.a1.fileSuffix = .COMPLETED
server.sources.a1.deletePolicy = never
server.sources.a1.trackerDir = .flumespool
server.sources.a1.ignorePattern = ^$
server.sources.a1.batchSize = 1000
server.sources.a1.inputCharset = UTF-8
server.sources.a1.selector.type = replicating
server.sources.a1.basenameHeaderKey = basename
server.sources.a1.deserializer.maxBatchLine = 1
server.sources.a1.deserializer.maxLineLength = 2048
server.sources.a1.channels = ch1
Flume Operation Example 2 (3)
Configure the Flume channel.
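The channel configuration is not shown in the deck. A plausible sketch for ch1 as a memory channel (capacity values are assumptions):

```properties
server.channels.ch1.type = memory
server.channels.ch1.capacity = 10000
server.channels.ch1.transactionCapacity = 1000
```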
Flume Operation Example 2 (4)
Configure the Flume sink.
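The sink configuration is not shown in the deck. Since this example ingests data into Kafka topic_1028, a plausible sketch uses Flume's built-in Kafka sink (the broker addresses are assumptions):

```properties
server.sinks.s1.type = org.apache.flume.sink.kafka.KafkaSink
server.sinks.s1.kafka.topic = topic_1028
server.sinks.s1.kafka.bootstrap.servers = broker1:9092,broker2:9092
server.sinks.s1.kafka.flumeBatchSize = 1000
server.sinks.s1.channel = ch1
```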
Flume Operation Example 2 (5)
Upload the configuration file to Flume.
Run the Kafka command to view the data ingested from Kafka topic_1028.
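A plausible form of that command, using the console consumer shipped with Kafka (the broker address is a placeholder):

```shell
kafka-console-consumer.sh --bootstrap-server broker1:9092 \
  --topic topic_1028 --from-beginning
```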
Summary
Thank you. Bring digital to every person, home, and organization for a fully connected, intelligent world.