SlideShare a Scribd company logo
Flume and HDFS
performance insights
Omid Vahdaty, Big Data Ninja
Test Details
● Fan In architecture
● Collector sending data, Input 100GB of pcap files, rate in ingress: 2MB/s.
● MiniFlume : Thrift source , mem channel, AVRO sink.
● MegaFlume: AVRO source, mem channel, HDFS SInk
● Hadoop
○ 4 DataNode
○ 1 NameNode
Mini Flume config
i. Virtual Machine: 8 CPU core, 16GB RAM. Ubuntu 14.04.
i. Environment: export JAVA_OPTS="-Xms100m -Xmx12000m Dcom.sun.management.jmxremote
i. Config
a. 2X thrift sources. thread count to 8.
b. 2X avro sinks. roll size 10000.
c. memory-channel.capacity = 100100
d. memory-channel.transactionCapacity=10010
Mini Flume File channel config
agent.channels.k1.type = file
agent.channels.k1.capacity = 82000000
agent.channels.k1.transactionCapacity = 2000009
agent.channels.k1.checkpointDir = /hadoop-data/hadoopuser/flume/tmp/checkpoint1
agent.channels.k1.dataDirs = /hadoop-data/hadoopuser/flume/data/dn1
agent.channels.k1.useDualCheckpoints=true
agent.channels.k1.backupCheckpointDir = /hadoop-data/hadoopuser/flume/tmp/checkpoint11
agent.channels.k1.use-fast-replay = true
agent.channels.k1.checkpointInterval=60000
Mega Flume config
i. Virtual Machine: 16 CPU core, 16GB RAM.
ii. Environment: export JAVA_OPTS="-Xms100m -Xmx12000m -Dcom.sun.management.jmxremote
iii. mem channel
a. memory-channel.capacity = 6200000
b. memory-channel.transactionCapacity=620000
iv. 2 avro sources 8 threads
v. 2 HDFS sinks
1. agent.sinks.hdfs-sink.hdfs.path = hdfs://master1:54310/data/EthernetCollector
2. agent.sinks.hdfs-sink.hdfs.fileType = DataStream
3. agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
4. agent.sinks.hdfs-sink.hdfs.filePrefix = ethernet
5. agent.sinks.hdfs-sink.hdfs.fileSuffix = .avro
6. agent.sinks.hdfs-sink.hdfs.inUsePrefix = part.
7. agent.sinks.hdfs-sink.serializer = avro_event
8. agent.sinks.hdfs-sink.hdfs.minBlockReplicas= 1
9. agent.sinks.hdfs-sink.hdfs.threadsPoolSize= 16
10. agent.sinks.hdfs-sink.hdfs.rollCount = 250030
11. agent.sinks.hdfs-sink.hdfs.rollSize = 0
12. agent.sinks.hdfs-sink.hdfs.batchSize = 620000
Hadoop Cluster
i. 4 hadoop data nodes , 12 Disks per node, 4 cores. 8GB RAM.
ii. Single Namenode, not HA , no Yarn. no ZooKeeper.
Flume Performance insights
i. thrift source has threads!!!
ii. VERY consuming in terms of resources
iii. file channel is MUCH slower than mem channel
iv. sink HDFS , stabilizes after long time (~ 30 min), each change takes time.
1. buffering into RAM – the RAM graphs goes up.
2. Stable ingress rate - RAM consumption is parallel line to the time axis.
3. small roll size - may crash the cluster
4. too much "pressure" on hadoop crushes the cluster, causes data-node loss. it even cause the
Name-node to enter safe mode, or even loss ALL data nodes.
5. Rule of thunk- for each sink - at least 2 data nodes with at least 2 data disks.
6. each batch request is divided to several threads. i.e 2MB/s is write speed on hadoop per sink
per node in parallel. read a little about flume performance metrics:
https://cwiki.apache.org/confluence/display/FLUME/Performance+Measurements+-+round+2
7. notice the above article: 20 data-node, 8 disks, 8 sinks.
Flume Performance insights
i. each flume should be considered for tuning of ingress and egress on same node. monitor the RAM
metrics via data-dog or another monitoring tool , and line should be parallel to time axis at all time.
ii. each source/sink may have threads - significant impact on performance.
iii. when increase batch size - all other parameters should increase with similar ratio.
iv. be sure to understand difference of batch vs. roll.
v. use unique numbers for debugging.
vi. each event size changes the buffering in the flume drastically. as long as it bound by min/max
values, your config is ok
vii. Consider giving unique prefix names per sink - multiply your tmp files in parallel.
File channel insights
● Consider 2 partitions per file channel : data and checkpoint
● Don't Consider file channels per port - unless you have separate partitions per
file channel.
● There is a limit in max capacity the flume can handle. - if you need to buffer a
week of down time, consider scaling out VIA fan in
● Consider Max file capacity written to disk - to be below RAM size, to utilize OS
caching.very efficient
● Consider increasing the transaction Capacity to 1M events for fast recovery
from filechannel reply.
File channel insights
● The File Channel takes a comma-separated list of data directories as the value of the
dataDirs parameter
● If the channel is stopped while it is checkpointing, the checkpoint may be incomplete
or corrupt. A corrupt or incomplete checkpoint could make the restart of the channel
extremely slow, as the channel would need to read and replay all data files.To avoid
this problem, it is recommended that the useDualCheckpointsparameter be set to
true and that backupCheckpointDir be set
● It will always retain two files per data directory, even if the files do not have any events to be taken
and committed. The channel will also not delete the file currently being written to.
● Using NFS-mounted disks with the File Channel is not a good idea
● Read: https://www.safaribooksonline.com/library/view/using-flume/9781491905326/ch04.html
● Increase checkpoint interval if you use dual backup. (checkpoint time was doubled if you use dual
Hadoop Performance Insights
Hadoop
i. more data node disk increases performance
ii. hadoop was designed for parallelism.but flume sinks - are not very
powerful. Yes you can add more sinks - but - you need stronger cluster - 1
sink = 2 data nodes with at least 2 data partions.
iii. you can add nodes dynamically easily while the cluster is running , so no
need to restart cluster.
iv. Increase IO buffer to 64KB or 128KB (assuming large block size)
v. Increase NN handlers to 20 X number datanode
General Performance Insights
Generally speaking:
i. very HARD to simulate the situation with 1 server. I over committed resources, causing failures
in the HDFS.
ii. the amount of thread is ENORMOUS! but very light, and short spanned. not CPU intensive.
iii. no Egress to engine was tested in this scenario
iv. not data correctness was tested.
v. very hard to fine tune flume - each change on file based sinks - take about 10 min to
reflect/stabilize in monitoring (unless it crushed first).
Stay in touch...
● Omid Vahdaty
● +972-54-2384178
Ad

More Related Content

What's hot (20)

Apache flume - Twitter Streaming
Apache flume - Twitter Streaming Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
Kowndinya Mannepalli
 
Flume
FlumeFlume
Flume
Chirag Ahuja
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
dwmclary
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
Ratnakar Pawar
 
Apache flume
Apache flumeApache flume
Apache flume
Ramakrishna kapa
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in Manufacturing
Rapheephan Thongkham-Uan
 
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...
JinfengHuang3
 
Extracting twitter data using apache flume
Extracting twitter data using apache flumeExtracting twitter data using apache flume
Extracting twitter data using apache flume
Bharat Khanna
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency
HBaseCon
 
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
Cloudera, Inc.
 
Pulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platformPulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platform
Matteo Merli
 
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed StorageHBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
Cloudera, Inc.
 
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEKafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
kawamuray
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
DataWorks Summit
 
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and  High-Demand EnvironmentHBaseCon 2015: HBase at Scale in an Online and  High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon
 
HBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at XiaomiHBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at Xiaomi
HBaseCon
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0
enissoz
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
GetInData
 
Filesystems, RPC and HDFS
Filesystems, RPC and HDFSFilesystems, RPC and HDFS
Filesystems, RPC and HDFS
Alexander Alten
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
Cloudera, Inc.
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
dwmclary
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
Ratnakar Pawar
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in Manufacturing
Rapheephan Thongkham-Uan
 
How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...How Orange Financial combat financial frauds over 50M transactions a day usin...
How Orange Financial combat financial frauds over 50M transactions a day usin...
JinfengHuang3
 
Extracting twitter data using apache flume
Extracting twitter data using apache flumeExtracting twitter data using apache flume
Extracting twitter data using apache flume
Bharat Khanna
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency
HBaseCon
 
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
Cloudera, Inc.
 
Pulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platformPulsar - Distributed pub/sub platform
Pulsar - Distributed pub/sub platform
Matteo Merli
 
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed StorageHBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
Cloudera, Inc.
 
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEKafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
kawamuray
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
DataWorks Summit
 
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and  High-Demand EnvironmentHBaseCon 2015: HBase at Scale in an Online and  High-Demand Environment
HBaseCon 2015: HBase at Scale in an Online and High-Demand Environment
HBaseCon
 
HBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at XiaomiHBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at Xiaomi
HBaseCon
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0
enissoz
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
GetInData
 
Filesystems, RPC and HDFS
Filesystems, RPC and HDFSFilesystems, RPC and HDFS
Filesystems, RPC and HDFS
Alexander Alten
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
Cloudera, Inc.
 

Similar to Flume and Hadoop performance insights (20)

Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scale
Alex Thompson
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
Edureka!
 
Backing up Wikipedia Databases
Backing up Wikipedia DatabasesBacking up Wikipedia Databases
Backing up Wikipedia Databases
Jaime Crespo
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
ScyllaDB
 
Marian Marinov, 1H Ltd.
Marian Marinov, 1H Ltd.Marian Marinov, 1H Ltd.
Marian Marinov, 1H Ltd.
Ontico
 
Performance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksPerformance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networks
Marian Marinov
 
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
Puppet
 
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
DataStax
 
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-finalShak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Tommy Lee
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
HBaseCon
 
Squid proxy server
Squid proxy serverSquid proxy server
Squid proxy server
Green Jb
 
Comparison of foss distributed storage
Comparison of foss distributed storageComparison of foss distributed storage
Comparison of foss distributed storage
Marian Marinov
 
20080528dublinpt3
20080528dublinpt320080528dublinpt3
20080528dublinpt3
Jeff Hammerbacher
 
z/VM 6.3 - Mudanças de Comportamento do hypervisor para suporte de partições ...
z/VM 6.3 - Mudanças de Comportamento do hypervisor para suporte de partições ...z/VM 6.3 - Mudanças de Comportamento do hypervisor para suporte de partições ...
z/VM 6.3 - Mudanças de Comportamento do hypervisor para suporte de partições ...
Joao Galdino Mello de Souza
 
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Redis Labs
 
Storage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, WhiptailStorage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, Whiptail
Internet World
 
Challenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightChallenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan Lambright
Gluster.org
 
03 Hadoop
03 Hadoop03 Hadoop
03 Hadoop
Omid Djoudi
 
QNAP TS-832PX-4G.pdf
QNAP TS-832PX-4G.pdfQNAP TS-832PX-4G.pdf
QNAP TS-832PX-4G.pdf
GustavoLippera1
 
Start Counting: How We Unlocked Platform Efficiency and Reliability While Sav...
Start Counting: How We Unlocked Platform Efficiency and Reliability While Sav...Start Counting: How We Unlocked Platform Efficiency and Reliability While Sav...
Start Counting: How We Unlocked Platform Efficiency and Reliability While Sav...
VMware Tanzu
 
Building Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scaleBuilding Apache Cassandra clusters for massive scale
Building Apache Cassandra clusters for massive scale
Alex Thompson
 
Hadoop Cluster With High Availability
Hadoop Cluster With High AvailabilityHadoop Cluster With High Availability
Hadoop Cluster With High Availability
Edureka!
 
Backing up Wikipedia Databases
Backing up Wikipedia DatabasesBacking up Wikipedia Databases
Backing up Wikipedia Databases
Jaime Crespo
 
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at DropboxOptimizing Servers for High-Throughput and Low-Latency at Dropbox
Optimizing Servers for High-Throughput and Low-Latency at Dropbox
ScyllaDB
 
Marian Marinov, 1H Ltd.
Marian Marinov, 1H Ltd.Marian Marinov, 1H Ltd.
Marian Marinov, 1H Ltd.
Ontico
 
Performance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksPerformance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networks
Marian Marinov
 
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
Puppet
 
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
A Detailed Look At cassandra.yaml (Edward Capriolo, The Last Pickle) | Cassan...
DataStax
 
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-finalShak larry-jeder-perf-and-tuning-summit14-part2-final
Shak larry-jeder-perf-and-tuning-summit14-part2-final
Tommy Lee
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
HBaseCon
 
Squid proxy server
Squid proxy serverSquid proxy server
Squid proxy server
Green Jb
 
Comparison of foss distributed storage
Comparison of foss distributed storageComparison of foss distributed storage
Comparison of foss distributed storage
Marian Marinov
 
z/VM 6.3 - Mudanças de Comportamento do hypervisor para suporte de partições ...
z/VM 6.3 - Mudanças de Comportamento do hypervisor para suporte de partições ...z/VM 6.3 - Mudanças de Comportamento do hypervisor para suporte de partições ...
z/VM 6.3 - Mudanças de Comportamento do hypervisor para suporte de partições ...
Joao Galdino Mello de Souza
 
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Dynomite: A Highly Available, Distributed and Scalable Dynamo Layer--Ioannis ...
Redis Labs
 
Storage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, WhiptailStorage and performance- Batch processing, Whiptail
Storage and performance- Batch processing, Whiptail
Internet World
 
Challenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan LambrightChallenges with Gluster and Persistent Memory with Dan Lambright
Challenges with Gluster and Persistent Memory with Dan Lambright
Gluster.org
 
Start Counting: How We Unlocked Platform Efficiency and Reliability While Sav...
Start Counting: How We Unlocked Platform Efficiency and Reliability While Sav...Start Counting: How We Unlocked Platform Efficiency and Reliability While Sav...
Start Counting: How We Unlocked Platform Efficiency and Reliability While Sav...
VMware Tanzu
 
Ad

More from Omid Vahdaty (20)

Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
Omid Vahdaty
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data Demystified
Omid Vahdaty
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
Omid Vahdaty
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
Omid Vahdaty
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...
Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data Demystified
Omid Vahdaty
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
Omid Vahdaty
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
Omid Vahdaty
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
Omid Vahdaty
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
Omid Vahdaty
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
AWS Big Data Demystified #4 data governance demystified [security, networ...
AWS Big Data Demystified #4   data governance demystified   [security, networ...AWS Big Data Demystified #4   data governance demystified   [security, networ...
AWS Big Data Demystified #4 data governance demystified [security, networ...
Omid Vahdaty
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
Omid Vahdaty
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
Omid Vahdaty
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Omid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
Omid Vahdaty
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
Omid Vahdaty
 
Data Pipline Observability meetup
Data Pipline Observability meetup Data Pipline Observability meetup
Data Pipline Observability meetup
Omid Vahdaty
 
Couchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data DemystifiedCouchbase Data Platform | Big Data Demystified
Couchbase Data Platform | Big Data Demystified
Omid Vahdaty
 
Machine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data DemystifiedMachine Learning Essentials Demystified part2 | Big Data Demystified
Machine Learning Essentials Demystified part2 | Big Data Demystified
Omid Vahdaty
 
Machine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data DemystifiedMachine Learning Essentials Demystified part1 | Big Data Demystified
Machine Learning Essentials Demystified part1 | Big Data Demystified
Omid Vahdaty
 
The technology of fake news between a new front and a new frontier | Big Dat...
The technology of fake news  between a new front and a new frontier | Big Dat...The technology of fake news  between a new front and a new frontier | Big Dat...
The technology of fake news between a new front and a new frontier | Big Dat...
Omid Vahdaty
 
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3  Big Data in 200 km/h | AWS Big Data Demystified #1.3
Big Data in 200 km/h | AWS Big Data Demystified #1.3
Omid Vahdaty
 
Making your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data DemystifiedMaking your analytics talk business | Big Data Demystified
Making your analytics talk business | Big Data Demystified
Omid Vahdaty
 
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
BI STRATEGY FROM A BIRD'S EYE VIEW (How to become a trusted advisor) | Omri H...
Omid Vahdaty
 
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
AI and Big Data in Health Sector Opportunities and challenges | Big Data Demy...
Omid Vahdaty
 
Aerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data DemystifiedAerospike meetup july 2019 | Big Data Demystified
Aerospike meetup july 2019 | Big Data Demystified
Omid Vahdaty
 
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
ALIGNING YOUR BI OPERATIONS WITH YOUR CUSTOMERS' UNSPOKEN NEEDS, by Eyal Stei...
Omid Vahdaty
 
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
AWS Big Data Demystified #1.2 | Big Data architecture lessons learned
Omid Vahdaty
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
AWS Big Data Demystified #4 data governance demystified [security, networ...
AWS Big Data Demystified #4   data governance demystified   [security, networ...AWS Big Data Demystified #4   data governance demystified   [security, networ...
AWS Big Data Demystified #4 data governance demystified [security, networ...
Omid Vahdaty
 
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r...
Omid Vahdaty
 
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive AWS Big Data Demystified #2 |  Athena, Spectrum, Emr, Hive
AWS Big Data Demystified #2 | Athena, Spectrum, Emr, Hive
Omid Vahdaty
 
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...Amazon aws big data demystified | Introduction to streaming and messaging flu...
Amazon aws big data demystified | Introduction to streaming and messaging flu...
Omid Vahdaty
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Emr spark tuning demystified
Emr spark tuning demystifiedEmr spark tuning demystified
Emr spark tuning demystified
Omid Vahdaty
 
Emr zeppelin & Livy demystified
Emr zeppelin & Livy demystifiedEmr zeppelin & Livy demystified
Emr zeppelin & Livy demystified
Omid Vahdaty
 
Ad

Recently uploaded (20)

AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
The Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLabThe Gaussian Process Modeling Module in UQLab
The Gaussian Process Modeling Module in UQLab
Journal of Soft Computing in Civil Engineering
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Introduction to FLUID MECHANICS & KINEMATICS
Introduction to FLUID MECHANICS &  KINEMATICSIntroduction to FLUID MECHANICS &  KINEMATICS
Introduction to FLUID MECHANICS & KINEMATICS
narayanaswamygdas
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
samueljackson3773
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Journal of Soft Computing in Civil Engineering
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
theory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptxtheory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptx
sanchezvanessa7896
 
QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)
rccbatchplant
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Introduction to FLUID MECHANICS & KINEMATICS
Introduction to FLUID MECHANICS &  KINEMATICSIntroduction to FLUID MECHANICS &  KINEMATICS
Introduction to FLUID MECHANICS & KINEMATICS
narayanaswamygdas
 
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdfRICS Membership-(The Royal Institution of Chartered Surveyors).pdf
RICS Membership-(The Royal Institution of Chartered Surveyors).pdf
MohamedAbdelkader115
 
Raish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdfRaish Khanji GTU 8th sem Internship Report.pdf
Raish Khanji GTU 8th sem Internship Report.pdf
RaishKhanji
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)International Journal of Distributed and Parallel systems (IJDPS)
International Journal of Distributed and Parallel systems (IJDPS)
samueljackson3773
 
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
211421893-M-Tech-CIVIL-Structural-Engineering-pdf.pdf
inmishra17121973
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptxExplainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
Explainable-Artificial-Intelligence-XAI-A-Deep-Dive (1).pptx
MahaveerVPandit
 
theory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptxtheory-slides-for react for beginners.pptx
theory-slides-for react for beginners.pptx
sanchezvanessa7896
 
QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)QA/QC Manager (Quality management Expert)
QA/QC Manager (Quality management Expert)
rccbatchplant
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 

Flume and Hadoop performance insights

  • 1. Flume and HDFS performance insights Omid Vahdaty, Big Data Ninja
  • 2. Test Details ● Fan In architecture ● Collector sending data, Input 100GB of pcap files, rate in ingress: 2MB/s. ● MiniFlume : Thrift source , mem channel, AVRO sink. ● MegaFlume: AVRO source, mem channel, HDFS SInk ● Hadoop ○ 4 DataNode ○ 1 NameNode
  • 3. Mini Flume config i. Virtual Machine: 8 CPU core, 16GB RAM. Ubuntu 14.04. i. Environment: export JAVA_OPTS="-Xms100m -Xmx12000m Dcom.sun.management.jmxremote i. Config a. 2X thrift sources. thread count to 8. b. 2X avro sinks. roll size 10000. c. memory-channel.capacity = 100100 d. memory-channel.transactionCapacity=10010
  • 4. Mini Flume File channel config agent.channels.k1.type = file agent.channels.k1.capacity = 82000000 agent.channels.k1.transactionCapacity = 2000009 agent.channels.k1.checkpointDir = /hadoop-data/hadoopuser/flume/tmp/checkpoint1 agent.channels.k1.dataDirs = /hadoop-data/hadoopuser/flume/data/dn1 agent.channels.k1.useDualCheckpoints=true agent.channels.k1.backupCheckpointDir = /hadoop-data/hadoopuser/flume/tmp/checkpoint11 agent.channels.k1.use-fast-replay = true agent.channels.k1.checkpointInterval=60000
  • 5. Mega Flume config i. Virtual Machine: 16 CPU core, 16GB RAM. ii. Environment: export JAVA_OPTS="-Xms100m -Xmx12000m -Dcom.sun.management.jmxremote iii. mem channel a. memory-channel.capacity = 6200000 b. memory-channel.transactionCapacity=620000 iv. 2 avro sources 8 threads v. 2 HDFS sinks 1. agent.sinks.hdfs-sink.hdfs.path = hdfs://master1:54310/data/EthernetCollector 2. agent.sinks.hdfs-sink.hdfs.fileType = DataStream 3. agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true 4. agent.sinks.hdfs-sink.hdfs.filePrefix = ethernet 5. agent.sinks.hdfs-sink.hdfs.fileSuffix = .avro 6. agent.sinks.hdfs-sink.hdfs.inUsePrefix = part. 7. agent.sinks.hdfs-sink.serializer = avro_event 8. agent.sinks.hdfs-sink.hdfs.minBlockReplicas= 1 9. agent.sinks.hdfs-sink.hdfs.threadsPoolSize= 16 10. agent.sinks.hdfs-sink.hdfs.rollCount = 250030 11. agent.sinks.hdfs-sink.hdfs.rollSize = 0 12. agent.sinks.hdfs-sink.hdfs.batchSize = 620000
  • 6. Hadoop Cluster i. 4 hadoop data nodes , 12 Disks per node, 4 cores. 8GB RAM. ii. Single Namenode, not HA , no Yarn. no ZooKeeper.
  • 7. Flume Performance insights i. thrift source has threads!!! ii. VERY consuming in terms of resources iii. file channel is MUCH slower than mem channel iv. sink HDFS , stabilizes after long time (~ 30 min), each change takes time. 1. buffering into RAM – the RAM graphs goes up. 2. Stable ingress rate - RAM consumption is parallel line to the time axis. 3. small roll size - may crash the cluster 4. too much "pressure" on hadoop crushes the cluster, causes data-node loss. it even cause the Name-node to enter safe mode, or even loss ALL data nodes. 5. Rule of thunk- for each sink - at least 2 data nodes with at least 2 data disks. 6. each batch request is divided to several threads. i.e 2MB/s is write speed on hadoop per sink per node in parallel. read a little about flume performance metrics: https://cwiki.apache.org/confluence/display/FLUME/Performance+Measurements+-+round+2 7. notice the above article: 20 data-node, 8 disks, 8 sinks.
  • 8. Flume Performance insights i. each flume should be considered for tuning of ingress and egress on same node. monitor the RAM metrics via data-dog or another monitoring tool , and line should be parallel to time axis at all time. ii. each source/sink may have threads - significant impact on performance. iii. when increase batch size - all other parameters should increase with similar ratio. iv. be sure to understand difference of batch vs. roll. v. use unique numbers for debugging. vi. each event size changes the buffering in the flume drastically. as long as it bound by min/max values, your config is ok vii. Consider giving unique prefix names per sink - multiply your tmp files in parallel.
  • 9. File channel insights ● Consider 2 partitions per file channel : data and checkpoint ● Don't Consider file channels per port - unless you have separate partitions per file channel. ● There is a limit in max capacity the flume can handle. - if you need to buffer a week of down time, consider scaling out VIA fan in ● Consider Max file capacity written to disk - to be below RAM size, to utilize OS caching.very efficient ● Consider increasing the transaction Capacity to 1M events for fast recovery from filechannel reply.
  • 10. File channel insights ● The File Channel takes a comma-separated list of data directories as the value of the dataDirs parameter ● If the channel is stopped while it is checkpointing, the checkpoint may be incomplete or corrupt. A corrupt or incomplete checkpoint could make the restart of the channel extremely slow, as the channel would need to read and replay all data files.To avoid this problem, it is recommended that the useDualCheckpointsparameter be set to true and that backupCheckpointDir be set ● It will always retain two files per data directory, even if the files do not have any events to be taken and committed. The channel will also not delete the file currently being written to. ● Using NFS-mounted disks with the File Channel is not a good idea ● Read: https://www.safaribooksonline.com/library/view/using-flume/9781491905326/ch04.html ● Increase checkpoint interval if you use dual backup. (checkpoint time was doubled if you use dual
  • 11. Hadoop Performance Insights Hadoop i. more data node disk increases performance ii. hadoop was designed for parallelism.but flume sinks - are not very powerful. Yes you can add more sinks - but - you need stronger cluster - 1 sink = 2 data nodes with at least 2 data partions. iii. you can add nodes dynamically easily while the cluster is running , so no need to restart cluster. iv. Increase IO buffer to 64KB or 128KB (assuming large block size) v. Increase NN handlers to 20 X number datanode
  • 12. General Performance Insights Generally speaking: i. very HARD to simulate the situation with 1 server. I over committed resources, causing failures in the HDFS. ii. the amount of thread is ENORMOUS! but very light, and short spanned. not CPU intensive. iii. no Egress to engine was tested in this scenario iv. not data correctness was tested. v. very hard to fine tune flume - each change on file based sinks - take about 10 min to reflect/stabilize in monitoring (unless it crushed first).
  • 13. Stay in touch... ● Omid Vahdaty ● +972-54-2384178