UNIT-3 Hadoop and MapReduce Programming
Big Data and Analytics
Seema Acharya
Subhashini Chellappan
Introduction to Hadoop
The key consideration (the rationale behind its huge popularity) is:
Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS).
HDFS:
(a) Storage component.
(b) Distributes data across several nodes.
(c) Natively redundant.
MapReduce:
(a) Computational framework.
(b) Splits a task across multiple nodes.
(c) Processes data in parallel.
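The division of labour above (split a task, process the pieces in parallel, combine the results) can be sketched without a cluster. The following is a minimal plain-Java imitation of the map → shuffle → reduce flow for word count; it does not use the Hadoop API, and the class name is hypothetical:

```java
import java.util.*;

public class WordCountSketch {
    public static void main(String[] args) {
        List<String> lines = List.of("big data", "big analytics");

        // Map phase: emit a (word, 1) pair for every word in every line.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split(" "))
                pairs.add(Map.entry(word, 1));

        // Shuffle + reduce: group the pairs by key and sum the values.
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            counts.merge(p.getKey(), p.getValue(), Integer::sum);

        System.out.println(counts);  // {analytics=1, big=2, data=1}
    }
}
```

In real MapReduce the map calls run on the DataNodes holding the input blocks, and the shuffle moves each key's pairs to one reducer over the network.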
Flume, Oozie, Mahout, Hive, Pig, Sqoop, HBase
4. Optimized for high throughput (HDFS leverages a large block size and moves computation to where the data is stored).
7. You can realize the power of HDFS when you perform read or write
on large files (gigabytes and larger).
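The large block size mentioned above is easy to quantify. A small illustration, assuming the HDFS 2.x default block size of 128 MB (Hadoop 1.x used 64 MB; the class name is hypothetical):

```java
public class BlockCount {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;   // assumed HDFS 2.x default block size
        long fileSize  = 1024L * 1024 * 1024;  // a 1 GB file

        // Ceiling division: how many blocks the file occupies.
        long blocks = (fileSize + blockSize - 1) / blockSize;
        System.out.println(blocks);  // 8 blocks, spread across DataNodes
    }
}
```

With so few, large blocks per file, the NameNode tracks little metadata and reads stream sequentially, which is why HDFS shines on gigabyte-scale files rather than many small ones.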
NameNode: Master daemon; maintains the HDFS namespace and the file-to-block metadata.
DataNode: Slave daemon; stores the actual data blocks and serves read/write requests.
SecondaryNameNode: Housekeeping daemon; periodically merges the NameNode's edit log into the filesystem image (fsimage).
Objective: To copy a file from local file system to HDFS via copyFromLocal command
Act:
hadoop fs -copyFromLocal /root/sample/test.txt /sample/testsample.txt
Objective: To copy a file from Hadoop file system to local file system via copyToLocal command
Act:
hadoop fs -copyToLocal /sample/test.txt /root/sample/testsample1.txt
Big Data and Analytics by Seema Acharya and Subhashini Chellappan
Copyright 2015, WILEY INDIA PVT. LTD.
HDFS Commands
Updates | Read and write many times | Write once, read many times
4. Hadoop 1.0 is not suitable for machine learning algorithms, graphs, and other memory-intensive algorithms.
Hadoop 2 : HDFS
► Major components:
►Namespace
►Block storage
► Features:
►Horizontal scalability
►High availability
Fig: Active and Passive NameNode Interaction
Hive: Hive is a Data Warehousing Layer on top of Hadoop. Analysis and queries
can be done using an SQL-like language. Hive can be used to do ad-hoc queries,
summarization, and data analysis. Figure 5.31 depicts Hive in the Hadoop
ecosystem.
Sqoop: Sqoop is a tool that transfers data between Hadoop and relational databases. With the help of Sqoop, you can import data from an RDBMS into HDFS and export it back. Figure 5.32 depicts Sqoop in the Hadoop ecosystem.
Each task processes a small subset of the data that has been assigned to it. This way, Hadoop distributes the load across the cluster.
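That splitting can be sketched in plain Java: divide the input records into fixed-size subsets, one per task. This is only an illustration (real input splits are byte ranges over HDFS blocks, and the class name is hypothetical):

```java
import java.util.*;

public class SplitSketch {
    public static void main(String[] args) {
        // Ten input records to distribute.
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add(i);

        int splitSize = 4;  // records per task (illustrative value)
        List<List<Integer>> splits = new ArrayList<>();
        for (int i = 0; i < records.size(); i += splitSize)
            splits.add(records.subList(i, Math.min(i + splitSize, records.size())));

        System.out.println(splits.size());  // 3 tasks: [0..3], [4..7], [8..9]
    }
}
```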
• RecordReader
• Map
• Combiner
• Partitioner
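The Partitioner decides which reducer receives each intermediate key; Hadoop's default HashPartitioner routes by the key's hash. A plain-Java sketch of the same rule (the class and method names are hypothetical):

```java
public class PartitionSketch {
    // Same rule as Hadoop's default HashPartitioner:
    // mask off the sign bit, then take the remainder modulo the reducer count.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String key : new String[] {"big", "data", "hadoop"})
            System.out.println(key + " -> reducer " + partitionFor(key, reducers));
        // Equal keys always hash to the same partition, so every value
        // for a given key arrives at a single reduce task.
    }
}
```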
Reducer
• Objective
• Input data
• Act
• Output data
You can specify compression format in the Driver Program as shown below:
conf.setBoolean("mapred.output.compress", true);
conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);
Column A | Column B
HDFS | DataNode
MapReduce Programming | NameNode
Master node | Processing Data
Slave node | Google File System and MapReduce
Hadoop Implementation | Storage