Big Data Lecture 3
Analytics
Dr. Iman Ahmed ElSayed
5. Hadoop’s Rise
6. Evolution of HDFS
➢ Hadoop is written in Java.
➢ Hadoop is the solution to the problem of big data, i.e. storing and processing big data with some extra capabilities.
Big Data Hadoop architecture and components
1. MapReduce
Big Data Map() function
Map Task:
➢ Map: A map is a user-defined function that processes the tuples obtained from the record reader. For each input tuple, the Map() function generates zero, one, or multiple key-value pairs.
➢ Combiner: The combiner groups and locally aggregates the data in the Map workflow before it is passed on to the Reducer.
Assume you have five files, and each file contains two columns (a key and a value in Hadoop terms) that represent a city and the corresponding temperature recorded in that city on various measurement days (record reading).
Out of all the data collected, you want to find the maximum temperature for each city across the data files.
Using the MapReduce framework, you can break this down into five map tasks,
where each mapper works on one of the five files.
The mapper task goes through the data and returns the maximum temperature for
each city. For example, the results produced from one mapper task for the data
above would look like this:
(Toronto, 20) (Whitby, 25) (New York, 22) (Rome, 33) — this per-mapper aggregation is what the combiner performs.
Assume the other four mapper tasks (working on the other four files not shown here) produced their own intermediate results; the partitioner then decides which reduce task receives each city key.
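To make this concrete, here is a minimal Hadoop mapper sketch in Java for the maximum-temperature example. The input format (comma-separated lines such as "Toronto,20") and the class name MaxTempMapper are assumptions made for illustration; they are not part of the lecture material.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each record is assumed to be "city,temperature", e.g. "Toronto,20"
        String[] parts = line.toString().split(",");
        if (parts.length == 2) {
            // Emit one (city, temperature) key-value pair per record
            context.write(new Text(parts[0].trim()),
                          new IntWritable(Integer.parseInt(parts[1].trim())));
        }
    }
}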
Big Data Reduce() function
Reduce Task:
➢ Shuffle and Sort: The Reducer's task starts with this step. The process in which the Mapper generates the intermediate key-value pairs and transfers them to the Reducer task is known as shuffling. Through the shuffling process the system sorts the data by its key value. Shuffling begins as soon as some of the map tasks are done, which is why it is a faster process: it does not wait for all the Mapper tasks to complete.
➢ Reduce: The main task of the Reduce function is to gather the tuples generated by the Map and then perform sorting and aggregation on those key-value pairs depending on their key element.
➢ OutputFormat: Once all the operations are performed, the key-value pairs are written into a file with the help of the record writer, each record on a new line, with the key and value separated by a space.
Example Cont’d:
All five of these output streams would be fed into the reduce tasks, which combine the
input results and output a single value for each city, producing an output:
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
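A matching reducer sketch for this example: after shuffle and sort, all temperatures for one city arrive together, and the reducer emits the maximum. The class name MaxTempReducer is an assumption for illustration only.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text city, Iterable<IntWritable> temps, Context context)
            throws IOException, InterruptedException {
        // All values for one key (city) are grouped together by shuffle and sort
        int max = Integer.MIN_VALUE;
        for (IntWritable t : temps) {
            max = Math.max(max, t.get());
        }
        // Emit a single (city, maximum temperature) pair, e.g. (Toronto, 32)
        context.write(city, new IntWritable(max));
    }
}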
Big Data Map - Reduce functions
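To tie the map and reduce functions together, a driver class configures and submits the job. The sketch below reuses the hypothetical MaxTempMapper and MaxTempReducer classes from the earlier sketches; because taking a maximum is associative, the reducer can also serve as the combiner.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTempDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTempDriver.class);
        job.setMapperClass(MaxTempMapper.class);
        job.setCombinerClass(MaxTempReducer.class); // local per-mapper aggregation
        job.setReducerClass(MaxTempReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}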
Big Data Hadoop distributed file system (HDFS)
➢ Hadoop comes with a distributed file system called HDFS, which stands for Hadoop Distributed File System.
Files in HDFS are split into blocks of 64 MB or 128 MB before they are stored on the cluster, as sketched below.
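As a minimal sketch (not part of the lecture material), the block size can be set per client through the Hadoop Configuration API; the property name dfs.blocksize and the 128 MB value below assume Hadoop 2 or later. The same property can also be set cluster-wide in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Files created by this client will be split into 128 MB blocks
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to " + fs.getUri());
    }
}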
Each daemon runs separately in its own JVM. These daemons have specific roles; some exist only on one server, some exist across multiple servers.
HDFS Architecture
1. NameNode
2. DataNode
3. Secondary NameNode
4. HDFS Client
5. Block Structure
6. JobTracker (master daemon)
7. TaskTracker (slave daemon)
Cont’d: NameNode
The Namenode maintains the entire metadata in RAM, which helps clients
receive quick responses to read requests.
The Namenode daemon also maintains a persistent checkpoint of the
metadata in a file stored on the disk called the fsimage file. Whenever a file is
placed, deleted, or updated in the cluster, an entry for this action is recorded in a file called the edits logfile.
Key Responsibilities:
• Maintaining the filesystem tree and metadata.
• Managing the mapping of file blocks to DataNodes (a client-side query of this mapping is sketched below).
• Ensuring data integrity and coordinating the replication of data blocks.
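Because the NameNode serves the block-to-DataNode mapping, a client can query it through the FileSystem API. A minimal sketch, with a hypothetical file path used only for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Hypothetical path used for illustration
        FileStatus status = fs.getFileStatus(new Path("/data/weather/records.csv"));
        // The NameNode answers this metadata query: which DataNodes hold each block
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }
    }
}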
DataNode daemon
➢ Performs the dirty work of the distributed filesystem.
➢ Acts as a slave node and is responsible for storing the actual files in HDFS.
➢ The files are split into data blocks that are distributed across the cluster.
➢ The blocks are typically 64 MB or 128 MB in size. The block size is a configurable parameter.
➢ The file blocks in a Hadoop cluster are also replicated to other datanodes for redundancy so that no data is lost if a datanode daemon fails (see the sketch after this list).
➢ The datanode daemon sends information to the namenode daemon about the files and blocks stored in that node and responds to the namenode daemon for all filesystem operations.
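As a small illustration of the replication point above, the replication factor of an existing file can be changed through the FileSystem API; the path and factor below are assumptions for illustration only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Ask HDFS to keep 3 copies of this (hypothetical) file's blocks on different DataNodes
        boolean accepted = fs.setReplication(new Path("/data/weather/records.csv"), (short) 3);
        System.out.println("Replication change accepted: " + accepted);
    }
}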
Big Data Hadoop distributed file system (HDFS)
25
Big Data Hadoop distributed file system (HDFS)
26
Big Data Hadoop distributed file system (HDFS)
It is the responsibility of the namenode daemon to maintain a list of the files and their corresponding locations on the cluster. Whenever a client needs to access a file, the namenode daemon provides the location of the file to the client, and the client then accesses the file directly from the datanode daemon.
Key Responsibilities:
➢ Performing block creation, deletion, and replication upon instruction from the
NameNode.
Since the fsimage file is not updated for every operation, it is possible for the edits logfile inside the NameNode to grow into a very large file.
A restart of the namenode service would then become very slow, because all the actions in the large edits logfile would have to be applied to the fsimage file.
This slow boot-up time can be avoided by using the secondary namenode daemon.
Secondary Name Node:
Job Tracker (the foreman of the workers):
➢ Along with the heartbeat, the tasktracker also sends the free slots available within it to process tasks. The tasktracker daemon starts and monitors the map and reduce tasks and sends progress/status information back to the jobtracker daemon.
Block Structure:
Key Features:
✓ A large block size reduces the overhead of managing a large number of blocks.
✓ Blocks are replicated across multiple DataNodes to ensure data availability and fault tolerance.
HDFS Client:
Key Responsibilities:
✓ Facilitating interaction between the user/application and HDFS.
✓ Communicating with the NameNode for metadata and with DataNodes for data access (as sketched below).
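A minimal sketch of the HDFS client in action: the FileSystem object asks the NameNode for metadata and then streams the file contents directly from the DataNodes. The file path is hypothetical.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // The FileSystem object is the HDFS client entry point
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataInputStream in = fs.open(new Path("/data/weather/records.csv"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // data bytes come from the DataNodes, not the NameNode
            }
        }
    }
}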
Big Data Hadoop architecture and components
A bottleneck will occur sooner or later.