Big data unit 3 own
MapReduce Workflow
1. Input Data: The data to be processed is stored in the Hadoop Distributed File
System (HDFS).
2. Splitting: Hadoop divides the data into smaller chunks called splits.
3. Mapping: The map() function processes each split separately on different machines
(mappers).
4. Combining (Optional): The output from mappers can be combined before moving to
the next step to reduce data transfer.
5. Shuffling & Sorting: The mapped data is shuffled and sorted based on keys.
6. Reducing: The reduce() function aggregates and processes the mapped data.
7. Final Output: The processed data is saved in HDFS.
Steps in Detail:
Intermediate files (between mappers and reducers) are stored locally instead of
in HDFS to improve speed.
Data Flow in MapReduce
1. Map Function – This function processes input data and transforms it into key-value
pairs.
2. Reduce Function – This function takes all values associated with the same key and
aggregates them to produce the final result.
The input data is split into smaller chunks. Each chunk is assigned to a "mapper,"
which processes it in parallel with others.
The mapper generates key-value pairs based on the data.
These key-value pairs are then sorted and grouped by key before moving to the
"reducer."
The reducer takes each key and combines all its values to generate the final output.
The final output is stored in a distributed file system for further use.
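To make this data flow concrete, here is a minimal word-count sketch using the standard Hadoop Java API (class names are illustrative; each public class would normally live in its own .java file). The mapper emits (word, 1) pairs, and after the shuffle the reducer receives each word with all of its 1s and sums them.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: turns each input line into (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);          // emit (word, 1)
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) after shuffle & sort and sums the counts.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));   // final (word, total) pair
    }
}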
Components of MapReduce
Limitations of MapReduce
Task Assignment
1. Map Tasks:
o Assigned first, since they need to finish before reduce tasks can start.
o The system requests containers for mappers from the Resource Manager.
o Once 5% of the map tasks are completed, reduce tasks are requested.
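The 5% threshold is controlled by a job configuration property; below is a minimal sketch of setting it explicitly in the driver (the value shown is simply the default, so this is illustrative only, and the class name is made up for the example).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowStartExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Request reduce containers once 5% of the map tasks have completed (the default).
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);
        Job job = Job.getInstance(conf, "slow-start-example");
        // ... mapper, reducer, and input/output paths would be configured here ...
    }
}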
Task Execution
o The Node Manager starts the container, which runs the YarnChild process.
o The map or reduce function executes and processes data.
Since jobs can run for a long time, the system provides real-time progress updates.
For map tasks, progress is tracked by the amount of data processed.
For reduce tasks, progress is estimated from how much of the reduce input has been processed, spread across the copy, sort, and reduce phases.
Job Completion
When all tasks are done, the Application Master marks the job as Successful.
The system notifies the user and cleans up temporary data.
The job details are saved in the Job History Server for future reference.
3) YARN (Yet Another Resource Negotiator)
1. Manages Cluster Resources – Allocates CPU, memory, network, and storage for jobs.
2. Schedules and Monitors Jobs – Decides which job runs where and ensures it
completes successfully.
1. Resource Manager (Master Node) – Controls resource allocation for the entire
cluster.
2. Node Manager (Worker Nodes) – Runs tasks on each machine, monitors usage, and
reports to the Resource Manager.
3. Application Master – Manages each job’s execution, requests resources, and tracks
progress.
4. Container – A unit of allocated resources (CPU, memory, etc.) for running tasks.
The Resource Manager is the main decision-maker, while the Node Manager
handles execution on individual machines.
✅ Advantages:
✔Scalability – Can handle a large number of nodes.
✔Better Utilization – Manages resources dynamically instead of fixed slots.
✔Supports Multiple Versions – Different versions of MapReduce can run
together.
❌ Disadvantage:
✖Single Point of Failure – In Hadoop 1.0, the JobTracker was a single point of
failure; YARN improves on this, although the Resource Manager can still fail
unless high availability is configured.
1. Failures in Classic MapReduce
1️ Task Failure
➡️ The TaskTracker detects the failure and frees up its slot to run a new task;
the JobTracker then reschedules the failed task on another node.
2️ TaskTracker Failure
➡️ The JobTracker notices the missing heartbeats and reschedules that
TaskTracker's tasks on other TaskTrackers.
3️ JobTracker Failure
✅ The only solution is to restart the JobTracker and resubmit all running
jobs.
✅ This problem is why YARN was created!
2. Failures in YARN
YARN is better at handling failures than Classic MapReduce. It has three main
types of failures:
1️ Task Failure
2️ Node Manager Failure
3️ Resource Manager Failure
1️ Task Failure (Similar to Classic MapReduce)
✅ The Application Master notices the failed attempt and retries the task in a
new container.
2️ Node Manager Failure
✅ Any running tasks or Application Masters on that node are recovered using
built-in mechanisms.
Job Scheduling in Hadoop
Hadoop provides three main job schedulers:
1️ FIFO Scheduler
2️ Fair Scheduler
3️ Capacity Scheduler
Each scheduler has different ways of handling tasks, and each has its own
advantages and disadvantages.
Challenges in Job Scheduling
2️ Load Balancing – If some data blocks are much bigger than others, some
nodes do more work, leading to imbalance. Hadoop’s partitioning algorithm
tries to distribute data equally, but uneven key distribution can cause issues.
6️ Data Locality – The closer the computation is to the data, the faster the
processing.
7️ Synchronization – The reduce phase needs intermediate data from the map
phase. Ensuring smooth transfer is critical for performance.
1️ FIFO Scheduler
🔹 How It Works?
Jobs are placed in a queue and run in the order they were submitted: the first
job gets the cluster's resources until it finishes, then the next job runs.
🔹 Example:
Imagine you are in a ticket queue. The person who arrives first gets served
first, and the others have to wait in line.
2️ Fair Scheduler
🔹 How It Works?
This scheduler divides cluster resources fairly among users and jobs.
If only one job is running, it gets all the resources.
As more jobs arrive, resources are evenly distributed.
Jobs are placed into pools (groups) based on user-defined settings, such
as user name.
🔹 Key Features:
🔹 Example:
Imagine you are at a buffet. If you're the only person there, you can take as
much food as you want. But as more people arrive, food is shared equally
among all.
3️ Capacity Scheduler
🔹 How It Works?
The cluster is divided into queues, one per team or organization, and each
queue is guaranteed a configured share of the total capacity.
🔹 Key Features:
Guarantees a minimum capacity for each queue.
Uses security mechanisms to ensure each team can access only their
own queue.
Supports hierarchical queues—can have sub-queues within a main
queue.
🔹 Example:
Imagine a company with three teams: Engineering, Data Science, and
Marketing. Each team gets its own queue to ensure fair resource allocation. If
Marketing is not using its resources, Engineering can temporarily use them.
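A small sketch of how a job could be submitted to one of those queues; the queue name "engineering" is hypothetical and must already be defined in the cluster's Capacity Scheduler configuration, and the class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSelectionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Send this job to the Engineering team's queue instead of the default queue.
        conf.set("mapreduce.job.queuename", "engineering");
        Job job = Job.getInstance(conf, "engineering-report");
        // ... mapper, reducer, and input/output paths would be configured here ...
    }
}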
Shuffle and Sort
Hadoop guarantees that the input to each Reducer is sorted by key. The
process of sorting map outputs and transferring them to reducers is called
Shuffle.
When a MapReduce job runs, the Mapper produces output (key-value pairs).
Before the data reaches the Reducers, Hadoop automatically sorts it by key.
This internal process is known as Shuffle and Sort.
1️ Sorting in Mappers
Each Mapper buffers its output in memory, partitions it by Reducer, and sorts it
by key before writing it to local disk.
2️ Shuffling to Reducers
The sorted data from Mappers is sent to Reducers through the network.
This process happens as soon as each Mapper finishes, to avoid network
congestion.
All data with the same key goes to the same Reducer (see the partitioner sketch after this list).
3️ Sorting in Reducers
Before processing, the Reducer merges the sorted data received from all Mappers so that each key's values are grouped together.
The final sorted data is written to HDFS or another storage system.
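The routing described in step 2 (all values for a key going to the same Reducer) is decided by a partitioner. Below is a minimal partitioner sketch that mirrors Hadoop's default hash partitioning; the class name and the key/value types are illustrative and match the word-count example above.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Decides which Reducer receives each key. Because the result depends only on
// the key, every occurrence of the same key lands in the same partition.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class).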
🔹 Using a Combiner
A Combiner can run on each Mapper's output to perform local aggregation,
which reduces the amount of data transferred during the shuffle.
If needed, users can further customize the shuffle and sort mechanism, for
example with a custom partitioner, combiner, or comparator.
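A sketch of wiring a combiner into the driver, reusing the word-count classes shown earlier; the reducer can double as the combiner because summing counts is associative (class and job names are illustrative).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount-with-combiner");
        job.setJarByClass(CombinerExample.class);
        job.setMapperClass(WordCountMapper.class);
        // Pre-sums counts on the map side, shrinking the data shuffled to reducers.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setPartitionerClass(WordPartitioner.class);  // optional custom routing
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // ... input/output paths would be set here before submitting the job ...
    }
}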
Hadoop splits a big job into smaller tasks and runs them in parallel to finish
the job faster.
However, sometimes one task runs much slower than the others. This slow
task is called a straggler. Hadoop handles stragglers with speculative
execution: it launches a duplicate copy of the slow task on another node and
uses the output of whichever copy finishes first.
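Speculative execution can be turned on or off per job through two configuration properties; a small sketch follows (both properties are enabled by default, and the class name is illustrative).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeExecutionExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Allow backup copies of slow map and reduce tasks to be launched.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);
        Job job = Job.getInstance(conf, "speculative-execution-example");
        // ... mapper, reducer, and input/output paths would be configured here ...
    }
}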
8) MapReduce Types
1️ Map Function: Takes an input key-value pair and produces a list of new
key-value pairs: map: (K1, V1) → list(K2, V2).
2️ Reduce Function: Takes each key produced by the map function together
with all of its values and processes them further: reduce: (K2, list(V2)) →
list(K3, V3).
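In the Hadoop Java API these forms appear directly as the generic type parameters of the Mapper and Reducer classes; here is a type-level sketch using the word-count types as an example (the wrapper and inner class names are illustrative).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MapReduceTypesSketch {
    // Mapper<K1, V1, K2, V2>: map(K1, V1) -> list(K2, V2)
    // Here: K1 = byte offset, V1 = line of text, K2 = word, V2 = count of 1.
    static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

    // Reducer<K2, V2, K3, V3>: reduce(K2, list(V2)) -> list(K3, V3)
    // The reducer's input types must match the mapper's output types.
    static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }
}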
1. Input Formats
Hadoop can process different types of data, including text files, databases, and
binary files.
Example:
If a file has 100MB of data and the block size is 64MB, Hadoop will split it into
two parts (one split of 64MB and one of 36MB).
FileInputFormat
FileInputFormat is the base class for all file-based input formats; it defines
which files are included as input and how they are divided into splits.
Hadoop works better with fewer large files than many small files.
CombineFileInputFormat is used to combine multiple small files into larger splits,
reducing overhead.
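A minimal driver sketch showing where the input format is chosen. CombineTextInputFormat is the text-file variant of CombineFileInputFormat; the paths, the split size, and the class names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Pack many small files into fewer, larger splits to cut mapper overhead.
        job.setInputFormatClass(CombineTextInputFormat.class);
        FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024);  // 128 MB per split

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Block until the job finishes; "true" prints progress to the console.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}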
TextInputFormat (the default)
Example:
0 This is line 1
15 This is line 2
30 This is line 3
Here, the keys (0, 15, 30) are the byte offsets at which each line starts, and the
values are the text lines themselves.
A file is split into logical records (lines), but these don’t always align with HDFS
blocks.
Split boundaries are adjusted by the record reader so that a full line is always
processed by a single mapper, even if the line spans two blocks.
SequenceFileInputFormat – reads Hadoop's binary SequenceFile format, which
stores sequences of key-value pairs.
SequenceFileAsBinaryInputFormat – a variant of SequenceFileInputFormat that
delivers the keys and values as raw binary objects.