Big Data Unit 3

MapReduce is a Java-based framework for processing large data sets in a distributed manner, consisting of two main steps: Map and Reduce. It simplifies big data handling by enabling parallel processing and fault tolerance across multiple computers. YARN enhances resource management for various data processing applications, improving efficiency and flexibility in Hadoop clusters.

1) MapReduce

MapReduce is a Java-based framework used for processing large amounts of data in a distributed manner. It is part of the Apache Hadoop ecosystem and helps developers handle big data easily by breaking it down into two main steps:

1. Map – Data is divided and processed in parallel.
2. Reduce – The processed data is combined to generate the final output.

MapReduce simplifies distributed computing by handling data across multiple computers in a fault-tolerant way.
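
As a concrete illustration of the two steps, here is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API. The word-count use case, the class names, and the tokenizing logic are illustrative additions, not something taken from these notes.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map step: read a line of text and emit (word, 1) for every word in it.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // intermediate key-value pair
            }
        }
    }
}

// Reduce step: receive (word, [1, 1, ...]) and emit (word, total count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));   // final output pair
    }
}

The mapper carries out the Map step by emitting (word, 1) pairs; the reducer carries out the Reduce step by summing the counts for each word.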

Key Features of MapReduce

• Handles massive data (Petabytes or Exabytes).
• Works on a write-once, read-many model.
• Simple operations: only the Map and Reduce functions are required.
• Parallel processing: Map tasks run in parallel, and all Map tasks finish before Reduce tasks start.
• Optimized execution: map tasks usually run on the machines that already store their input data (data locality).
• Flexible configuration: the number of map and reduce tasks can be adjusted.

MapReduce Workflow

How MapReduce Works

1. Input Data: The data to be processed is stored in the Hadoop Distributed File
System (HDFS).
2. Splitting: Hadoop divides the data into smaller chunks called splits.
3. Mapping: The map() function processes each split separately on different machines
(mappers).
4. Combining (Optional): The output from mappers can be combined before moving to
the next step to reduce data transfer.
5. Shuffling & Sorting: The mapped data is shuffled and sorted based on keys.
6. Reducing: The reduce() function aggregates and processes the mapped data.
7. Final Output: The processed data is saved in HDFS.
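
These steps are wired together by a small driver program. The sketch below is a hypothetical driver that reuses the word-count mapper and reducer classes sketched earlier; the input and output paths come from command-line arguments, and the comments map each call to the numbered steps above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");          // job name is illustrative
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);              // step 3: mapping
        job.setCombinerClass(WordCountReducer.class);           // step 4: optional combining
        job.setReducerClass(WordCountReducer.class);            // step 6: reducing
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // step 1: input stored in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // step 7: final output location
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // steps 2 and 5 happen inside the framework
    }
}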

Steps in Detail:

1. Mappers: Process data in small parts, working on different machines.
2. Reducers: Combine mapper outputs to get the final result.
3. Data locality: MapReduce runs tasks where data is stored, reducing unnecessary data transfer.

Intermediate files (between mappers and reducers) are stored locally instead of
in HDFS to improve speed.
Data Flow in MapReduce

MapReduce is a programming model designed to handle large-scale data processing in a distributed way. It consists of two main functions:

1. Map Function – This function processes input data and transforms it into key-value
pairs.
2. Reduce Function – This function takes all values associated with the same key and
aggregates them to produce the final result.

The MapReduce process works as follows:

• The input data is split into smaller chunks. Each chunk is assigned to a “mapper,” which processes it in parallel with others.
• The mapper generates key-value pairs based on the data.
• These key-value pairs are then sorted and grouped by key before moving to the “reducer.”
• The reducer takes each key and combines all its values to generate the final output.
• The final output is stored in a distributed file system for further use.
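
A tiny worked trace, using a made-up two-line word-count input, shows how the key-value pairs move through these stages:

Input splits:     "cat dog"          "dog dog"
Map output:       (cat,1) (dog,1)    (dog,1) (dog,1)
Shuffle & group:  cat → [1]          dog → [1, 1, 1]
Reduce output:    (cat,1) (dog,3)
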
Components of MapReduce

• JobTracker (Master Node):
o Assigns and monitors tasks.
o Handles failures and reschedules tasks if needed.
• TaskTracker (Worker Nodes):
o Runs map and reduce tasks.
o Sends progress updates to JobTracker.
o Manages task execution in separate environments to prevent failures from affecting the system.

Limitations of MapReduce

1. Cannot control task execution order.
2. Requires independent processing (stateless operations).
3. Databases with indexes are faster than MapReduce in some cases.
4. Reduce tasks start only after all Map tasks finish.
5. Assumes Reduce output is smaller than Map input.

2) Anatomy of a MapReduce Job Run

A MapReduce job is a process that runs on a cluster to process large amounts of data. The process involves multiple components working together to manage and execute the job.

Main Components of a MapReduce Job Run

1. Client – Submits the job for execution.
2. YARN Resource Manager – Allocates resources (CPU, memory, etc.) for the job.
3. YARN Node Managers – Monitor and manage tasks running on different machines.
4. MapReduce Application Master – Controls the execution of the MapReduce tasks.
5. Distributed File System (HDFS) – Stores job files and shares them among different components.

Job Submission Process

• The client submits the job using JobClient.runJob(conf).
• The system assigns a Job ID and checks if the output directory is valid.
• The input data is split into smaller parts for processing.
• Necessary resources (JAR file, configuration file, input splits) are copied to HDFS.
• The job is submitted to the Resource Manager for execution.

Job Initialization

• The Resource Manager assigns a container to run the Application Master.
• The Application Master manages job execution and tracks progress.
• It reads the input splits and creates map tasks (one per input split).
• It also creates reduce tasks, based on the number set in the configuration.

Task Assignment

1. Map Tasks:
o Assigned first, since they need to finish before reduce tasks can start.
o The system requests containers for mappers from the Resource Manager.
o Once 5% of the map tasks are completed, reduce tasks are requested.

Task Execution

• A container is assigned for each task.
• The Node Manager starts the container, which runs the YarnChild process.
• The map or reduce function executes and processes data.

Streaming Mode (For Custom Code Execution)

• Instead of writing Java code, users can supply external scripts/programs.
• The system runs these scripts, using standard input/output streams to exchange data.

Tracking Progress & Status Updates

• Since jobs can run for a long time, the system provides real-time progress updates.
• For map tasks, progress is the proportion of the input data processed so far.
• For reduce tasks, progress is estimated across the copy, sort, and reduce phases of the task.
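
On the client side these progress values can be read from the job handle. A rough sketch, assuming the job object was submitted with job.submit() rather than waitForCompletion():

import org.apache.hadoop.mapreduce.Job;

// Sketch: print map/reduce progress for a job that has already been submitted.
public class ProgressWatcher {
    public static void watch(Job job) throws Exception {
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);   // poll every five seconds
        }
        System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
    }
}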

Job Completion

• When all tasks are done, the Application Master marks the job as Successful.
• The system notifies the user and cleans up temporary data.
• The job details are saved in the Job History Server for future reference.

3) YARN (Yet Another Resource Negotiator)

YARN is a resource management system in Hadoop that helps in running different types of big data applications, not just MapReduce. It acts like an operating system for managing computing resources in a Hadoop cluster.

Main Responsibilities of YARN

1. Manages Cluster Resources – Allocates CPU, memory, network, and storage for jobs.
2. Schedules and Monitors Jobs – Decides which job runs where and ensures it
completes successfully.

YARN allows different types of data processing like:

✔ Batch processing (traditional MapReduce)
✔ Stream processing (real-time data analysis)
✔ Graph processing (complex network relationships)
✔ Interactive processing (quick queries on big data)

Why is YARN Used?

✅ Better Resource Utilization – Dynamically allocates resources, improving efficiency.
✅ Supports Multiple Processing Methods – Can handle batch, streaming, and interactive jobs together.
✅ More Flexibility – Works with various applications beyond MapReduce.

YARN Architecture (Main Components)

1. Resource Manager (Master Node) – Controls resource allocation for the entire
cluster.
2. Node Manager (Worker Nodes) – Runs tasks on each machine, monitors usage, and
reports to the Resource Manager.
3. Application Master – Manages each job’s execution, requests resources, and tracks
progress.
4. Container – A unit of allocated resources (CPU, memory, etc.) for running tasks.

The Resource Manager is the main decision-maker, while the Node Manager
handles execution on individual machines.

How a Job Runs in YARN (Workflow)

1. Client submits a job.
2. Resource Manager allocates a container to start the Application Master.
3. Application Master registers with the Resource Manager.
4. Application Master requests containers from the Resource Manager for tasks.
5. Node Manager launches containers to execute the job.
6. Job runs inside containers and processes data.
7. Client monitors the job status.
8. Once complete, the Application Master unregisters.
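
Steps 1 and 2 can be seen from the client side through the YarnClient API. The sketch below only illustrates the submission flow: the application name and resource sizes are placeholders, and the empty ApplicationMaster container spec would need real launch commands, jars, and environment settings before it could run anything useful.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: the client talks to the Resource Manager through YarnClient.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new Configuration());
        yarnClient.start();

        // Ask the Resource Manager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");                          // placeholder name

        // Step 2: describe the container that should run the Application Master.
        // A real AM needs launch commands, jars and environment; left empty here.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(1024, 1));              // 1 GB, 1 vcore (example)

        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);
    }
}
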
Why is YARN Popular?

✔ Highly Scalable – Can manage thousands of nodes efficiently.
✔ Backward Compatible – Existing MapReduce applications keep working without changes.
✔ Supports Multi-Tenancy – Can run multiple applications simultaneously.

Advantages & Disadvantages of YARN

✅ Advantages:
✔ Scalability – Can handle a large number of nodes.
✔ Better Utilization – Manages resources dynamically instead of using fixed slots.
✔ Supports Multiple Versions – Different versions of MapReduce can run together.

❌ Disadvantage:
✖ Single Point of Failure – In Hadoop 1.0, the JobTracker was a weak point, but YARN improves this.

4) Failures in Classic MapReduce and YARN

Failures can happen in both Classic MapReduce and YARN.

1. Failures in Classic MapReduce

MapReduce can have three types of failures:

1. Task failure
2. TaskTracker failure
3. JobTracker failure

1. Task Failure

🔹 A task can fail for two main reasons:

✅ User Code Error – If a map or reduce task has a bug, it may crash, and the system marks it as failed.
✅ Streaming Process Failure – If a streaming task exits with a nonzero code, it’s considered failed.

➡️ The TaskTracker detects failures and frees up space to run a new task.

2. TaskTracker Failure

🔹 If a TaskTracker crashes or runs very slowly, it stops sending heartbeats (signals) to the JobTracker.

✅ The JobTracker notices this and removes it from the cluster.
✅ Any completed map tasks from this tracker are rerun, so the data is available for reducers.
✅ If too many tasks fail on a single TaskTracker, it is blacklisted and no longer used.

3. JobTracker Failure (Biggest Problem!)

🔹 The JobTracker is a single point of failure – if it crashes, everything stops working.
🔹 Hadoop has no built-in way to handle this failure.

✅ The only solution is to restart the JobTracker and resubmit all running
jobs.
✅ This problem is why YARN was created!

2. Failures in YARN

YARN is better at handling failures than Classic MapReduce. It has three main
types of failures:
1. Task Failure
2. Node Manager Failure
3. Resource Manager Failure

1. Task Failure (Similar to Classic MapReduce)

🔹 If a task crashes due to runtime errors, the Application Master detects it and marks it as failed.
🔹 YARN then retries the task on another available node.
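
How many times a failed task is retried before the whole job is declared failed is configurable. A hedged sketch using the Hadoop 2.x property names (the default is typically 4 attempts; check mapred-default.xml for your release):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", 6);     // map task retry limit (example value)
        conf.setInt("mapreduce.reduce.maxattempts", 6);  // reduce task retry limit (example value)
        Job job = Job.getInstance(conf, "retry-demo");   // job name is illustrative
        // ... set mapper/reducer/paths as usual, then submit.
    }
}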

2. Node Manager Failure

🔹 If a Node Manager (worker node) crashes, it stops sending heartbeats to the Resource Manager.
🔹 The Resource Manager removes it from the cluster.

✅ Any running tasks or Application Masters on that node are recovered using
built-in mechanisms.

3. Resource Manager Failure (Most Critical!)

🔹 The Resource Manager controls everything in YARN, so if it fails, no jobs or tasks can start.

✅ YARN was designed to recover from crashes by saving its state in persistent storage (checkpointing).
✅ Early releases did not fully support automatic recovery; newer versions add Resource Manager high availability with automatic failover.

5) Job Scheduling in Hadoop

Hadoop uses schedulers to manage jobs and ensure efficient resource utilization in a cluster. There are three main job schedulers:

1. FIFO Scheduler
2. Fair Scheduler
3. Capacity Scheduler

Each scheduler has different ways of handling tasks, and each has its own
advantages and disadvantages.
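
Which of these schedulers a cluster uses is selected on the Resource Manager, normally in yarn-site.xml. A hedged example that picks the Fair Scheduler (the property name and class are standard in Hadoop 2.x; the Capacity Scheduler class is org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler; verify both against your distribution):

<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
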
Challenges in Job Scheduling

1. Energy Efficiency – Running large-scale jobs consumes a lot of energy in data centers, increasing costs. Reducing energy use is a big challenge.

2. Load Balancing – If some data blocks are much bigger than others, some nodes do more work, leading to imbalance. Hadoop’s partitioning algorithm tries to distribute data equally, but uneven key distribution can cause issues.

3. Mapping Scheme – A good mapping system is needed to reduce communication costs between nodes.

4. Automation & Configuration – Setting up a Hadoop cluster requires proper hardware and software configuration. Small mistakes can lead to inefficient job execution.

5. Fairness – The scheduler should distribute resources equally among users.

6. Data Locality – The closer the computation is to the data, the faster the processing.

7. Synchronization – The reduce phase needs intermediate data from the map phase. Ensuring smooth transfer is critical for performance.

1. FIFO Scheduler (First In, First Out)

🔹 How It Works?

• This is Hadoop’s default scheduler.
• Jobs are queued in order of arrival, and the first job submitted gets executed first.
• The next job starts only when the previous one is completed.
• No priority system – all jobs are treated equally, regardless of their size or importance.

🔹 Example:
Imagine you are in a ticket queue. The person who arrives first gets served
first, and the others have to wait in line.

🔹 Advantages of FIFO Scheduler:
✔️ Simple and easy to understand; doesn’t require extra configuration.
✔️ Jobs are executed in order, ensuring predictability.

🔹 Disadvantages of FIFO Scheduler:
❌ Not suitable for shared clusters – one big job can block smaller jobs.
❌ Doesn’t consider job size, so small jobs can get delayed behind long jobs.

2. Fair Scheduler (Developed by Facebook)

🔹 How It Works?

• This scheduler divides cluster resources fairly among users and jobs.
• If only one job is running, it gets all the resources.
• As more jobs arrive, resources are evenly distributed.
• Jobs are placed into pools (groups) based on user-defined settings, such as user name.

🔹 Key Features:

• Each user gets a minimum share of cluster resources.
• Unused resources from one pool can be used by others.
• If one user submits too many jobs, the scheduler limits their execution to prevent overload.

🔹 Example:
Imagine you are at a buffet. If you're the only person there, you can take as
much food as you want. But as more people arrive, food is shared equally
among all.

🔹 Advantages of Fair Scheduler:
✔️ Fair and dynamic resource allocation – ensures no one user monopolizes the system.
✔️ Fast response for small jobs – doesn’t let large jobs delay them.
✔️ Can limit the number of jobs per user or pool to ensure fairness.

🔹 Disadvantages of Fair Scheduler:
❌ More complex configuration compared to FIFO.
❌ Doesn’t consider job weight, leading to possible uneven performance across pools.
❌ Each pool has a job limit, which may restrict performance.

3. Capacity Scheduler (Developed by Yahoo)

🔹 How It Works?

• Designed for large organizations where multiple teams share a cluster.
• Uses queues, with each queue assigned to a different team or organization.
• Unused resources in one queue can be used by others, ensuring efficiency.
• Supports priority-based scheduling within each queue.

🔹 Key Features:

• Guarantees a minimum capacity for each queue.
• Uses security mechanisms to ensure each team can access only their own queue.
• Supports hierarchical queues – can have sub-queues within a main queue.

🔹 Example:
Imagine a company with three teams: Engineering, Data Science, and
Marketing. Each team gets its own queue to ensure fair resource allocation. If
Marketing is not using its resources, Engineering can temporarily use them.

🔹 Advantages of Capacity Scheduler:
✔️ Maximizes resource utilization and ensures high throughput.
✔️ Allows unused resources to be reallocated dynamically.
✔️ Supports hierarchical queues, making it flexible for large organizations.
✔️ Can control memory allocation based on available hardware.

🔹 Disadvantages of Capacity Scheduler:
❌ Most complex scheduler – requires careful configuration.
❌ Choosing the right queue setup can be challenging.
❌ May struggle with ensuring fairness when many jobs are waiting.

6) Shuffle and Sort in Hadoop

Hadoop guarantees that the input to each Reducer is sorted by key. The
process of sorting map outputs and transferring them to reducers is called
Shuffle.

When a MapReduce job runs, the Mapper produces output (key-value pairs).
Before the data reaches the Reducers, Hadoop automatically sorts it by key.
This internal process is known as Shuffle and Sort.

How Shuffle and Sort Works?

1. Sorting in Mappers

• Mappers process data and generate key-value pairs as output.
• The output is sorted by key and stored in buffers in memory.
• If the buffers get full, the data is written to disk to prevent memory overload.

2. Shuffling to Reducers

• The sorted data from Mappers is sent to Reducers through the network.
• This happens as soon as each Mapper finishes, to avoid network congestion.
• All data with the same key goes to the same Reducer.

3. Sorting in Reducers

• Before processing, the Reducer sorts the received data again to maintain order.
• The final sorted data is written to HDFS or another storage system.

How to Reduce Network Load?

🔹 Using a Combiner

• A Combiner is like a mini-Reducer that runs on the Mapper’s side.
• It pre-processes data before sending it to Reducers, reducing the amount of data transferred over the network (see the sketch below).
• However, Hadoop decides when and how many times to use the Combiner – users cannot control this.
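
Enabling a combiner is a single driver call. A minimal sketch, reusing the word-count reducer class from the earlier sketch (safe to use as a combiner because summing partial counts is associative and commutative):

import org.apache.hadoop.mapreduce.Job;

public class CombinerSketch {
    // The reducer can double as a combiner because summing partial counts per word
    // gives the same result no matter how often (or whether) Hadoop runs it.
    static void enableCombiner(Job job) {
        job.setCombinerClass(WordCountReducer.class);   // reducer class from the word-count sketch above
    }
}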

Hadoop’s Default Shuffle and Sort Mechanism

By default, Hadoop uses:

✔ Sorting of keys by their natural order (lexicographic for text keys).
✔ Hash-based shuffling (the default HashPartitioner) for distributing data to reducers.

If needed, users can customize the shuffle and sort mechanism by modifying:

1. Partitioner – Controls how data is divided among Reducers (an example is sketched below).
2. RawComparator (Mapper side) – Handles sorting on the Mapper side.
3. RawComparator (Reducer side) – Manages grouping of data in Reducers.
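
For example, a custom Partitioner only has to decide which reducer each key goes to. The class below is a hypothetical sketch (the rule "keys starting with a to m go to reducer 0" is invented purely for illustration); it would be registered with job.setPartitionerClass(...), while custom sort and grouping comparators go through job.setSortComparatorClass(...) and job.setGroupingComparatorClass(...).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: keys starting with a–m go to reducer 0,
// all other keys are spread over the remaining reducers by hash.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (numReduceTasks <= 1) {
            return 0;
        }
        String s = key.toString();
        char first = s.isEmpty() ? 'z' : Character.toLowerCase(s.charAt(0));
        if (first >= 'a' && first <= 'm') {
            return 0;
        }
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numReduceTasks - 1);
    }
}
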
Steps in the Shuffle and Sort Phase

1. Partitioning: Data is divided among Reducers based on partition rules.
2. Sorting: Data is sorted by keys within each partition.
3. Temporary Files: Sorted output from Mappers is saved as temporary files.
4. Merging Files: When the Map task finishes, all temporary files are merged into a single file.
5. Shuffling: Data from each partition (from all Mappers) is sent to the assigned Reducer.
6. Memory Management: If data exceeds memory, it is stored on disk to prevent crashes.
7. Final Sorting: Before processing, Reducers merge and sort data again to maintain order.

7) Speculative Execution in Hadoop (Task execution)

Hadoop splits a big job into smaller tasks and runs them in parallel to finish
the job faster.

However, sometimes one task runs much slower than the others. This slow
task is called a straggler.

To prevent delays, Hadoop uses speculative execution: it creates a duplicate copy of the slow task and runs it on a different node.

How Speculative Execution Works?

1. Detecting a Slow Task (Straggler)

• Hadoop monitors task progress using a progress score (0 to 1).
• If a task is much slower than average and has run for at least 1 minute, it is marked as a straggler.

2. Creating a Duplicate Task

• Hadoop starts another copy of the slow task on a different node.
• The first task to finish (original or duplicate) is accepted, and the other is stopped (killed).
• This ensures that slow tasks do not delay the entire job.

3. Where is Speculative Execution Enabled?

• It is turned on by default in Hadoop.
• It can be enabled or disabled separately for Map tasks and Reduce tasks.
• Settings for speculative execution are found in mapred-site.xml, or can be set per job (see the sketch below).
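
A hedged per-job sketch of these settings, using the Hadoop 2.x property names (older releases used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculationConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hadoop 2.x property names; check mapred-default.xml for your release.
        conf.setBoolean("mapreduce.map.speculative", true);     // duplicate slow map tasks
        conf.setBoolean("mapreduce.reduce.speculative", false); // but not reduce tasks
        Job job = Job.getInstance(conf, "speculation-demo");    // job name is illustrative
        // ... configure mapper/reducer/paths as usual, then submit.
    }
}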

Advantages of Speculative Execution

✔ Prevents slow tasks from delaying the whole job.
✔ Improves overall job execution time.
✔ Mitigates the effect of hardware or network problems, which are common in large clusters.
✔ Ensures better resource utilization.

8) MapReduce Types

Hadoop MapReduce processes data using two main functions:

1. Map Function: Takes input key-value pairs and produces a list of new key-value pairs.

• Example: map(K1, V1) → list(K2, V2)
• The input key and value types (K1, V1) are usually different from the output key and value types (K2, V2).

2. Reduce Function: Takes the output of the map function and processes it further.

• Example: reduce(K2, list(V2)) → list(K3, V3)
• The input types of reduce (K2, V2) are the same as the output types of map, but the final output (K3, V3) may be different.
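
In the Java API these type relationships show up directly in the class signatures. A sketch with one common but purely illustrative choice of concrete types:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map(K1, V1) -> list(K2, V2)  corresponds to  Mapper<K1, V1, K2, V2>
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // map() would emit (K2, V2) pairs here
}

// reduce(K2, list(V2)) -> list(K3, V3)  corresponds to  Reducer<K2, V2, K3, V3>;
// the first two type parameters must match the mapper's output types.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // reduce() would emit (K3, V3) pairs here
}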

1. Input Formats

Hadoop can process different types of data, including text files, databases, and
binary files.

What is an Input Split?

• Input splits are chunks of data processed by each mapper.
• A split is further divided into records, which are processed as key-value pairs.
• Input splits are logical and do not need to be tied to files (e.g., they can be a range of rows from a database).

Example:

• If a file has 100 MB of data and the block size is 64 MB, Hadoop will split it into two parts.

FileInputFormat

• Base class for input formats that process files.
• It determines which files are included as input and creates splits for them.
• Subclasses further break these splits into records.

Handling Small Files

• Hadoop works better with fewer large files than many small files.
• CombineFileInputFormat is used to combine multiple small files into larger splits, reducing overhead (see the sketch below).
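
A hedged sketch of doing this with CombineTextInputFormat (the 128 MB cap on each combined split is an arbitrary example value):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesSketch {
    // Sketch: pack many small input files into fewer, larger splits.
    static void useCombinedSplits(Job job) {
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at roughly 128 MB (example value only).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}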

2. Text Input Format


TextInputFormat (Default Format)

• Each line of a file is a record.
• The key is the position of the line in the file (byte offset).
• The value is the actual content of the line.

Example:
0 This is line 1
15 This is line 2
30 This is line 3

• Here, the keys (0, 15, 30) are the byte offsets of the lines, and the values are the text lines.

Splitting and HDFS Blocks

• A file is split into logical records (lines), but these don’t always align with HDFS blocks.
• Splits honor logical records, ensuring a full line is always included, even if it spans multiple blocks.

3. Binary Input Format

Hadoop can also process binary data.

SequenceFileInputFormat

• Stores binary key-value pairs in a format optimized for Hadoop.
• Splittable and supports compression.

SequenceFileAsBinaryInputFormat

• Reads sequence files as raw binary objects.
• Data is stored as BytesWritable objects, which the application can interpret as needed.
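
A short sketch of consuming such a file (the Text/IntWritable key and value types are just an example; they must match whatever types the sequence file was actually written with):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class SequenceInputSketch {
    // Mapper consuming binary (Text, IntWritable) records from a sequence file
    // and passing them through unchanged.
    public static class SeqMapper extends Mapper<Text, IntWritable, Text, IntWritable> { }

    static void configure(Job job) {
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(SeqMapper.class);
    }
}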
