
UNIT 4

MAP REDUCE:
MapReduce is a programming model for processing large datasets by dividing the data into
smaller chunks and processing them in parallel. It consists of two main steps: Map and
Reduce. In the context of R and data science, you can implement the MapReduce concept to
handle tasks like distributed data processing across a large dataset. Here's how the Map and
Reduce phases work:

1. Map Phase: In the Map phase, the input dataset is divided into smaller subsets (chunks). A
mapping function is applied to each subset, and intermediate key-value pairs are generated.

2. Reduce Phase: The Reduce phase takes the intermediate key-value pairs and aggregates
them (usually combining values with the same key) to produce a final result.

Example in R: Let’s consider an example where you have a large text dataset, and you want
to count the occurrence of each word (a typical MapReduce example).

# Sample text data

text_data <- c("data science is amazing",
               "data science involves statistics",
               "R is great for data analysis")

# Map function: split each line into words and return a per-line table of (word, count) pairs

map_function <- function(lines) {
  words <- unlist(strsplit(lines, " "))  # Split each line into words
  table(words)                           # Count each word's occurrences in the line
}

# Applying the Map phase

mapped_results <- lapply(text_data, map_function)

# Reduce function: This function combines the results from the map step by
# merging the word counts across all chunks

reduce_function <- function(mapped_data) {
  # Expand each per-chunk table back into its words, then re-count, so that
  # counts are merged by word name rather than by vector position
  all_words <- unlist(lapply(mapped_data, function(tab) rep(names(tab), tab)))
  table(all_words)
}

# Applying the Reduce phase

final_result <- reduce_function(mapped_results)

# View the final word count

print(final_result)

Explanation of Code:

Map Phase: The map_function takes each line of text, splits it into individual words, and creates a per-line word count using table(). The lapply() function applies map_function to each line of the text.

Reduce Phase: The reduce_function takes the per-line counts from the map phase and merges them into a single table, summing the counts for each word across all lines.

Output:

# Final word count (example output)
data science is amazing involves statistics R great for analysis
   3       2  2       1        1          1 1     1   1        1

Concept Summary: MapReduce allows you to process large datasets in parallel by breaking
them into smaller chunks and processing them independently. In R, you can simulate the
MapReduce process using functions like lapply() for the Map phase and Reduce() for the
Reduce phase. This process is highly useful for distributed data processing, especially for
tasks like word count, summarizing large datasets, or parallelizing complex operations. In
larger environments, MapReduce would be implemented using a distributed system like
Hadoop, but this example in R illustrates the concept on a small scale.

Advantages of MapReduce:

 Scalability: MapReduce scales efficiently by distributing tasks across a large number of nodes. It can process large datasets across multiple machines in a cluster, making it ideal for big data scenarios.
 Fault Tolerance: MapReduce handles failures efficiently. If a node crashes or a task
fails, the framework can reassign the task to another node. This ensures that the
overall process continues without data loss.
 Simplicity: MapReduce abstracts the complexity of parallel and distributed
computing. Programmers can focus on writing the map and reduce functions without
worrying about the underlying architecture.
 Data Locality: Hadoop (which implements MapReduce) moves the computation to
where the data resides (data locality). This reduces data transfer costs and improves
performance.
 Parallel Processing: MapReduce splits tasks into independent units that can be
processed in parallel. This results in faster execution times compared to sequential
processing, especially when handling large datasets.
 Flexibility: MapReduce can be used for a wide variety of tasks, including filtering,
sorting, summarizing, and performing complex transformations. It's flexible enough
to be used in many domains.

Disadvantages of MapReduce:

 Latency: MapReduce can have high latency, especially for real-time or low-latency
applications. It processes data in batches, which can be slow for tasks that require
quick responses or iterative processing.
 Not Suitable for All Use Cases: While MapReduce works well for problems that can be expressed as key-value transformations, it is not always the best choice for iterative algorithms (e.g., machine learning, graph processing). Frameworks like Apache Spark or Apache Flink are better suited to such use cases.
 Complexity in Development: Although MapReduce simplifies distributed computing, writing efficient MapReduce code for complex tasks can still be difficult. Debugging, optimizing, and testing MapReduce jobs can be more complex than in non-distributed systems.
 I/O Bottleneck: MapReduce jobs often involve multiple rounds of writing
intermediate results to disk (e.g., after the map phase). This disk I/O can be a
bottleneck, reducing performance.
 Limited Expressiveness: The MapReduce model is relatively simple and rigid. It
doesn’t support more complex data flows and operations natively (e.g., joins, matrix
operations). This can limit its expressiveness for certain types of computations.
 Resource Intensive: MapReduce frameworks like Hadoop may require substantial
resources, both in terms of hardware and management. Running and maintaining large
clusters adds operational complexity and costs.

MAP REDUCE FUNDAMENTALS:


MapReduce is a programming model originally designed for processing large datasets in a distributed computing environment. While R is not typically associated with MapReduce, packages such as RHadoop and rmr2 bring MapReduce functionality to R, allowing users to process data over a distributed cluster.

Fundamentals of MapReduce: MapReduce consists of two primary stages:

 Map Stage: In this stage, the data is divided into chunks and passed to the map
function, which processes the data. The map function usually performs filtering,
transforming, or any other per-record operation and emits key-value pairs as output.
 Reduce Stage: In this stage, the key-value pairs emitted by the map function are
grouped by keys. The reduce function aggregates or reduces the grouped key-value
pairs to produce a final output.
 Workflow (a small sketch of the shuffle step follows this list):
 Splitting: The input data is split into manageable pieces.
 Mapping: Each chunk of data is processed in parallel by the map function, which transforms the data into key-value pairs.
 Shuffling: The intermediate key-value pairs are grouped by key (shuffled and sorted) and sent to the reduce phase.
 Reducing: The reduce function aggregates the results for each key, producing the final output.
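To illustrate the shuffle step, which the base-R examples elsewhere in this unit gloss over, here is a small sketch of grouping intermediate key-value pairs by key; the data and variable names are invented for illustration:

# Intermediate key-value pairs, as a map phase might emit them
pairs <- list(
  list(key = "data", value = 1),
  list(key = "science", value = 1),
  list(key = "data", value = 1)
)

keys <- vapply(pairs, function(p) p$key, character(1))
values <- vapply(pairs, function(p) p$value, numeric(1))

# Shuffle: group values by key; Reduce: sum each group
shuffled <- split(values, keys)
reduced <- vapply(shuffled, sum, numeric(1))
print(reduced)  # data: 2, science: 1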

 Example of MapReduce in R using rmr2: Here's a simple example that demonstrates a word count, one of the canonical MapReduce examples:

# Load necessary library

library(rmr2)

# Define the map function

map <- function(k, lines) {
  # Split each line into words and emit (word, 1)
  keyval(unlist(strsplit(lines, " ")), 1)
}

# Define the reduce function

reduce <- function(word, counts) {
  # Sum the counts for each word
  keyval(word, sum(counts))
}

# Input file

input <- "/path/to/input.txt"

output <- "/path/to/output"

# Running the MapReduce job

mapreduce(input = input,
          output = output,
          input.format = "text",
          map = map,
          reduce = reduce)

Steps Explained:

Map: The map function splits each line into words and emits each word along with a value of 1.

Reduce: The reduce function groups all the instances of each word and sums the counts to find the total occurrences of each word.
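After the job finishes, the results stored at the output path can be read back into R. A brief sketch, assuming a configured Hadoop/RHadoop environment (from.dfs(), keys(), and values() are rmr2 functions; the paths above are placeholders):

# Read the job's key-value output back into R

result <- from.dfs(output)

word_counts <- data.frame(word = keys(result), count = values(result))
print(head(word_counts))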

Key R Packages for MapReduce:
1. rmr2: An R package that allows you to implement MapReduce jobs in R.
2. RHadoop: A collection of R packages that enable you to use Hadoop's MapReduce framework with R.

By leveraging the power of MapReduce, large datasets can be processed efficiently using distributed systems such as Hadoop, even with R.

TRACING THE ORIGINS OF MAPREDUCE:


MapReduce originated at Google in the early 2000s as a solution for processing and
generating large datasets in a distributed and parallel manner. Its creation was motivated by
the need to handle enormous data volumes (such as web indexing) that could not be
efficiently processed by traditional systems. Here’s a timeline of its origins and evolution:
Early Foundations:

 Distributed Systems and Parallel Computing (1970s-1990s): Prior to MapReduce, concepts of distributed computing were explored in academic and industrial settings. Systems such as Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) provided tools for parallel computing but required significant effort from developers to handle low-level aspects like fault tolerance, data distribution, and load balancing. These early systems laid the groundwork for high-performance computing but lacked a unified, abstract model like MapReduce.
 Google's Challenge (Late 1990s): By the late 1990s, Google was facing massive data-processing challenges. Web indexing, searching, and ranking required processing enormous amounts of web data. Earlier solutions, such as the Google File System (GFS), addressed data storage but did not fully address data processing at scale. Existing parallel computing tools at the time did not easily scale to the web-level data that Google was handling.
 Creation of MapReduce (2003-2004): Jeffrey Dean and Sanjay Ghemawat, two
engineers at Google, developed the MapReduce programming model to address this
problem. Their goal was to create an abstraction that would allow developers to easily
process large datasets in parallel without needing to manage the complexity of
distributed systems. In 2004, Dean and Ghemawat published a foundational paper
titled "MapReduce: Simplified Data Processing on Large Clusters", which described
the MapReduce framework and how it could scale across thousands of machines to
process terabytes of data.

Key elements in their solution: Map and Reduce: They simplified the distributed
programming model by breaking it into two primary phases: Map for data transformation and
Reduce for data aggregation.

 Fault Tolerance: Their framework was designed to handle machine failures automatically, restarting failed tasks and ensuring that no data was lost.
 Data Locality: Rather than moving data between machines, they brought the
computation closer to where the data was stored, reducing network overhead.
 Adoption at Google: Google used MapReduce internally to power various
applications, such as web indexing, data mining, log analysis, and machine learning.
The success of the framework at Google demonstrated the practicality of this
approach for handling large-scale data processing.

 Open Source Adaptation: Hadoop (2005): In 2005, Doug Cutting and Mike
Cafarella, inspired by Google’s MapReduce paper, created an open-source
implementation of the MapReduce framework called Hadoop. Originally developed
as part of the Nutch search engine project, Hadoop eventually became an independent
open-source project under the Apache Software Foundation.
 Hadoop incorporated both the MapReduce programming model and a distributed
storage system based on Google’s GFS, which became HDFS (Hadoop Distributed
File System).
 Hadoop quickly gained traction outside of Google as it provided a free, open-source
way to process big data in distributed clusters. It became the de facto tool for
companies like Yahoo, Facebook, and many others who needed scalable data
processing.

MapReduce in the Ecosystem (2000s-2020s): Over time, the MapReduce programming model was adopted in various forms across many industries. Tools like Apache Pig and Hive provided higher-level abstractions over Hadoop's MapReduce to make it more accessible to users unfamiliar with low-level programming. The model's success led to the rise of numerous other big data processing tools, though MapReduce eventually faced competition from more advanced systems like Apache Spark and Apache Flink, which offered faster, in-memory data processing and better support for iterative algorithms.

 Current Status: While the traditional MapReduce framework has been largely
replaced or augmented by more flexible and efficient frameworks, the fundamental
principles behind MapReduce (parallel processing, key-value transformation, fault
tolerance, etc.) continue to influence modern data processing systems.
 Even though frameworks like Spark and Flink are now more popular for certain tasks,
Hadoop’s MapReduce remains in use for batch processing, especially in legacy
systems and for extremely large datasets.
 Key Publications:
 "MapReduce: Simplified Data Processing on Large Clusters" (2004) by Jeffrey Dean and Sanjay Ghemawat: The original paper that introduced the MapReduce model.
 "The Google File System" (2003) by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: Describes the distributed file system used in conjunction with MapReduce.

Key points:

 Origin: Created by Google in the early 2000s to address the challenges of processing
massive amounts of data.
 Key Contributors: Jeffrey Dean and Sanjay Ghemawat.
 Key Innovations: Abstracting parallel processing into two key functions (Map and
Reduce) and handling fault tolerance and scalability transparently for developers.
 Influence: Inspired the creation of the Hadoop ecosystem and shaped the modern
landscape of distributed data processing systems.

UNDERSTANDING THE MAP FUNCTION:
In R programming, the map function is not part of base R but comes from the purrr package, which is part of the tidyverse. It is used in functional programming to apply a function to each element of a list (or vector) and return the results. The map function is versatile, supporting various output types (lists, vectors, etc.) and enabling a functional programming style in R. It is similar to base R's lapply function, but map offers more options and flexibility.

Basic Syntax of map: map(.x, .f)

.x: The list or vector to iterate over.
.f: The function to apply to each element of .x.

Types of map Functions:

There are several variants of the map function, depending on the type of output you need:

 map(): Returns a list.
 map_lgl(): Returns a logical vector (TRUE or FALSE).
 map_int(): Returns an integer vector.
 map_dbl(): Returns a double (numeric) vector.
 map_chr(): Returns a character vector.

Example: Basic Usage of map:

library(purrr)

# Create a list

numbers <- list(1, 2, 3, 4)

# Apply a function to each element (square each number)

squared_numbers <- map(numbers, function(x) x^2)

print(squared_numbers)

# Output: [[1]] 1, [[2]] 4, [[3]] 9, [[4]] 16

In this example, the map function applies the squaring operation (x^2) to each element of the numbers list.

Using Anonymous Functions:

# You can use anonymous (lambda) functions directly in map

squared_numbers <- map(numbers, ~ .x^2)

print(squared_numbers)

# Output: [[1]] 1, [[2]] 4, [[3]] 9, [[4]] 16

Here, the ~ operator is shorthand for an anonymous function, with .x representing each element of the list.

Returning Different Data Types: You can use the typed map_*() functions to ensure the output is of a certain type. For example:

# Return a numeric (double) vector

squared_numbers_dbl <- map_dbl(numbers, ~ .x^2)

print(squared_numbers_dbl)

# Output: 1 4 9 16

In this case, map_dbl() ensures that the result is a numeric vector instead of a list. The other typed variants behave analogously; see the sketch below.
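A quick sketch of the remaining typed variants, reusing the numbers list defined earlier (the expressions are illustrative):

# Each typed variant coerces its results to the named atomic type

map_lgl(numbers, ~ .x > 2)           # FALSE FALSE TRUE TRUE
map_int(numbers, ~ as.integer(.x))   # 1 2 3 4
map_chr(numbers, ~ paste0("n", .x))  # "n1" "n2" "n3" "n4"

Working with Named Lists: If the input list has names, map will retain those names in the output.

# Named list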

people <- list(Alice = 25, Bob = 30, Charlie = 22)

# Apply a function to each element (increase age by 5)

new_ages <- map(people, ~ .x + 5)

print(new_ages)

# Output: Alice 30, Bob 35, Charlie 27

Using map with Multiple Inputs: map2() can be used to apply a function to two lists simultaneously.

# Two lists of numbers

list1 <- list(1, 2, 3)

list2 <- list(4, 5, 6)

# Sum corresponding elements from both lists

sum_lists <- map2(list1, list2, ~ .x + .y)

print(sum_lists)

# Output: [[1]] 5, [[2]] 7, [[3]] 9

For more than two lists, you can use pmap(), as in the sketch below.
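A short sketch, adding a third list (list3 is introduced here for illustration):

# pmap() takes a list of inputs and an n-argument function

list3 <- list(7, 8, 9)

sum_three <- pmap(list(list1, list2, list3), function(a, b, c) a + b + c)
print(sum_three)

# Output: [[1]] 12, [[2]] 15, [[3]] 18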
Vectorization and purrr's Advantages: Compared to traditional looping constructs like for loops or base R functions like lapply, map provides a cleaner, more expressive way to perform repetitive operations in R, especially when combined with the tidyverse.

Key Differences between map and Other R Functions:

map vs. lapply: lapply (base R) always returns a list, whereas the map_*() functions can return various types (logical, integer, numeric, character). map therefore gives you more explicit control over the output type.

map vs. for Loops: map leads to more concise, readable code when performing repetitive operations over lists or vectors. It encourages a functional programming style, reducing the need for manual indexing and improving readability.
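A quick side-by-side sketch of the same doubling operation in the different styles:

nums <- list(1, 2, 3)

lapply(nums, function(x) x * 2)  # base R: always returns a list
map(nums, ~ .x * 2)              # purrr: also returns a list
map_dbl(nums, ~ .x * 2)          # typed variant: numeric vector 2 4 6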

ADDING THE REDUCE FUNCTION:

In R, the reduce function is part of the purrr package (from the tidyverse) and is used for reducing or aggregating the elements of a list or vector. It takes a binary function and applies it iteratively to combine all the elements of a list into a single value. The reduce function works similarly to base R's Reduce() function, but the purrr version integrates better with the map family of functions and is generally more flexible.

Syntax of reduce: reduce(.x, .f, ..., .init)

.x: A list or vector to be reduced.
.f: A binary function to apply to elements of .x (must take two arguments).
...: Additional arguments passed on to .f.
.init: Optional initial value for the reduction.

Basic Example of reduce:

library(purrr)

# Define a vector of numbers

numbers <- c(1, 2, 3, 4)

# Use reduce to sum the elements

sum_result <- reduce(numbers, `+`)

print(sum_result)

# Output: 10

Explanation:

 The call reduce(numbers, `+`) sums all the elements in the vector by applying the + function iteratively (the backticks pass the + operator as a function).
 It performs the operation as: 1 + 2 + 3 + 4, yielding the result 10.
Example: Using a Custom Function with reduce:

# Define a custom function that multiplies two numbers

multiply <- function(x, y) {
  x * y
}

# Use reduce to multiply all the elements of the vector

product_result <- reduce(numbers, multiply)

print(product_result)

# Output: 24

Explanation: The reduce function applies the multiply function iteratively: first, 1 * 2 = 2; then, 2 * 3 = 6; finally, 6 * 4 = 24. The final output is 24.
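Direction matters when the binary function is not commutative. A short sketch, assuming purrr 0.3 or later (which added the .dir argument):

# Subtraction is not commutative, so the fold direction changes the result

subtract <- function(x, y) x - y

reduce(numbers, subtract)                     # ((1 - 2) - 3) - 4 = -8
reduce(numbers, subtract, .dir = "backward")  # 1 - (2 - (3 - 4)) = -2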


Example: Using reduce with Strings:

# Define a vector of character strings

words <- c("Hello", "World", "from", "R")

# Use reduce to concatenate all the words into a sentence

sentence <- reduce(words, paste)

print(sentence)

# Output: "Hello World from R" Explanation: The reduce function concatenates the words
iteratively using the paste function: First: "Hello" "World" Then: "Hello World" "from"
Finally: "Hello World from" "R" The final result is "Hello World from R".

Example: Using .init in reduce: The .init argument sets an initial value for the reduction process, which can be useful for more controlled operations.

# Use reduce to sum the elements, starting from an initial value of 10

sum_with_init <- reduce(numbers, `+`, .init = 10)

print(sum_with_init)

# Output: 20

Explanation: The reduction starts with the initial value of 10 and then adds each element of the vector: 10 + 1 = 11, 11 + 2 = 13, 13 + 3 = 16, 16 + 4 = 20. The final result is 20.

Base R Alternative: Reduce(): R's base package provides the Reduce() function, which performs a similar task.

# Base R's Reduce function

sum_base_r <- Reduce(`+`, numbers)

print(sum_base_r)

# Output: 10

The result is identical to the output from purrr::reduce(). Both functions perform iterative reduction, but purrr::reduce() integrates better with other tidyverse functions.

PUTTING MAP AND REDUCE TOGETHER:

In R, map and reduce are operations commonly used in functional programming, especially for data processing. The map function applies a function to each element of a list or vector, while reduce combines the elements of a list into a single value using a binary function. Here's how to use them together in R using the purrr package.

Installing and Loading the purrr Package: First, install and load the purrr package, which provides functional programming tools such as map and reduce.

install.packages("purrr")

library(purrr)

Example: Using map and reduce Together: Suppose you have a list of vectors and want to first transform (map) each vector by multiplying its elements by 2 and then reduce (combine) them by summing all vectors element-wise.

# Example list of vectors

list_of_vectors <- list(
  c(1, 2, 3),
  c(4, 5, 6),
  c(7, 8, 9)
)

# Step 1: Use map to multiply each vector by 2

mapped <- map(list_of_vectors, ~ .x * 2)

# Step 2: Use reduce to sum the vectors element-wise

result <- reduce(mapped, `+`)

# Output the result

print(result)

Explanation:

Map: The map function applies the operation .x * 2 (multiplying each element by 2) to each vector in list_of_vectors.

Reduce: The reduce function takes the list of mapped vectors and applies the binary function + to sum them element-wise.

Output: [1] 24 30 36

In this example, each vector was first multiplied by 2 (the map operation), and the reduce operation then combined the transformed vectors by adding them element-wise. This pattern of using map followed by reduce is useful in many data processing tasks where you first want to transform data and then aggregate it.
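As a closing sketch, the same map-then-reduce pattern reproduces the word-count example from the start of this unit in purrr style, merging counts by word name as in the corrected base-R version:

library(purrr)

text_data <- c("data science is amazing",
               "data science involves statistics",
               "R is great for data analysis")

# Map: build a per-line word-count table

mapped_counts <- map(text_data, ~ table(strsplit(.x, " ")[[1]]))

# Reduce: merge two count tables by expanding names and re-counting

word_counts <- reduce(mapped_counts, function(a, b) {
  table(c(rep(names(a), a), rep(names(b), b)))
})

print(word_counts)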

OPTIMIZING MAP REDUCE TASKS:

Optimizing map and reduce tasks in R involves improving both computational efficiency and
readability of the code, especially when dealing with large datasets or complex operations. R
offers several ways to optimize such tasks:

Key Strategies for Optimization:

 Vectorization: Use R's vectorized operations to avoid explicit loops.
 Parallel Processing: Take advantage of multi-core processors using parallel computing libraries.
 Efficient Data Structures: Use efficient data structures such as matrices or data tables to store and manipulate data.
 Memory Management: Use in-place operations to minimize memory overhead, especially for large datasets.

Example: Optimizing a Map-Reduce Task Using Parallelization: Here, we'll optimize the earlier map-reduce task with parallel processing using the parallel package.

Problem Setup: We have a list of vectors, and we want to first multiply each vector by 2 (map) and then sum the vectors element-wise (reduce).

Step 1: Parallel Map-Reduce: We'll use the parallel package to optimize the mapping process by distributing the work across multiple cores.

Code:

# Load necessary libraries

library(parallel)

library(purrr)

# Example list of vectors

list_of_vectors <- list(
  c(1, 2, 3),
  c(4, 5, 6),
  c(7, 8, 9),
  c(10, 11, 12)
)

# Detect number of cores available

num_cores <- detectCores() - 1 # Keep one core free for system tasks

# Step 1: Parallel map function using mclapply from the 'parallel' package
# (mclapply relies on forking, so mc.cores > 1 is not supported on Windows)

parallel_mapped <- mclapply(list_of_vectors, function(x) x * 2, mc.cores = num_cores)

# Step 2: Reduce operation using purrr's reduce

optimized_result <- reduce(parallel_mapped, `+`)

# Output the result

print(optimized_result)

Explanation:

Parallel Mapping (mclapply): The mclapply function from the parallel package distributes the map operation across multiple cores, speeding up the computation when applied to a large list of vectors. The number of available cores is detected automatically (num_cores), but you can specify how many cores to use.

Reduction: Once the map operation is done, we apply the reduce function from the purrr package to sum the vectors element-wise.

Output:

[1] 44 52 60

Step 2: Further Optimization with data.table: For even larger datasets, we can use efficient data structures like data.table to manage and process the data faster:

# Load the data.table package for optimized memory management

library(data.table)

# Convert the list of vectors to a data.table (one row per vector,
# so that each original vector position becomes a column)

dt <- rbindlist(lapply(list_of_vectors, function(v) as.data.table(as.list(v))))

# Multiply each element by 2 in a vectorized operation

dt <- dt * 2

# Sum each column to reduce the data

optimized_dt_result <- colSums(dt)

# Output the result

print(optimized_dt_result)

Output:

V1 V2 V3

44 52 60

Performance Gains:

Parallelization: Using mclapply splits the work among multiple cores, which speeds up computation when applied to large lists or complex operations.

Efficient Data Structures: The data.table library is optimized for large datasets, providing faster operations than base R lists or data frames.

Vectorization: By vectorizing the map operation (dt * 2), the task is executed in a single
step without looping, which is much faster.
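A rough illustration of the gap (timings vary by machine; this is a sketch, not a rigorous benchmark):

# Compare a vectorized multiply against an explicit R-level loop

x <- runif(1e7)

system.time(y1 <- x * 2)  # vectorized: a single pass in compiled code

system.time({             # explicit loop: typically much slower
  y2 <- numeric(length(x))
  for (i in seq_along(x)) y2[i] <- x[i] * 2
})

identical(y1, y2)  # TRUE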
