Chapter 3 - Big Data Management


BIG DATA MANAGEMENT

DISTRIBUTED DATA PROCESSING and MAPREDUCE


Distributed Processing

 Distributed processing makes use of two or more (usually, many more) computers that are networked together and all working on a single task in a well-coordinated fashion.
 The individual computers involved can be ordinary desktop or
laptop machines, high-end machines, or specialized servers that
carry out specific tasks like storage and retrieval of datasets.
 In a complex distributed system, sub-components of the system
(a subgroup of networked computers) can be devoted to a
specific task while other groups concentrate on separate tasks.
Distributed Processing

 With proper communications links and instructions to the machines, a series of distributed computers can do the work of much more powerful stand-alone systems, and can even reach the processing power and speeds of the fastest supercomputers.
 Many gaming systems rely on distributed processing
setups, where gamers' individual machines carry out
some of the processing in addition to more central
servers providing the gaming backbone.
Old Way

 Before the MapReduce framework existed, parallel and distributed processing was done in a traditional, hand-rolled way.
 As an example, suppose we have a weather log containing the daily average temperature for the years 2000 to 2015, and we want to find the day with the highest temperature in each year.
Old Way

 The data will be split into smaller parts, or blocks, and stored on different machines. Then, we will find the highest temperature in the part stored on each machine.
 Finally, we will combine the results received from each of the machines to produce the final output. Let us look at the challenges associated with this traditional approach:
Old Way

 Critical path problem: the amount of time taken to finish the job without delaying the next milestone or the actual completion date. If any of the machines delays its part of the job, the whole work gets delayed.
 Reliability problem: what if one of the machines working on a part of the data fails? Managing this failover becomes a challenge.
 Equal split issue: how do we divide the data into smaller chunks so that each machine gets an even share of the data to work with? In other words, how do we divide the data so that no individual machine is overloaded or underutilized?
Old Way

 Single split may fail: if any of the machines fails to provide its output, we will not be able to calculate the result. So, there should be a mechanism to ensure the fault-tolerance capability of the system.
 Aggregation of the result: there should be a mechanism to aggregate the results generated by each of the machines to produce the final output.
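To make the traditional approach concrete, here is a minimal single-machine sketch of the "old way" described above (the weather-log records and the chunking scheme are assumptions for illustration): split the log into parts, let each "machine" find the hottest day per year in its part, and combine the partial results at the end.

```python
# Hypothetical weather log: (date, year, average_temperature) records.
RECORDS = [
    ("2000-07-14", 2000, 38.2),
    ("2000-08-02", 2000, 39.5),
    ("2001-06-30", 2001, 37.1),
    ("2001-07-21", 2001, 40.3),
]

def split_into_chunks(records, num_chunks):
    """Manually split the data into smaller parts, one per 'machine'."""
    return [records[i::num_chunks] for i in range(num_chunks)]

def hottest_day_per_year(chunk):
    """Work done by one machine: the hottest day per year within its own chunk."""
    best = {}
    for date, year, temp in chunk:
        if year not in best or temp > best[year][1]:
            best[year] = (date, temp)
    return best

def combine(partial_results):
    """Merge the per-machine partial results into the final answer."""
    final = {}
    for partial in partial_results:
        for year, (date, temp) in partial.items():
            if year not in final or temp > final[year][1]:
                final[year] = (date, temp)
    return final

chunks = split_into_chunks(RECORDS, num_chunks=2)
partials = [hottest_day_per_year(c) for c in chunks]   # sequential stand-in for the machines
print(combine(partials))   # {2000: ('2000-08-02', 39.5), 2001: ('2001-07-21', 40.3)}
```

Every step here (splitting, scheduling, failure handling, combining) has to be written and maintained by hand, which is exactly what the challenges above point at.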
Old Way

 To overcome these issues, we have the MapReduce framework, which allows us to perform such parallel computations without worrying about issues like reliability and fault tolerance.
 Therefore, MapReduce gives you the flexibility to write your code logic without caring about the design issues of the distributed system.
Advantages of Distributed Processing
in Business

 There are a number of advantages of a distributed system over a centralized system that factor into a business's decision to distribute its processing load. Among the key factors:
 Cost
 Redundancy and Reliability
 Sustainability
Advantages of Distributed Processing
in Business

 Cost: Distributed, multi-component systems can be less costly than a single, centralized system.
 In the SETI example, the cost savings are obvious. Rather than
invest in a large-scale mainframe system or supercomputer
costing hundreds of thousands of dollars, the SETI program
was able to make use of "free labor" provided by many
individuals who volunteered their processing resources.
 Even in a business setting, though, the use of multiple
personal computers networked together can be less of an
investment than a single large data processing system.
Advantages of Distributed Processing
in Business

 Redundancy and Reliability: If your one huge central computer breaks down, you're out of luck. Information processing comes to a halt until the system is back up and running.
 Even central systems with robust backup capabilities are
still prone to disruptive failures.
 In a distributed framework, however, loss of one or a few
machines is not necessarily a big deal, as there are other
computers linked into the network that can pick up the
slack.
Advantages of Distributed Processing
in Business

 Sustainability: Networking numerous data processors to perform a single task can result in energy savings over a centralized data processing system.
 Remote data centers can be sited in environments that
are cool, thereby reducing the need for artificial cooling,
or that have an ample supply of "green electricity" such
as that produced by hydropower or geothermal energy.
 This is not only a cost-saving measure, but can lower the
overall greenhouse gas footprint of the system.
Google File System

 Many datasets are too large to fit on a single machine. Unstructured data may not be easy to insert into a database.
 Distributed file systems store data across a large
number of servers. The Google File System (GFS) is a
distributed file system used by Google in the early
2000s. It is designed to run on a large number of
cheap servers.
Google File System

 The purpose behind GFS was the ability to store and access large files, and by large we mean files that cannot be stored on a single hard drive.
 The idea is to divide these files into manageable chunks of 64 MB, store these chunks on multiple nodes, and keep the mapping between files and chunks inside the file system as well.
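As a rough illustration of this idea (a sketch only, not Google's implementation; the chunk naming and on-disk layout are made up for the example), a file can be split into 64 MB chunks while a file-to-chunk mapping is kept on the side:

```python
import os
import uuid

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in GFS

def split_into_chunks(path, chunk_dir):
    """Split `path` into 64 MB chunks and return the ordered list of chunk IDs."""
    os.makedirs(chunk_dir, exist_ok=True)
    mapping = []
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunk_id = uuid.uuid4().hex          # stand-in for a GFS chunk handle
            with open(os.path.join(chunk_dir, chunk_id), "wb") as out:
                out.write(data)
            mapping.append(chunk_id)
    return mapping

# The mapping plays the role of the master's index: file name -> ordered chunk IDs.
# index = {"big_dataset.log": split_into_chunks("big_dataset.log", "chunks/")}
```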
Google File System

 GFS assumes that it runs on many inexpensive commodity components that can often fail; therefore, it should consistently perform failure monitoring and recovery.
 It can store many large files simultaneously and
allows for two kinds of reads to them: small random
reads and large streaming reads. Instead of rewriting
files, GFS is optimized towards appending data to
existing files in the system.
Google File System

 The GFS master node stores the index of files, while GFS chunk servers store the actual chunks in the file systems of multiple Linux nodes.
 The chunks that are stored in the GFS are replicated,
so the system can tolerate chunk server failures. Data
corruption is also detected using checksums, and GFS
tries to compensate for these events as soon as
possible.
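To illustrate how checksums detect data corruption (a simplified sketch, not GFS's actual per-block checksum scheme), a chunk server could store a hash next to each chunk and verify it on every read, falling back to another replica when the check fails:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Checksum used to detect silent corruption of a stored chunk."""
    return hashlib.sha256(data).hexdigest()

def read_chunk(data: bytes, stored_checksum: str) -> bytes:
    """Return the chunk if intact; otherwise signal that a replica must be used."""
    if checksum(data) != stored_checksum:
        raise IOError("chunk corrupted: re-read from another replica")
    return data

chunk = b"some 64 MB of data (shortened here)"
ok = read_chunk(chunk, checksum(chunk))          # passes
# read_chunk(chunk + b"x", checksum(chunk))      # would raise: corruption detected
```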
MapReduce

 MapReduce is a programming framework that allows us to perform distributed, parallel processing on large data sets across a cluster of machines.
MapReduce

 MapReduce is a programming model which consists of writing map and reduce functions.
 Map accepts key/value pairs and produces a sequence of key/value pairs.
 Then, the data is shuffled to group values by key. After that, reduce takes all the values with the same key and produces a new key/value pair.
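In code, the programmer only writes these two functions and the framework performs the shuffle between them. A minimal sketch for the earlier weather-log example (the "YYYY-MM-DD,temperature" line format is an assumption for illustration):

```python
def map_fn(key, value):
    """key: line offset (ignored); value: one log line 'YYYY-MM-DD,avg_temp'."""
    date, temp = value.split(",")
    year = date[:4]
    yield year, (date, float(temp))    # emit (year, (date, temperature))

def reduce_fn(key, values):
    """key: year; values: every (date, temperature) pair emitted for that year."""
    hottest = max(values, key=lambda dt: dt[1])
    yield key, hottest                 # emit (year, (hottest_date, temperature))
```

The framework's shuffle phase groups all pairs with the same year before reduce_fn is called, so the programmer never writes the grouping or distribution logic.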
MapReduce

 A Word Count Example of MapReduce


 Let us understand how MapReduce works by taking an example where we have a text file called example.txt whose contents are as follows:

 Dear, Bear, River, Car, Car, River, Deer, Car and Bear

 Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of those unique words.
MapReduce

 A Word Count Example of MapReduce


 First, we divide the input into three splits as shown in the
figure. This will distribute the work among all the map nodes.
 Then, we tokenize the words in each of the mappers and give a
hardcoded value (1) to each of the tokens or words. The
rationale behind giving a hardcoded value equal to 1 is that
every word, in itself, will occur once.
 Now, a list of key-value pairs will be created where the key is the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
MapReduce

 A Word Count Example of MapReduce


 After the mapper phase, a partition process takes place where sorting
and shuffling happen so that all the tuples with the same key are sent
to the corresponding reducer.
 So, after the sorting and shuffling phase, each reducer will have a
unique key and a list of values corresponding to that very key. For
example, Bear, [1,1]; Car, [1,1,1].., etc.
 Now, each reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list of values [1,1] for the key Bear. Then, it counts the number of ones in that list and gives the final output: Bear, 2.
 Finally, all the output key/value pairs are then collected and written in
the output file.
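The whole walkthrough can be simulated on a single machine in a few lines (a sketch only; a real MapReduce framework distributes the same three phases across nodes, and the three-line layout of example.txt is taken from the walkthrough above):

```python
from collections import defaultdict

text = "Dear Bear River\nCar Car River\nDeer Car Bear"   # example.txt, one split per line

# Map: tokenize each split and emit (word, 1) for every token.
mapped = []
for line in text.splitlines():
    for word in line.split():
        mapped.append((word, 1))

# Shuffle and sort: group all values belonging to the same key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)        # e.g. 'Bear' -> [1, 1], 'Car' -> [1, 1, 1]

# Reduce: sum the list of ones for each key to get the final counts.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(sorted(counts.items()))
# [('Bear', 2), ('Car', 3), ('Dear', 1), ('Deer', 1), ('River', 2)]
```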
Advantages of MapReduce

 Parallel Processing
 In MapReduce, we are dividing the job among multiple
nodes and each node works with a part of the job
simultaneously.
 So, MapReduce is based on the divide-and-conquer paradigm, which helps us to process the data using different machines.
 As the data is processed in parallel by multiple machines instead of a single machine, the time taken to process the data is reduced tremendously, as shown in the figure below.
Advantages of MapReduce

 Parallel Processing (figure)
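As a rough single-machine illustration of this divide-and-conquer idea (worker processes stand in for cluster nodes; the chunking is made up for the example), the map work can be farmed out in parallel and the partial counts combined afterwards:

```python
from multiprocessing import Pool
from collections import Counter

def count_words(chunk_of_lines):
    """Map-side work done independently by each 'node' on its own chunk."""
    counter = Counter()
    for line in chunk_of_lines:
        counter.update(line.split())
    return counter

if __name__ == "__main__":
    lines = ["Dear Bear River", "Car Car River", "Deer Car Bear"]
    chunks = [lines[0:1], lines[1:2], lines[2:3]]        # one chunk per worker
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, chunks)   # chunks processed in parallel
    total = sum(partial_counts, Counter())               # combine (reduce) step
    print(total)   # Counter({'Car': 3, 'Bear': 2, 'River': 2, 'Dear': 1, 'Deer': 1})
```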
Advantages of MapReduce

 Data Locality
 Instead of moving data to the processing unit, we are
moving the processing unit to the data in the
MapReduce Framework.
 In the traditional system, we used to bring data to the
processing unit and process it.
Advantages of MapReduce

 But, as the data grew and became very large, bringing this huge amount of data to the processing unit posed the following issues:

 Moving huge amounts of data to the processing unit is costly and degrades network performance.
 Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
 The master node can get overburdened and may fail.
Advantages of MapReduce

 Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data.
 So, as you can see in the above image, the data is distributed among multiple nodes, where each node processes the part of the data residing on it.
 This allows us to have the following advantages:
 It is very cost effective to move the processing unit to the data.
 The processing time is reduced as all the nodes are working
with their part of the data in parallel.
 Every node gets a part of the data to process and therefore,
there is no chance of a node getting overburdened.
Usage of MapReduce in industry

 Analysis of logs, data analysis, recommendation mechanisms, fraud detection, user behavior analysis, genetic algorithms, scheduling problems, and resource planning, among others, are applications that use MapReduce.
 Some practical applications that use MapReduce are:
 Social Networks
 Entertainment
 Electronic Commerce
 Fraud Detection
 Search and Advertisements
 Data Warehouse
Usage of MapReduce in industry

 Social Networks
 Many social network features, such as who visited your LinkedIn profile or who read your post on Facebook or Twitter, can be evaluated using the MapReduce programming model.
Usage of MapReduce in industry

 Entertainment
 Netflix uses Hadoop and MapReduce to solve problems such as discovering the most popular movies based on what you watched and what you like.
 It provides suggestions to registered users, taking into account their interests.
 MapReduce can determine how users are watching movies by analyzing their logs and clicks.
Usage of MapReduce in industry

 Electronic Commerce
 Many e-commerce providers, such as Amazon, Walmart, and eBay, use the MapReduce programming model to identify favorite products based on users' interests or buying behavior.
 It includes creating product recommendation mechanisms
for e-commerce catalogs, analyzing site records, purchase
history, user interaction logs, and so on.
Usage of MapReduce in industry

 Electronic Commerce
 It is also used to build a sentiment profile of users for a particular product by analyzing comments and reviews, or to analyze search logs to identify which items are most popular in searches and which products are missing.
 Many Internet service providers use MapReduce to analyze
site records and understand site visits, engagement,
locations, mobile devices, and browsers.
Usage of MapReduce in industry

 Fraud Detection
 Hadoop and MapReduce are used in the financial industry, by companies such as banks, insurance providers, and payment services, for fraud detection, trend identification, and business metrics derived from transaction analysis.
 Banks analyze credit card data and the related expenses to categorize those expenses and make recommendations for different offers, based on anonymized purchasing behavior.
Usage of MapReduce in industry

 Search and Advertisement


 It can be to used to analyze and understand search
behavior, trends, and missing results for specific keywords.
 Google and Yahoo use MapReduce to understand users’
behavior, such as popular searches over a period of an
event such as presidential elections.
 Google AdWords uses MapReduce to understand the
impressions of ads served, click-through rates, and
engagement behavior of users.
Usage of MapReduce in industry

 Data Warehouse
 We can utilize MapReduce to analyze large data volumes in
data warehouses while implementing specific business logic
for data insights.
