Chapter 3 - Big Data Management


BIG DATA MANAGEMENT

DISTRIBUTED DATA PROCESSING and MAPREDUCE


Distributed Processing

 Distributed processing makes use of two or more (usually, many more) computers that are networked together and all working on a single task in a well-coordinated fashion.
 The individual computers involved can be ordinary desktop or
laptop machines, high-end machines, or specialized servers that
carry out specific tasks like storage and retrieval of datasets.
 In a complex distributed system, sub-components of the system
(a subgroup of networked computers) can be devoted to a
specific task while other groups concentrate on separate tasks.
Distributed Processing

 With proper communications links and instructions to the machines, a series of distributed computers can do the work of much more powerful stand-alone systems, and can even reach the processing power and speeds of the fastest supercomputers.
 Many gaming systems rely on distributed processing
setups, where gamers' individual machines carry out
some of the processing in addition to more central
servers providing the gaming backbone.
Old Way

 Before the MapReduce framework existed, parallel and distributed processing was done in a traditional, hand-rolled way.
 As an example, suppose we have a weather log containing the daily average temperature for the years 2000 to 2015, and we want to find the day with the highest temperature in each year.
Old Way

 The data will be split into smaller parts, or blocks, and stored on different machines. Then, we will find the highest temperature in the part stored on each machine.
 Finally, we will combine the results received from each of the machines to produce the final output. Let us look at the challenges associated with this traditional approach:
Old Way

 Critical path problem: the amount of time taken to finish the job without delaying the next milestone or the actual completion date. If any of the machines delays its part of the job, the whole work gets delayed.
 Reliability problem: what if one of the machines working on a part of the data fails? Managing this failover becomes a challenge.
 Equal split issue: how do we divide the data into smaller chunks so that each machine gets an even share of the data to work with? In other words, how do we divide the data so that no individual machine is overloaded or underutilized?
Old Way

 Single split may fail: if any of the machines fails to provide its output, we will not be able to calculate the result. So, there should be a mechanism to ensure the fault-tolerance capability of the system.
 Aggregation of the result: there should be a mechanism to aggregate the results generated by each of the machines to produce the final output.
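To make the traditional approach concrete, here is a minimal single-machine sketch of the "old way" described above (the weather-log records and the chunking scheme are assumptions for illustration): split the log into parts, let each "machine" find the hottest day per year in its part, and combine the partial results at the end.

```python
# Hypothetical weather log: (date, year, average_temperature) records.
RECORDS = [
    ("2000-07-14", 2000, 38.2),
    ("2000-08-02", 2000, 39.5),
    ("2001-06-30", 2001, 37.1),
    ("2001-07-21", 2001, 40.3),
]

def split_into_chunks(records, num_chunks):
    """Manually split the data into smaller parts, one per 'machine'."""
    return [records[i::num_chunks] for i in range(num_chunks)]

def hottest_day_per_year(chunk):
    """Work done by one machine: the hottest day per year within its own chunk."""
    best = {}
    for date, year, temp in chunk:
        if year not in best or temp > best[year][1]:
            best[year] = (date, temp)
    return best

def combine(partial_results):
    """Merge the per-machine partial results into the final answer."""
    final = {}
    for partial in partial_results:
        for year, (date, temp) in partial.items():
            if year not in final or temp > final[year][1]:
                final[year] = (date, temp)
    return final

chunks = split_into_chunks(RECORDS, num_chunks=2)
partials = [hottest_day_per_year(c) for c in chunks]   # sequential stand-in for the machines
print(combine(partials))   # {2000: ('2000-08-02', 39.5), 2001: ('2001-07-21', 40.3)}
```

Every step here (splitting, scheduling, failure handling, combining) has to be written and maintained by hand, which is exactly what the challenges above point at.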
Old Way

 To overcome these issues, we have the MapReduce framework, which allows us to perform such parallel computations without worrying about issues like reliability and fault tolerance.
 Therefore, MapReduce gives you the flexibility to write your code logic without caring about the design issues of the distributed system.
Advantages of Distributed Processing
in Business

 There are a number of advantages of a distributed system over a centralized system that factor into a business's decision to distribute its processing load. Among the key factors:
 Cost
 Redundancy and Reliability
 Sustainability
Advantages of Distributed Processing
in Business

 Cost: Distributed, multi-component systems can be less costly than a single, centralized system.
 In the SETI example, the cost savings are obvious. Rather than
invest in a large-scale mainframe system or supercomputer
costing hundreds of thousands of dollars, the SETI program
was able to make use of "free labor" provided by many
individuals who volunteered their processing resources.
 Even in a business setting, though, the use of multiple
personal computers networked together can be less of an
investment than a single large data processing system.
Advantages of Distributed Processing
in Business

 Redundancy and Reliability: If your one huge central computer breaks down, you're out of luck. Information processing comes to a halt until the system is back up and running.
 Even central systems with robust backup capabilities are
still prone to disruptive failures.
 In a distributed framework, however, loss of one or a few
machines is not necessarily a big deal, as there are other
computers linked into the network that can pick up the
slack.
Advantages of Distributed Processing
in Business

 Sustainability: Networking numerous data processors to perform a single task can result in energy savings over a centralized data processing system.
 Remote data centers can be sited in environments that
are cool, thereby reducing the need for artificial cooling,
or that have an ample supply of "green electricity" such
as that produced by hydropower or geothermal energy.
 This is not only a cost-saving measure, but can lower the
overall greenhouse gas footprint of the system.
Google File System

 Many datasets are too large to fit on a single machine. Unstructured data may not be easy to insert into a database.
 Distributed file systems store data across a large
number of servers. The Google File System (GFS) is a
distributed file system used by Google in the early
2000s. It is designed to run on a large number of
cheap servers.
Google File System

 The purpose behind GFS was the ability to store and access large files, and by large we mean files that cannot be stored on a single hard drive.
 The idea is to divide these files into manageable chunks of 64 MB, store these chunks on multiple nodes, and keep the mapping between files and chunks inside the file system as well.
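As a rough illustration of this idea (a sketch only, not Google's implementation; the chunk naming and on-disk layout are made up for the example), a file can be split into 64 MB chunks while a file-to-chunk mapping is kept on the side:

```python
import os
import uuid

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as in GFS

def split_into_chunks(path, chunk_dir):
    """Split `path` into 64 MB chunks and return the ordered list of chunk IDs."""
    os.makedirs(chunk_dir, exist_ok=True)
    mapping = []
    with open(path, "rb") as f:
        while True:
            data = f.read(CHUNK_SIZE)
            if not data:
                break
            chunk_id = uuid.uuid4().hex          # stand-in for a GFS chunk handle
            with open(os.path.join(chunk_dir, chunk_id), "wb") as out:
                out.write(data)
            mapping.append(chunk_id)
    return mapping

# The mapping plays the role of the master's index: file name -> ordered chunk IDs.
# index = {"big_dataset.log": split_into_chunks("big_dataset.log", "chunks/")}
```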
Google File System

 GFS assumes that it runs on many inexpensive commodity components that can often fail; therefore, it should consistently perform failure monitoring and recovery.
 It can store many large files simultaneously and
allows for two kinds of reads to them: small random
reads and large streaming reads. Instead of rewriting
files, GFS is optimized towards appending data to
existing files in the system.
Google File System

 The GFS master node stores the index of files, while GFS chunk servers store the actual chunks in the file systems of multiple Linux nodes.
 The chunks that are stored in the GFS are replicated,
so the system can tolerate chunk server failures. Data
corruption is also detected using checksums, and GFS
tries to compensate for these events as soon as
possible.
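To illustrate how checksums detect data corruption (a simplified sketch, not GFS's actual per-block checksum scheme), a chunk server could store a hash next to each chunk and verify it on every read, falling back to another replica when the check fails:

```python
import hashlib

def checksum(data: bytes) -> str:
    """Checksum used to detect silent corruption of a stored chunk."""
    return hashlib.sha256(data).hexdigest()

def read_chunk(data: bytes, stored_checksum: str) -> bytes:
    """Return the chunk if intact; otherwise signal that a replica must be used."""
    if checksum(data) != stored_checksum:
        raise IOError("chunk corrupted: re-read from another replica")
    return data

chunk = b"some 64 MB of data (shortened here)"
ok = read_chunk(chunk, checksum(chunk))          # passes
# read_chunk(chunk + b"x", checksum(chunk))      # would raise: corruption detected
```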
MapReduce

 MapReduce is a programming framework that allows us to perform distributed, parallel processing on large data sets across a cluster of machines.
MapReduce

 MapReduce is a programming model which consists of writing map and reduce functions.
 Map accepts key/value pairs and produces a sequence of key/value pairs.
 Then, the data is shuffled to group values by key. After that, reduce takes all the values with the same key and produces a new key/value pair.
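In code, the programmer only writes these two functions and the framework performs the shuffle between them. A minimal sketch for the earlier weather-log example (the "YYYY-MM-DD,temperature" line format is an assumption for illustration):

```python
def map_fn(key, value):
    """key: line offset (ignored); value: one log line 'YYYY-MM-DD,avg_temp'."""
    date, temp = value.split(",")
    year = date[:4]
    yield year, (date, float(temp))    # emit (year, (date, temperature))

def reduce_fn(key, values):
    """key: year; values: every (date, temperature) pair emitted for that year."""
    hottest = max(values, key=lambda dt: dt[1])
    yield key, hottest                 # emit (year, (hottest_date, temperature))
```

The framework's shuffle phase groups all pairs with the same year before reduce_fn is called, so the programmer never writes the grouping or distribution logic.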
MapReduce

 A Word Count Example of MapReduce


 Let us understand how MapReduce works by taking an example where we have a text file called example.txt whose contents are as follows:

 Dear, Bear, River, Car, Car, River, Deer, Car and Bear

 Now, suppose we have to perform a word count on example.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of those unique words.
MapReduce

 A Word Count Example of MapReduce


 First, we divide the input into three splits as shown in the
figure. This will distribute the work among all the map nodes.
 Then, we tokenize the words in each of the mappers and give a
hardcoded value (1) to each of the tokens or words. The
rationale behind giving a hardcoded value equal to 1 is that
every word, in itself, will occur once.
 Now, a list of key-value pairs will be created where the key is the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
MapReduce

 A Word Count Example of MapReduce


 After the mapper phase, a partition process takes place where sorting
and shuffling happen so that all the tuples with the same key are sent
to the corresponding reducer.
 So, after the sorting and shuffling phase, each reducer will have a
unique key and a list of values corresponding to that very key. For
example, Bear, [1,1]; Car, [1,1,1].., etc.
 Now, each reducer counts the values present in its list of values. As shown in the figure, the reducer gets the list of values [1,1] for the key Bear. Then, it counts the number of ones in that list and gives the final output: Bear, 2.
 Finally, all the output key/value pairs are then collected and written in
the output file.
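The whole walkthrough can be simulated on a single machine in a few lines (a sketch only; a real MapReduce framework distributes the same three phases across nodes, and the three-line layout of example.txt is taken from the walkthrough above):

```python
from collections import defaultdict

text = "Dear Bear River\nCar Car River\nDeer Car Bear"   # example.txt, one split per line

# Map: tokenize each split and emit (word, 1) for every token.
mapped = []
for line in text.splitlines():
    for word in line.split():
        mapped.append((word, 1))

# Shuffle and sort: group all values belonging to the same key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)        # e.g. 'Bear' -> [1, 1], 'Car' -> [1, 1, 1]

# Reduce: sum the list of ones for each key to get the final counts.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(sorted(counts.items()))
# [('Bear', 2), ('Car', 3), ('Dear', 1), ('Deer', 1), ('River', 2)]
```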
Advantages of MapReduce

 Parallel Processing
 In MapReduce, we are dividing the job among multiple
nodes and each node works with a part of the job
simultaneously.
 So, MapReduce is based on the divide-and-conquer paradigm, which helps us to process the data using different machines.
 As the data is processed in parallel by multiple machines instead of a single machine, the time taken to process the data is reduced tremendously, as shown in the figure below.
Advantages of MapReduce

 Parallel Processing (figure)
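As a rough single-machine illustration of this divide-and-conquer idea (worker processes stand in for cluster nodes; the chunking is made up for the example), the map work can be farmed out in parallel and the partial counts combined afterwards:

```python
from multiprocessing import Pool
from collections import Counter

def count_words(chunk_of_lines):
    """Map-side work done independently by each 'node' on its own chunk."""
    counter = Counter()
    for line in chunk_of_lines:
        counter.update(line.split())
    return counter

if __name__ == "__main__":
    lines = ["Dear Bear River", "Car Car River", "Deer Car Bear"]
    chunks = [lines[0:1], lines[1:2], lines[2:3]]        # one chunk per worker
    with Pool(processes=3) as pool:
        partial_counts = pool.map(count_words, chunks)   # chunks processed in parallel
    total = sum(partial_counts, Counter())               # combine (reduce) step
    print(total)   # Counter({'Car': 3, 'Bear': 2, 'River': 2, 'Dear': 1, 'Deer': 1})
```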
Advantages of MapReduce

 Data Locality
 Instead of moving data to the processing unit, we are
moving the processing unit to the data in the
MapReduce Framework.
 In the traditional system, we used to bring data to the
processing unit and process it.
Advantages of MapReduce

 But, as the data grew and became very large, bringing this huge amount of data to the processing unit posed the following issues:

 Moving huge amounts of data to the processing unit is costly and degrades network performance.
 Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
 The master node can get overburdened and may fail.
Advantages of MapReduce

 Now, MapReduce allows us to overcome the above issues by bringing the processing unit to the data.
 So, as you can see in the above image, the data is distributed among multiple nodes, where each node processes the part of the data residing on it.
 This allows us to have the following advantages:
 It is very cost effective to move the processing unit to the data.
 The processing time is reduced as all the nodes are working
with their part of the data in parallel.
 Every node gets a part of the data to process and therefore,
there is no chance of a node getting overburdened.
Usage of MapReduce in industry

 Analysis of logs, data analysis, recommendation mechanisms, fraud detection, user behavior analysis, genetic algorithms, scheduling problems, and resource planning, among others, are applications that use MapReduce.
 Some practical applications that use MapReduce are:
 Social Networks
 Entertainment
 Electronic Commerce
 Fraud Detection
 Search and Advertisements
 Data Warehouse
Usage of MapReduce in industry

 Social Networks
 Many social network features, such as who visited your LinkedIn profile or who read your post on Facebook or Twitter, can be evaluated using the MapReduce programming model.
Usage of MapReduce in industry

 Entertainment
 Netflix uses Hadoop and MapReduce to solve problems such as discovering the most popular movies based on what you watched and what you like.
 It provides suggestions to registered users, taking into account their interests.
 MapReduce can determine how users are watching movies by analyzing their logs and clicks.
Usage of MapReduce in industry

 Electronic Commerce
 Many e-commerce providers, such as Amazon, Walmart, and eBay, use the MapReduce programming model to identify favorite products based on users' interests or buying behavior.
 It includes creating product recommendation mechanisms
for e-commerce catalogs, analyzing site records, purchase
history, user interaction logs, and so on.
Usage of MapReduce in industry

 Electronic Commerce
 It is also used to build a sentiment profile of users for a particular product by analyzing comments and reviews, or to analyze search logs to identify which items are most popular in searches and which products are missing.
 Many Internet service providers use MapReduce to analyze
site records and understand site visits, engagement,
locations, mobile devices, and browsers.
Usage of MapReduce in industry

 Fraud Detection
 Hadoop and MapReduce are used in the financial industry, by companies such as banks, insurance providers, and payment services, for fraud detection, trend identification, and business metrics derived from transaction analysis.
 Banks analyze credit card data and the related expenses to categorize those expenses and make recommendations for different offers, based on anonymized purchasing behavior.
Usage of MapReduce in industry

 Search and Advertisement


 It can be to used to analyze and understand search
behavior, trends, and missing results for specific keywords.
 Google and Yahoo use MapReduce to understand users’
behavior, such as popular searches over a period of an
event such as presidential elections.
 Google AdWords uses MapReduce to understand the
impressions of ads served, click-through rates, and
engagement behavior of users.
Usage of MapReduce in industry

 Data Warehouse
 We can utilize MapReduce to analyze large data volumes in
data warehouses while implementing specific business logic
for data insights.
