Large-Scale Data Management: CS525: Special Topics in DBs
Large-Scale Data Management
Hadoop/MapReduce Computing Paradigm
Spring 2013
WPI, Mohamed Eltabakh
Large-Scale Data Analytics

MapReduce computing paradigm (e.g., Hadoop) vs. traditional database systems

Points of comparison:
- Scalability (petabytes of data, thousands of machines)
- Performance (tons of indexing, tuning, and data organization techniques)
- Features such as provenance tracking and annotation management
What is Hadoop
Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
- Large datasets: terabytes or petabytes of data
- Large clusters: hundreds or thousands of nodes
What is Hadoop (Cont'd)
The Hadoop framework consists of two main layers:
- Distributed file system (HDFS)
- Execution engine (MapReduce)
Hadoop Master/Slave Architecture
Hadoop is designed as a master-slave shared-nothing architecture
Design Principles of Hadoop
- Need to process big data
- Need to parallelize computation across thousands of nodes
- Commodity hardware: a large number of low-end, cheap machines working in parallel to solve a computing problem
Design Principles of Hadoop (Cont'd)
- Automatic parallelization & distribution
- Hidden from the end-user
Hadoop Architecture
- Distributed file system (HDFS)
- Execution engine (MapReduce)
Centralized namenode
- Maintains metadata info about files
(Diagram: a file F is stored as blocks distributed across the datanodes)
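To make the namenode's metadata role concrete, here is a minimal sketch (in Java, against the standard org.apache.hadoop.fs.FileSystem API) that asks HDFS which datanodes hold the blocks of a file; the path and class name are illustrative, not part of the lecture.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);            // client talks to the namenode
    Path file = new Path("/user/demo/fileF.txt");    // hypothetical file F

    // The namenode's metadata: which datanodes hold each block of the file
    FileStatus status = fs.getFileStatus(file);
    for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("block at offset " + b.getOffset()
          + ", length " + b.getLength()
          + ", hosts " + Arrays.toString(b.getHosts()));
    }
  }
}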
Main Properties of
HDFS
Large: A HDFS instance may consist of thousands of
server machines, each storing part of the file systems data
Replication: Each data block is replicated many times
(default is 3)
Failure: Failure is the norm rather than exception
Fault Tolerance: Detection of faults and quick, automatic
recovery from them is a core architectural goal of HDFS
Namenode is consistently checking Datanodes
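As a small illustration of the replication property, the sketch below writes a file to HDFS with an explicit replication factor of 3 (the default); the file name and class name are made up for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "3");                // each block replicated 3 times (the default)

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/hello.txt");    // hypothetical output path

    // Write a small file; HDFS splits larger files into blocks and replicates each block
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello HDFS");
    }
  }
}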
(Diagram: MapReduce dataflow. Each Map task parses its input and produces (k, v) pairs, e.g., (word, 1), hashing each key to decide its destination reducer. The pairs are shuffled and sorted on k. Each Reduce task consumes (k, [v]) pairs, e.g., (word, [1,1,1,1,1,1,...]), and produces (k, v) pairs, e.g., (word, 100).)
Properties of MapReduce Engine
Job Tracker is the master node (runs with the namenode)
- Receives the user's job
- Decides how many tasks will run (number of mappers)
- Decides where to run each mapper (concept of locality)
Properties of MapReduce Engine (Cont'd)
Task Tracker is the slave node (runs on each datanode)
- Receives the task from the Job Tracker
- Runs the task until completion (either a map or a reduce task)
- Always in communication with the Job Tracker, reporting progress
Key-Value Pairs
Mappers and Reducers are the users' code (provided functions)
- They just need to obey the key-value pairs interface
Mappers:
- Consume <key, value> pairs
- Produce <key, value> pairs
Reducers:
- Consume <key, <list of values>>
- Produce <key, value>
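As a concrete sketch of this interface, here is the classic word-count mapper and reducer written against Hadoop's org.apache.hadoop.mapreduce API (the "new" API); class names are chosen for the example.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: consumes <byte offset, line of text>, produces <word, 1> pairs
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);                     // emit (word, 1)
    }
  }
}

// Reducer: consumes <word, [1,1,1,...]>, produces <word, total count>
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));        // emit (word, sum)
  }
}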
MapReduce Phases
Example 1: Word Count
(Diagram: the word count job runs as a set of Map Tasks feeding a set of Reduce Tasks)
(Diagram: word-count dataflow. Each Map task parses its input and produces (word, 1) pairs; the pairs are shuffled and sorted on the key; each Reduce task consumes (word, [1,1,1,1,1,1,...]), produces (word, count) pairs such as (word, 100), and writes its output to one file: Part0001, Part0002, Part0003.)
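A job like the word count above is tied together by a small driver program; the following is a minimal sketch using the standard org.apache.hadoop.mapreduce.Job API (Hadoop 2.x style; input/output paths come from the command line, and the three reduce tasks match the three Part files in the figure).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setNumReduceTasks(3);                                  // one output part file per reducer

    FileInputFormat.addInputPath(job, new Path(args[0]));      // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));    // output directory in HDFS

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}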
(Diagram: each Map task writes its output directly to HDFS as a separate part file: Part0001, Part0002, Part0003, Part0004.)
Hadoop vs. Traditional Database Systems

Computing model:
- Database systems: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control
- Hadoop: notion of jobs; the job is the unit of work; no concurrency control

Cost model:
- Database systems: expensive servers
- Hadoop: cheap commodity machines

Key characteristics of database systems: efficiency, optimizations, fine tuning

The comparison also covers the data model and fault tolerance.

Cloud Computing

A computing model where any computing infrastructure can run on the cloud
- Hardware & software are provided as remote services
- Elastic: grows and shrinks based on the user's demand
- Example: Amazon EC2