Big Data Hadoop Insight
Data Science Analytics & Research Centre
9/20/2014
Big Data
HDFS
Hadoop
Hadoop Overview
Inputs & Outputs
Data Types
What is MapReduce (MR)
Example
Functionalities of MR
Speculative Execution
Hadoop Streaming
Hadoop Job Scheduling
[Figure: Enterprise data landscape. Sources (Core ERP & legacy applications, the data warehouse, unstructured web/telemetry data, and Big Data platforms such as Hadoop) feed consumption layers ranging from detailed events/facts used for predictive analytics, through aggregated analytic marts & cubes, up to highly summarized visualization & dashboards. Latency ranges from real time and near real time through hourly, daily, weekly, monthly, quarterly, and yearly, with retention horizons of 3, 5, and 10 years.]
[Figure: Data volume versus time horizon, from gigabytes (GB) at real-time and daily scale, to terabytes (TB) at monthly scale, to petabytes (PB) at yearly scale.]
Financial Services: detect fraud; personalize banking and insurance products.
Healthcare: personalized medicine.
Retail: in-store behavior analysis; cross-selling; optimize pricing, placement, and design; optimize inventory and distribution.
Government: location-based marketing; reduce fraud; social segmentation; sentiment analysis.
Manufacturing: design to value; crowd-sourcing; digital factory for lean manufacturing; improve service via product sensor data.
[Figure: A file split into HDFS blocks of 128 MB, 128 MB, and 36 MB.]
[Figure: HDFS file write and block replication. A client issues Create and, on success, Complete for File 1 via the Namenode; the file's blocks B1, B2, and B3 are each replicated on datanodes n1-n4 spread across Rack 1, Rack 2, and Rack 3.]
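To make the write path concrete, here is a minimal client-side sketch (not taken from the slides; the path and replication factor are illustrative) using the Hadoop FileSystem API that drives the Create/Complete flow shown above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();              // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);                   // client handle backed by the Namenode
    Path file = new Path("/user/demo/file1.txt");           // hypothetical path
    FSDataOutputStream out = fs.create(file, (short) 3);    // "Create": Namenode allocates the file, replication 3
    out.writeUTF("hello hdfs");                             // data is pipelined to datanodes, block by block
    out.close();                                            // "Complete": the Namenode finalizes the file's blocks
  }
}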
Command, usage, and syntax:
cat: Copies source paths to stdout. Syntax: hadoop dfs -cat URI [URI ...]
chgrp: Changes the group association of files; with -R, the change is made recursively. Syntax: hadoop dfs -chgrp [-R] GROUP URI [URI ...]
chmod: Changes the permissions of files; with -R, the change is made recursively through the directory structure. Syntax: hadoop dfs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]
chown: Changes the owner of files; with -R, the change is made recursively. Syntax: hadoop dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ...]
copyFromLocal: Similar to put, except that the source is restricted to a local file. Syntax: hadoop dfs -copyFromLocal <localsrc> URI
copyToLocal: Similar to get, except that the destination is restricted to a local file. Syntax: hadoop dfs -copyToLocal URI <localdst>
cp: Copies files from source to destination. Syntax: hadoop dfs -cp URI [URI ...] <dest>
du: Displays the size of a file, or the aggregate size of the files in a directory. Syntax: hadoop dfs -du URI [URI ...]
dus: Displays a summary of file sizes. Syntax: hadoop dfs -dus <args>
expunge: Empties the trash. Syntax: hadoop dfs -expunge
get: Copies files to the local file system. Syntax: hadoop dfs -get <src> <localdst>
getmerge: Concatenates the files in a source directory into a single local destination file. Syntax: hadoop dfs -getmerge <src> <localdst> [addnl]
ls (or) lsr: Lists files and directories; lsr is the recursive form. Syntax: hadoop dfs -ls <args> / hadoop dfs -lsr <args>
Command, usage, and syntax (continued):
mkdir: Creates directories for the given path URIs. Syntax: hadoop dfs -mkdir <paths>
moveFromLocal: Similar to put, except that the local source is deleted after it is copied. Syntax: hadoop dfs -moveFromLocal <localsrc> <dst>
mv: Moves files from source to destination. Syntax: hadoop dfs -mv URI [URI ...] <dest>
setrep: Changes the replication factor of a file; with -R, applies recursively to a directory tree. Syntax: hadoop dfs -setrep [-R] <rep> <path>
stat: Returns stat information on the path. Syntax: hadoop dfs -stat URI [URI ...]
tail: Displays the last kilobyte of the file to stdout; -f follows the file. Syntax: hadoop dfs -tail [-f] URI
text: Outputs a source file in text format. Syntax: hadoop dfs -text <src>
touchz: Creates a file of zero length. Syntax: hadoop dfs -touchz URI [URI ...]
put: Copies one or more sources from the local file system to the destination filesystem. Syntax: hadoop dfs -put <localsrc> ... <dst>
rm (or) rmr: Deletes files; rmr is the recursive form. Syntax: hadoop dfs -rm URI [URI ...] / hadoop dfs -rmr URI [URI ...]
test: Tests a path: -e (exists), -z (zero length), -d (is a directory). Syntax: hadoop dfs -test -[ezd] URI
Low-latency data access: HDFS is not optimized for low-latency access; it trades latency for high data throughput.
Lots of small files: with a default block size of 64 MB, every file, directory, and block is tracked in the namenode's memory, so large numbers of small files sharply increase the namenode's memory requirements.
Multiple writers and arbitrary modifications: HDFS files are written by a single writer, and writes are only appended at the end of the file; there is no support for multiple writers or arbitrary in-place modification.
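As a rough, illustrative calculation (using the commonly quoted rule of thumb of about 150 bytes of namenode heap per file, directory, or block object): 10 million small files, each fitting in a single block, represent roughly 20 million namenode objects, on the order of 3 GB of namenode memory, no matter how little data the files actually hold.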
Hadoop Overview
Inputs & Outputs
Data Types
What is MR
Example
Functionalities of MR
Speculative Execution
How Hadoop runs MR
Hadoop Streaming
Hadoop Job Scheduling
Hadoop is an open-source framework for distributed computing that provides a simple MapReduce programming interface and its own distributed filesystem, HDFS. It facilitates scalability and takes care of detecting and handling failures.
Hadoop release series: 1.0.X, 1.1.X, 2.X.X
Risk Modeling: How businesses can better understand and quantify risk.
Recommendation Engine: How to predict customer preferences.
Ad Targeting:
Threat Analysis:
Trade Surveillance: Helping a business spot the rogue trader.
Search Quality: Delivering more relevant search results to customers.
Sorts the outputs of the maps, which are then input to the reduce tasks.
Takes care of scheduling tasks, monitoring them, and re-executing the failed tasks.
The MapReduce framework operates exclusively on <key, value> pairs: the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
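A minimal sketch (not taken from the slides) of a hypothetical custom key type that meets both requirements, implementing WritableComparable so the framework can both serialize it and sort it:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: (customerId, timestamp), sorted by customerId, then timestamp.
public class EventKey implements WritableComparable<EventKey> {
  private long customerId;
  private long timestamp;

  public void write(DataOutput out) throws IOException {    // serialization
    out.writeLong(customerId);
    out.writeLong(timestamp);
  }

  public void readFields(DataInput in) throws IOException { // deserialization
    customerId = in.readLong();
    timestamp = in.readLong();
  }

  public int compareTo(EventKey other) {                    // sort order used during the shuffle
    int c = Long.compare(customerId, other.customerId);
    return c != 0 ? c : Long.compare(timestamp, other.timestamp);
  }
  // hashCode() should also be overridden so the default HashPartitioner spreads keys evenly.
}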
Serialization is the process of turning structured objects into a byte stream for transmission over
a network or for writing to persistent storage.
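As an illustration (a sketch, not from the slides), a Hadoop Writable can be turned into bytes and back with plain Java data streams:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;

public class SerializationSketch {
  public static void main(String[] args) throws IOException {
    IntWritable before = new IntWritable(163);

    // Serialize: structured object -> byte stream
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    before.write(new DataOutputStream(bytes));

    // Deserialize: byte stream -> structured object
    IntWritable after = new IntWritable();
    after.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

    System.out.println(after.get());   // prints 163
  }
}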
8. BytesWritable
9. NullWritable
10. MD5Hash
11. ObjectWritable
12. GenericWritable
Apart from the above, there are four Writable collection types:
1. ArrayWritable
2. TwoDArrayWritable
3. MapWritable
4. SortedMapWritable
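For example (a sketch, not from the slides), MapWritable carries a heterogeneous map of Writable keys and values, so an entire record can be emitted as a single map or reduce value:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;

public class MapWritableSketch {
  public static void main(String[] args) {
    MapWritable record = new MapWritable();
    record.put(new Text("pageViews"), new IntWritable(42));
    record.put(new Text("country"), new Text("IN"));
    // MapWritable itself implements Writable, so 'record' can be used as a map/reduce value.
    System.out.println(((IntWritable) record.get(new Text("pageViews"))).get()); // 42
  }
}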
[Figure: MapReduce data flow. The input data format turns the input data into <K1, V1> pairs for the MapperClass; the Mapper emits <K2, V2> pairs; the framework groups them into <K2, List(V2)> for the ReducerClass; the Reducer emits the final <K3, V3> pairs.]
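As an illustration (a sketch using the old org.apache.hadoop.mapred API; the concrete classes are assumptions matching a word-count-style job), the figure's type parameters correspond to the classes configured on a job:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class TypeFlowSketch {
  public static JobConf configure(JobConf conf) {
    conf.setInputFormat(TextInputFormat.class);      // input format produces <K1, V1> = <LongWritable, Text>
    conf.setMapOutputKeyClass(Text.class);           // K2: mapper output key
    conf.setMapOutputValueClass(IntWritable.class);  // V2: mapper output value
    conf.setOutputKeyClass(Text.class);              // K3: reducer output key
    conf.setOutputValueClass(IntWritable.class);     // V3: reducer output value
    return conf;
  }
}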
Mapper implementation (Lines: 18 - 25)
The first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>

Combiner implementation (Line: 46)
Output of the first map after the combiner:
< Bye, 1>
< Hello, 1>
< World, 2>

Reducer implementation (Lines: 29 - 35)
Output of the job:
< Bye, 1>
< Goodbye, 1>
< Hadoop, 2>
< Hello, 2>
< World, 2>
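The slide's line references point at the classic WordCount v1.0 listing from the Hadoop MapReduce tutorial, which is not reproduced in this extract. A condensed sketch of that mapper/reducer pair (old mapred API; the same Reduce class is reused as the combiner) is shown below; in the tutorial the job reads two input files, "Hello World Bye World" and "Hello Hadoop Goodbye Hadoop", which is why Goodbye and Hadoop appear in the final output.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {

  // Mapper: emits <word, 1> for every token in the input line.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      StringTokenizer tokenizer = new StringTokenizer(value.toString());
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);   // the combiner (the slide's "Line: 46") is the same Reduce class
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}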
Name | Value | Description
mapred.map.tasks.speculative.execution | true | If true, multiple instances of some map tasks may be executed in parallel.
mapred.reduce.tasks.speculative.execution | true | If true, multiple instances of some reduce tasks may be executed in parallel.
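These defaults can be overridden per job. A small sketch (assuming the old JobConf API) that turns speculative execution off, for example for a job whose tasks have side effects:

import org.apache.hadoop.mapred.JobConf;

public class DisableSpeculation {
  public static JobConf configure(JobConf conf) {
    conf.setMapSpeculativeExecution(false);      // mapred.map.tasks.speculative.execution = false
    conf.setReduceSpeculativeExecution(false);   // mapred.reduce.tasks.speculative.execution = false
    return conf;
  }
}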
Default Scheduler
Scheduling tries to balance the map and reduce load across all tasktrackers in the cluster.
Capacity Scheduler
Within a queue, jobs with higher priority get access to the queue's resources before jobs with lower priority.
To prevent one or more users from monopolizing a queue, each queue enforces a limit on the percentage of its resources that a single user may use at any given time, if there is competition for them.
Fair Scheduler
Each pool is guaranteed a minimum capacity, and excess capacity is shared among running jobs using a fairness algorithm.
The scheduler tries to ensure that, over time, all jobs receive a roughly equal share of cluster resources.
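As an illustration (a sketch using the old JobConf API; the queue name is hypothetical and must match a queue defined in the cluster's Capacity Scheduler configuration), a job can be directed to a specific queue and given a priority:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobPriority;

public class QueueAndPriority {
  public static JobConf configure(JobConf conf) {
    conf.setQueueName("analytics");          // hypothetical Capacity Scheduler queue
    conf.setJobPriority(JobPriority.HIGH);   // served before lower-priority jobs in the same queue
    return conf;
  }
}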
Thank you!!
Data Science Analytics & Research Centre