Apache Hadoop
DFS and Map Reduce
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015
Apache Hadoop: DFS and Map Reduce. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Who has not heard about Hadoop?
Who knows exactly what Hadoop is?
What is Apache Hadoop?
➢ Being simplistic: DFS, Map, Reduce
A bit of history: Distributed File System (DFS)
➢ Google publishes a paper about the Google File System (GFS) in 2003.
http://research.google.com/archive/gfs.html
➢ Data is distributed among a cluster of computers
➢ Fault tolerant
➢ Highly scalable with commodity hardware
A bit of history: Map Reduce (MR)
➢ Google publishes a paper about MapReduce (MR) in 2004.
http://research.google.com/archive/mapreduce.html
➢ Algorithm for processing distributed data in parallel
➢ Simple in concept, extremely useful in practice
A bit of history: Hadoop is born
➢ Doug Cutting and Mike Cafarella → Apache Nutch
➢ Doug Cutting goes to Yahoo
➢ Yahoo implements Apache Hadoop
Apache Hadoop now
➢ Framework for distributed computing
➢ Still based on DFS and MR
➢ It is the main actor in Big Data
➢ Last major release: Apache Hadoop 2.6.0 (Nov 2014)
http://hadoop.apache.org/
DFS architecture
Interacting with Hadoop DFS: creating directories
➢ Examples:
hdfs dfs -mkdir data
hdfs dfs -mkdir results
Interacting with Hadoop DFS: uploading files
➢ Examples:
hdfs dfs -put datasets/students.tsv data/students.tsv
hdfs dfs -put datasets/grades.tsv data/grades.tsv
Interacting with Hadoop DFS: listing
➢ Examples:
hdfs dfs -ls data
Found 2 items
-rw-r--r-- 3 sanguix supergroup 450 2015-02-09 10:50 data/grades.tsv
-rw-r--r-- 3 sanguix supergroup 194 2015-02-09 10:45 data/students.tsv
Interacting with Hadoop DFS: getting a file
➢ Examples:
hdfs dfs -get data/students.tsv
hdfs dfs -get data/grades.tsv
Interacting with Hadoop DFS: deleting files
➢ Examples:
hdfs dfs -rm data/students.tsv
hdfs dfs -rm data/grades.tsv
Interacting with Hadoop DFS: space usage info
➢ Examples:
hdfs dfs -df -h
Filesystem Size Used Available Use%
hdfs://localhost 1.5 T 12 K 491.6 G 0%
Map Reduce: Overview
[Diagram: three parallel pipelines. Input data → chunk of data → Map task → (key,value) → Reduce task → value’ → Output data]
Map: Transform data to (key, value)
[Diagram: Input data → chunk of data → Map task]
Shuffle: Send (key, values)
[Diagram: Map task → (key,value) → Reduce task]
Reduce: Aggregating (key,values)
[Diagram: Reduce task → value’ → Output data]
Map Reduce
[Diagram: the full pipeline again. Input data → chunk of data → Map task → (key,value) → Reduce task → value’ → Output data]
Map Reduce example: word count
CHUNK 1: this class is about big data and artificial intelligence
CHUNK 2: there is nothing big about this example
CHUNK 3: I am a big artificial intelligence enthusiast
➢ The file is divided into chunks to be processed in parallel
➢ Data is sent untransformed to map nodes
Map Reduce example: word count
MAP TASK
Raw chunk: this class is about big data and artificial intelligence
Tokenize: [this, class, is, about, big, data, and, artificial, intelligence]
Prepare (key,value) pairs: (this,1), (class,1), (is,1), (about,1), (big,1), (data,1), (and,1), (artificial,1), (intelligence,1)
Ready to shuffle
Map Reduce example: word count
REDUCE TASK
From shuffle: (big,1), (big,1), (big,1)
Sum
Output: (big,3)
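The word count walk-through above can be sketched in plain Java, without Hadoop, to make the three phases concrete. This is only an illustrative, single-process sketch: the class name WordCountSketch and its methods are hypothetical, and a real job runs the map and reduce steps on different nodes.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-process sketch of map, shuffle and reduce (no Hadoop involved).
class WordCountSketch {

    // Map step: tokenize one chunk and emit a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(chunk);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle step: group the values emitted by all map tasks by key.
    static Map<String, List<Integer>> shuffle(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            for (Map.Entry<String, Integer> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        return grouped;
    }

    // Reduce step: sum all the values received for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three steps in sequence over the given chunks.
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = new ArrayList<>();
        for (String chunk : chunks) {
            mapOutputs.add(map(chunk));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(mapOutputs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getKey(), entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> chunks = Arrays.asList(
                "this class is about big data and artificial intelligence",
                "there is nothing big about this example",
                "I am a big artificial intelligence enthusiast");
        System.out.println("big -> " + wordCount(chunks).get("big")); // prints "big -> 3"
    }
}
```

Each chunk contributes one (big,1) pair, so the reducer for the key "big" receives three values and outputs (big,3), as on the slide.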
Exercise: Matrix power
row column value
1 1 3.2
2 3 4.3
3 3 5.1
1 3 0.1
Map Reduce variants: No reduce
[Diagram: Input data → chunk of data → Map task → (key,value) → Output data]
Map Reduce variants: chaining
[Diagram: Input data → Map task → Reduce task → Output data → Map task → Reduce task → Output data]
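As a rough illustration of chaining, the sketch below runs two map/reduce rounds in plain Java, where the output of the first round is the input of the second. The class ChainingSketch, its method names, and the choice of jobs (count words, then group words by their count) are assumptions made for this example, not something taken from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of two chained map/reduce rounds.
class ChainingSketch {

    // Job 1: map each line to (word, 1) pairs, then reduce by summing per word.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Job 2: map each (word, count) to (count, word), then reduce by collecting the words per count.
    static Map<Integer, List<String>> groupByCount(Map<String, Integer> counts) {
        Map<Integer, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            grouped.computeIfAbsent(entry.getValue(), k -> new ArrayList<>()).add(entry.getKey());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("big data big ideas", "big data");
        // The output of job 1 is fed directly into job 2.
        System.out.println(groupByCount(countWords(lines))); // prints {1=[ideas], 2=[data], 3=[big]}
    }
}
```

In Hadoop the intermediate result would typically be written to the DFS by the first job and read back by the second one; here it is simply passed in memory.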
Map Reduce: bottlenecks
➢ Maps are executed in parallel
➢ Reducers do not start reducing until all maps are finished (they may start fetching map output earlier)
➢ Output is not finished until all reducers are finished
➢ Bottleneck: Unbalanced map/reduce tasks
○ Change key distribution
○ Increase the number of reducers to increase parallelism
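To see why an unbalanced key distribution becomes a bottleneck, the following sketch assigns keys to reducers by hashing. The formula matches the one used by Hadoop's default HashPartitioner, but it is applied here to String.hashCode() rather than to Hadoop's Text keys. The class SkewSketch and the "salting" helper are hypothetical names used only to illustrate the "change key distribution" idea.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of how keys are spread over reducers.
class SkewSketch {

    // Reducer index for a key: non-negative hash modulo the number of reducers.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many (key, value) pairs each reducer would receive.
    static int[] load(List<String> keys, int numReducers) {
        int[] load = new int[numReducers];
        for (String key : keys) {
            load[partition(key, numReducers)]++;
        }
        return load;
    }

    // Assumed "salting" trick: a rotating suffix turns one hot key into several keys.
    // A later step must merge the partial results of big#0, big#1, ...
    static List<String> salt(List<String> keys, int buckets) {
        List<String> salted = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            salted.add(keys.get(i) + "#" + (i % buckets));
        }
        return salted;
    }

    public static void main(String[] args) {
        List<String> hotKeys = Collections.nCopies(6, "big");
        // All identical keys land on the same reducer, however many reducers there are.
        System.out.println(Arrays.toString(load(hotKeys, 3)));
        // Salted keys may land on different reducers, depending on their hash values.
        System.out.println(Arrays.toString(load(salt(hotKeys, 3), 3)));
    }
}
```

Adding reducers only helps when there are enough distinct keys to spread; a single hot key still ends up on one reducer unless the key distribution itself is changed.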
Map Reduce in Hadoop
➢ Hadoop is implemented in Java
➢ It is possible to program jobs formed by maps and reduces in Java
➢ We won’t go deep into these matters (bear with me!)
Hadoop architecture
http://hadoop.apache.org/
Map Reduce job in Hadoop
// Package matches the class name used when submitting the job.
package es.upv.dsic.iarfid.haia;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emits (word, 1) for every token of the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sums all the counts received for a word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Job setup: args[0] is the input path, args[1] the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as combiner: partial sums are computed on the map side.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Compiling and submitting an MR job
➢ Compiling
javac -cp opt/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar:opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0.jar \
  -d WordCount source/hadoop/WordCount.java
jar -cvf WordCount.jar -C WordCount/ .
➢ Submitting
hadoop jar WordCount.jar es.upv.dsic.iarfid.haia.WordCount \
  /user/your_username/data/students.tsv /user/your_username/wc
Hadoop ecosystem
Extra information
➢ http://hadoop.apache.org
➢ Hadoop in Practice. Alex Holmes. Ed. Manning Publications
➢ Hadoop: The Definitive Guide. Tom White. Ed. O’Reilly.
➢ StackOverflow
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Stack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptxStack_and_Queue_Presentation_Final (1).pptx
Stack_and_Queue_Presentation_Final (1).pptx
binduraniha86
 
Developing Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response ApplicationsDeveloping Security Orchestration, Automation, and Response Applications
Developing Security Orchestration, Automation, and Response Applications
VICTOR MAESTRE RAMIREZ
 
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your CompetitorsAI Competitor Analysis: How to Monitor and Outperform Your Competitors
AI Competitor Analysis: How to Monitor and Outperform Your Competitors
Contify
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 
Induction Program of MTAB online session
Induction Program of MTAB online sessionInduction Program of MTAB online session
Induction Program of MTAB online session
LOHITH886892
 

Apache Hadoop: DFS and Map Reduce

  • 1. Apache Hadoop DFS and Map Reduce Víctor Sánchez Anguix Universitat Politècnica de València MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image Course 2014/2015
  • 2. Who has not heard about Hadoop?
  • 4. Who knows exactly what Hadoop is?
  • 5. What is Apache Hadoop? Being simplistic: DFS + Map Reduce
  • 6. A bit of history: Distributed File System (DFS)
      ➢ Google publishes its paper on GFS (2003): http://research.google.com/archive/gfs.html
      ➢ Data is distributed across a cluster of computers
      ➢ Fault tolerant
      ➢ Highly scalable with commodity hardware
  • 7. A bit of history: Map Reduce (MR)
      ➢ Google publishes its paper on MR (2004): http://research.google.com/archive/mapreduce.html
      ➢ Algorithm for processing distributed data in parallel
      ➢ Simple in concept, extremely useful in practice
  • 8. A bit of history: Hadoop is born
      ➢ Doug Cutting and Mike Cafarella → Apache Nutch
      ➢ Doug Cutting goes to Yahoo
      ➢ Yahoo implements Apache Hadoop
  • 9. Apache Hadoop now
      ➢ Framework for distributed computing
      ➢ Still based on DFS and MR
      ➢ It is the main actor in Big Data
      ➢ Latest major release at the time of this course: Apache Hadoop 2.6.0 (Nov 2014)
      http://hadoop.apache.org/
  • 10. DFS architecture
  • 11. Interacting with Hadoop DFS: creating dirs
      ➢ Examples:
          hdfs dfs -mkdir data
          hdfs dfs -mkdir results
  • 12. Interacting with Hadoop DFS: uploading files
      ➢ Examples:
          hdfs dfs -put datasets/students.tsv data/students.tsv
          hdfs dfs -put datasets/grades.tsv data/grades.tsv
  • 13. Interacting with Hadoop DFS: listing
      ➢ Examples:
          hdfs dfs -ls data
          Found 2 items
          -rw-r--r-- 3 sanguix supergroup 450 2015-02-09 10:50 data/grades.tsv
          -rw-r--r-- 3 sanguix supergroup 194 2015-02-09 10:45 data/students.tsv
  • 14. Interacting with Hadoop DFS: get a file
      ➢ Examples:
          hdfs dfs -get data/students.tsv
          hdfs dfs -get data/grades.tsv
  • 15. Interacting with Hadoop DFS: deleting files
      ➢ Examples:
          hdfs dfs -rm data/students.tsv
          hdfs dfs -rm data/grades.tsv
  • 16. Interacting with Hadoop DFS: space use info
      ➢ Examples:
          hdfs dfs -df -h
          Filesystem        Size   Used  Available  Use%
          hdfs://localhost  1.5 T  12 K  491.6 G    0%
  • 17. Map Reduce: Overview [Diagram: input data → chunk of data → map tasks → (key,value) → reduce tasks → value’ → output data]
  • 18. Map: Transform data to (key, value) [Diagram: input data → chunk of data → map tasks]
  • 19. Shuffle: Send (key, values) [Diagram: map tasks → (key,value) → reduce tasks]
  • 20. Reduce: Aggregating (key, values) [Diagram: reduce tasks → value’ → output data]
  • 21. Map Reduce [Diagram: input data → chunk of data → map tasks → (key,value) → reduce tasks → value’ → output data]
  • 22. Map Reduce example: word count
      CHUNK 1: this class is about big data and artificial intelligence
      CHUNK 2: there is nothing big about this example
      CHUNK 3: I am a big artificial intelligence enthusiast
      ➢ The file is divided into chunks to be processed in parallel
      ➢ Data is sent untransformed to map nodes
  • 23. Map Reduce example: word count (MAP TASK)
      Raw chunk: this class is about big data and artificial intelligence
      Tokenize: [this, class, is, about, big, data, and, artificial, intelligence]
      Prepare (key,value) pairs: (this,1), (class,1), (is,1), (about,1), (big,1), (data,1), (and,1), (artificial,1), (intelligence,1)
      Ready to shuffle
  • 24. Map Reduce example: word count (REDUCE TASK)
      From shuffle: (big,1) (big,1) (big,1)
      Sum
      Output: (big,3)
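The word count flow on slides 22 to 24 can be simulated without a cluster. The sketch below is plain Java (it assumes Java 9 or later for `List.of` and `Map.entry`) and does not use Hadoop at all; the class `WordCountSketch` and its method names are hypothetical, chosen only to mirror the map, shuffle and reduce steps of the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the word count flow (no Hadoop involved).
// All names here are hypothetical and exist only for this sketch.
final class WordCountSketch {

    // Map: tokenize one chunk and emit a (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : chunk.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group all emitted values by key, as happens between the two phases.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases over all chunks and returns word -> count.
    static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String chunk : chunks) {
            pairs.addAll(map(chunk));
        }
        Map<String, Integer> counts = new TreeMap<>();
        shuffle(pairs).forEach((word, ones) -> counts.put(word, reduce(ones)));
        return counts;
    }

    public static void main(String[] args) {
        List<String> chunks = List.of(
            "this class is about big data and artificial intelligence",
            "there is nothing big about this example",
            "I am a big artificial intelligence enthusiast");
        System.out.println(wordCount(chunks));
    }
}
```

In a real job the three chunks would be processed by different map tasks on different nodes; here they run sequentially, which changes the timing but not the result.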
  • 25. Exercise: Matrix power
      row  column  value
      1    1       3.2
      2    3       4.3
      3    3       5.1
      1    3       0.1
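The slide does not spell out the exercise statement, so the following is only one possible approach, assuming the goal is to square a sparse matrix stored as (row, column, value) triples. It is a plain-Java sketch with hypothetical names (`MatrixSquareSketch`, `Cell`, `joinJob`, `sumJob`) that mimics two chained jobs: the first joins entries on the shared inner index and emits partial products, the second sums the partial products per output cell.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: squaring a sparse matrix with two chained map/reduce-style steps.
final class MatrixSquareSketch {

    // One non-zero cell of the sparse matrix.
    static final class Cell {
        final int row;
        final int col;
        final double value;

        Cell(int row, int col, double value) {
            this.row = row;
            this.col = col;
            this.value = value;
        }
    }

    // Job 1: join entries on the inner index j and emit partial products M[i][j] * M[j][k].
    static List<Cell> joinJob(List<Cell> matrix) {
        // Map + shuffle: key each entry by its column (left factor) and by its row (right factor).
        Map<Integer, List<Cell>> left = new TreeMap<>();
        Map<Integer, List<Cell>> right = new TreeMap<>();
        for (Cell c : matrix) {
            left.computeIfAbsent(c.col, k -> new ArrayList<>()).add(c);
            right.computeIfAbsent(c.row, k -> new ArrayList<>()).add(c);
        }
        // Reduce: for each j, pair every left factor with every right factor.
        List<Cell> partials = new ArrayList<>();
        for (Map.Entry<Integer, List<Cell>> group : left.entrySet()) {
            for (Cell a : group.getValue()) {
                for (Cell b : right.getOrDefault(group.getKey(), List.of())) {
                    partials.add(new Cell(a.row, b.col, a.value * b.value));
                }
            }
        }
        return partials;
    }

    // Job 2: sum the partial products that share the same "row,column" key.
    static Map<String, Double> sumJob(List<Cell> partials) {
        Map<String, Double> result = new TreeMap<>();
        for (Cell p : partials) {
            result.merge(p.row + "," + p.col, p.value, Double::sum);
        }
        return result;
    }

    // Chains both jobs: the output of the first is the input of the second.
    static Map<String, Double> square(List<Cell> matrix) {
        return sumJob(joinJob(matrix));
    }

    public static void main(String[] args) {
        List<Cell> m = List.of(
            new Cell(1, 1, 3.2), new Cell(2, 3, 4.3),
            new Cell(3, 3, 5.1), new Cell(1, 3, 0.1));
        System.out.println(square(m));
    }
}
```

Higher powers could be obtained by feeding the result back in as the left factor of another join, at the cost of one more pair of jobs per multiplication.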
  • 26. Map Reduce variants: No reduce [Diagram: input data → chunk of data → map tasks → (key,value) → output data]
  • 27. Map Reduce variants: chaining [Diagram: input data → map tasks → reduce tasks → output data → map tasks → reduce tasks → output data]
  • 28. Map Reduce: bottlenecks
      ➢ Maps are executed in parallel
      ➢ Reducers do not start reducing until all maps are finished
      ➢ Output is not finished until all reducers are finished
      ➢ Bottleneck: unbalanced map/reduce tasks
          ○ Change key distribution
          ○ Increase the number of reduces to increase parallelism
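To make the "unbalanced tasks" bottleneck concrete, the sketch below counts how many (key, value) pairs each reducer would receive. The `partition` method uses the same formula as Hadoop's default `HashPartitioner`; everything else (`PartitionSkewSketch`, `load`, and the `salt` helper that appends a suffix to a hot key) is a hypothetical illustration of "change key distribution", not something taken from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of reducer load under hash partitioning.
final class PartitionSkewSketch {

    // Non-negative hash modulo the number of reducers (HashPartitioner's formula).
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Counts how many pairs land on each reducer for the given list of keys.
    static int[] load(List<String> keys, int numReducers) {
        int[] perReducer = new int[numReducers];
        for (String key : keys) {
            perReducer[partition(key, numReducers)]++;
        }
        return perReducer;
    }

    // Changes the key distribution: a hot key is split into `buckets` salted keys.
    // The per-bucket results then need a second, much smaller aggregation step.
    static String salt(String key, int recordIndex, int buckets) {
        return key + "#" + (recordIndex % buckets);
    }

    public static void main(String[] args) {
        int reducers = 4;
        List<String> hot = new ArrayList<>();
        List<String> salted = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            hot.add("big");                       // every record shares one key
            salted.add(salt("big", i, reducers)); // same records, salted keys
        }
        System.out.println("unsalted: " + Arrays.toString(load(hot, reducers)));
        System.out.println("salted:   " + Arrays.toString(load(salted, reducers)));
    }
}
```

With a single hot key, adding reducers does not help because all pairs still hash to the same reducer; salting trades that imbalance for an extra merge of the partial results.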
  • 29. Map Reduce in Hadoop
      ➢ Hadoop is implemented in Java
      ➢ Jobs made up of maps and reduces can be programmed in Java
      ➢ We won’t go deep into these matters (bear with me!)
  • 30. Hadoop architecture
      http://hadoop.apache.org/
  • 31. Map Reduce job in Hadoop
          import java.io.IOException;
          import java.util.StringTokenizer;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.io.IntWritable;
          import org.apache.hadoop.io.Text;
          import org.apache.hadoop.mapreduce.Job;
          import org.apache.hadoop.mapreduce.Mapper;
          import org.apache.hadoop.mapreduce.Reducer;
          import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
          import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

          public class WordCount {

            // Map: emits (word, 1) for every token in the input line.
            public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
              private final static IntWritable one = new IntWritable(1);
              private Text word = new Text();

              public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, one);
                }
              }
            }

            // Reduce: sums all the counts received for one word.
            public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
              private IntWritable result = new IntWritable();

              public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                  sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
              }
            }
            ...
  • 32. Map Reduce job in Hadoop (continued)
            // Driver: configures the job and submits it; args[0] is the input path, args[1] the output path.
            public static void main(String[] args) throws Exception {
              Configuration conf = new Configuration();
              Job job = Job.getInstance(conf, "word count");
              job.setJarByClass(WordCount.class);
              job.setMapperClass(TokenizerMapper.class);
              job.setCombinerClass(IntSumReducer.class);
              job.setReducerClass(IntSumReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));
              System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
          }
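The driver registers `IntSumReducer` both as reducer and as combiner (`job.setCombinerClass`). The slides do not explain the combiner, so the following plain-Java sketch (Java 9 or later, no Hadoop, hypothetical names `CombinerSketch`, `combine`, `reduceAll`) illustrates the idea under the assumption that the reduce operation is a sum: pre-summing the output of each map task locally shrinks what has to be shuffled, while the final counts stay the same because addition is associative and commutative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of what a sum combiner does to the output of one map task.
final class CombinerSketch {

    // Map output for one chunk: one (word, 1) pair per token.
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : chunk.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // Combiner: sums the pairs of a single map task before they are shuffled.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> local = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            local.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return new ArrayList<>(local.entrySet());
    }

    // Reduce: sums whatever reaches the reducers, combined or not.
    static Map<String, Integer> reduceAll(List<Map.Entry<String, Integer>> shuffled) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : shuffled) {
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> raw = map("big data big big");
        List<Map.Entry<String, Integer>> combined = combine(raw);
        System.out.println("pairs shuffled without combiner: " + raw.size());
        System.out.println("pairs shuffled with combiner:    " + combined.size());
        System.out.println("same totals: " + reduceAll(raw).equals(reduceAll(combined)));
    }
}
```

A reducer can only double as a combiner when its operation tolerates being applied to partial groups; a sum does, whereas something like a plain average would not without carrying extra state.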
  • 33. Compiling and submitting a MR job
      ➢ Compiling
          javac -cp opt/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar:opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.6.0.jar -d WordCount source/hadoop/WordCount.java
          jar -cvf WordCount.jar -C WordCount/ .
      ➢ Submitting
          hadoop jar WordCount.jar es.upv.dsic.iarfid.haia.WordCount /user/your_username/data/students.tsv /user/your_username/wc
  • 34. Hadoop ecosystem
  • 35. Extra information
      ➢ http://hadoop.apache.org
      ➢ Hadoop in Practice. Alex Holmes. Manning Publications.
      ➢ Hadoop: The Definitive Guide. Tom White. O’Reilly.
      ➢ StackOverflow