Learning Objectives - In this module, you will understand the Hadoop MapReduce framework and how MapReduce works on data stored in HDFS. You will also learn about the different types of input and output formats in the MapReduce framework and their usage.
This document provides an overview of Hadoop and MapReduce terminology and concepts. It describes the key components of Hadoop including HDFS, ZooKeeper, and HBase. It explains the MapReduce programming model and how data is processed through mappers, reducers, and the shuffle and sort process. Finally, it provides a word count example program and describes how to run Hadoop jobs on a cluster.
This presentation will give you information about:
1. Map/Reduce Overview and Architecture
2. Installation
3. Developing Map/Reduce Jobs
4. Input and Output Formats
5. Job Configuration
6. Job Submission
7. Practicing MapReduce Programs (at least 10 MapReduce algorithms)
8. Data Flow Sources and Destinations
9. Data Flow Transformations
10. Data Flow Paths
11. Custom Data Types
12. Input Formats
13. Output Formats
14. Partitioning Data
15. Reporting Custom Metrics
16. Distributing Auxiliary Job Data
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through HDFS and distributed processing via MapReduce. HDFS handles storage and MapReduce provides a programming model for parallel processing of large datasets across a cluster. The MapReduce framework consists of a mapper that processes input key-value pairs in parallel, and a reducer that aggregates the output of the mappers by key.
Learning Objectives - In this module, you will learn advanced MapReduce concepts such as counters, custom Writables, compression, tuning, and error handling, and how to deal with complex MapReduce programs.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
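To make this split, map, sort/shuffle, and reduce flow concrete before the Hadoop-specific code later in this document, here is a minimal, framework-free sketch of the same data flow for a word count, written in plain Java. The class name and the sample input are illustrative and not part of Hadoop.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
  public static void main(String[] args) {
    // "Input splits": each string stands in for one independent chunk of the input.
    List<String> splits = List.of("the quick brown fox", "the lazy dog");

    // Map phase: each split is processed independently, emitting (word, 1) pairs.
    List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
    for (String split : splits) {
      for (String word : split.split("\\s+")) {
        intermediate.add(Map.entry(word, 1));
      }
    }

    // Shuffle and sort: group all values that share a key (a TreeMap also sorts the keys).
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : intermediate) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }

    // Reduce phase: aggregate the grouped values for each key and emit the result.
    grouped.forEach((word, counts) ->
        System.out.println(word + "\t" + counts.stream().mapToInt(Integer::intValue).sum()));
  }
}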
This document provides an overview of MapReduce concepts including:
1. It describes the anatomy of MapReduce including the map and reduce phases, intermediate data, and final outputs.
2. It explains key MapReduce terminology like jobs, tasks, task attempts, and the roles of the master and slave nodes.
3. It discusses MapReduce data types, input formats, record readers, partitioning, sorting, and output formats.
This document discusses various concepts related to Hadoop MapReduce including combiners, speculative execution, custom counters, input formats, multiple inputs/outputs, distributed cache, and joins. It explains that a combiner acts as a mini-reducer between the map and reduce stages to reduce data shuffling. Speculative execution allows redundant tasks to improve performance. Custom counters can track specific metrics. Input formats handle input splitting and reading. Multiple inputs allow different mappers for different files. Distributed cache shares read-only files across nodes. Joins can correlate large datasets on a common key.
Vibrant Technologies is headquartered in Mumbai, India. We are a Hadoop training provider in Navi Mumbai offering live projects to students, and we also provide corporate training. According to our students and corporate clients, we offer the best Hadoop classes in Mumbai.
Processing massive amount of data with Map Reduce using Apache Hadoop (IndicThreads)
This document provides an overview of MapReduce and Hadoop. It describes the Map and Reduce functions, explaining that Map applies a function to each element of a list and Reduce reduces a list to a single value. It gives examples of Map and Reduce using employee salary data. It then discusses Hadoop and its core components HDFS for distributed storage and MapReduce for distributed processing. Key aspects covered include the NameNode, DataNodes, input/output formats, and the job launch process. It also addresses some common questions around small files, large files, and accessing SQL data from Hadoop.
Introduction to the Map-Reduce framework (PDF)
The document provides an introduction to the MapReduce programming model and framework. It describes how MapReduce is designed for processing large volumes of data in parallel by dividing work into independent tasks. Programs are written using functional programming idioms like map and reduce operations on lists. The key aspects are:
- Mappers process input records in parallel, emitting (key, value) pairs.
- A shuffle/sort phase groups values by key and delivers them to the same reducer.
- Reducers process grouped values to produce final output, aggregating as needed.
- This allows massive datasets to be processed across a cluster in a fault-tolerant way.
Hadoop Papyrus is an open source project that allows Hadoop jobs to be run using a Ruby DSL instead of Java. It reduces complex Hadoop procedures to just a few lines of Ruby code. The DSL describes the Map, Reduce, and Job details. Hadoop Papyrus invokes Ruby scripts using JRuby during the Map/Reduce processes running on the Java-based Hadoop framework. It also allows writing a single DSL script to define different processing for each phase like Map, Reduce, or job initialization.
This document discusses MapReduce and how it can be used to parallelize a word counting task over large datasets. It explains that MapReduce programs have two phases - mapping and reducing. The mapping phase takes input data and feeds each element to mappers, while the reducing phase aggregates the outputs from mappers. It also describes how Hadoop implements MapReduce by splitting files into splits, assigning splits to mappers across nodes, and using reducers to aggregate the outputs.
This is a quick introduction to Scalding and Monoids. Scalding is a Scala library that makes writing MapReduce jobs very easy. Monoids, on the other hand, promise parallelism and quality, and they make some more challenging algorithms look very easy.
The talk was held at the Helsinki Data Science meetup on January 9th 2014.
MapReduce is a programming model used for processing large datasets in a distributed manner. It involves two key steps - the map step and the reduce step. The map step processes individual records in parallel to generate intermediate key-value pairs. The reduce step merges all intermediate values associated with the same key. Hadoop is an open-source implementation of MapReduce that runs jobs across large clusters of commodity hardware.
The document discusses key concepts related to Hadoop including its components like HDFS, MapReduce, Pig, Hive, and HBase. It provides explanations of HDFS architecture and functions, how MapReduce works through map and reduce phases, and how higher-level tools like Pig and Hive allow for more simplified programming compared to raw MapReduce. The summary also mentions that HBase is a NoSQL database that provides fast random access to large datasets on Hadoop, while HCatalog provides a relational abstraction layer for HDFS data.
This document provides an overview of key concepts in Hadoop including:
- Hadoop was tested on a 4000 node cluster with 32,000 cores and 16 petabytes of storage.
- MapReduce is Hadoop's programming model and consists of mappers that process input splits in parallel, and reducers that combine the outputs of the mappers.
- The JobTracker manages jobs, TaskTrackers run tasks on slave nodes, and Tasks are individual mappers or reducers. Data is distributed to nodes implicitly based on the HDFS file distribution. Configurations are set using a JobConf object (a minimal sketch follows below).
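The last bullet above mentions that configurations are set using a JobConf object. Here is a minimal sketch of an old-API (org.apache.hadoop.mapred) driver built around a JobConf, offered only as an assumed illustration: it uses the built-in IdentityMapper and IdentityReducer rather than anything from the original document, and simply copies text records from an input path to an output path.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class IdentityJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IdentityJob.class);   // the JobConf carries all job settings
    conf.setJobName("identity-copy");
    conf.setMapperClass(IdentityMapper.class);       // pass every record through unchanged
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);      // TextInputFormat keys are byte offsets
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);                          // submit the JobConf and wait for completion
  }
}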
This document summarizes machine learning concepts in Spark. It introduces Spark, its components including SparkContext, Resilient Distributed Datasets (RDDs), and common transformations and actions. Transformations like map, filter, join, and groupByKey are covered. Actions like collect, count, reduce are also discussed. A word count example in Spark using transformations and actions is provided to illustrate how to analyze text data in Spark.
This document provides an overview of MapReduce and Hadoop. It defines MapReduce terms like map and reduce. It describes how MapReduce processes data in parallel across multiple servers and tasks. Example MapReduce applications like word count are shown. The document also discusses the Hadoop implementation of MapReduce including code examples for map and reduce functions and the driver program. Fault tolerance, locality, and slow tasks in MapReduce systems are also covered.
MapReduce (Big Data Computing, Fall 2014, Indranil Gupta)
**MapReduce: A Comprehensive Overview**
MapReduce is a programming model and associated implementation for processing and generating large datasets with a parallel, distributed algorithm on a cluster. It was pioneered by Google and has become a cornerstone of big data processing due to its scalability and fault tolerance. This 1000-word description aims to delve into the fundamental principles, architecture, workflow, and applications of MapReduce.
### Introduction to MapReduce
MapReduce is designed to handle vast amounts of data across thousands of commodity servers in a distributed computing environment. It abstracts the complexity of distributed computing, making it easier to write programs that process large-scale data sets efficiently. The model divides the computation into two main phases: the **Map** phase and the **Reduce** phase.
### The Map Phase
In the Map phase, the input data is divided into smaller chunks and processed independently by map tasks. Each map task operates on a subset of the input data and generates intermediate key-value pairs. The key-value pairs are processed by user-defined functions called **mappers**. Mappers are designed to extract and transform the data into a format suitable for further processing in the Reduce phase.
### The Reduce Phase
Following the Map phase, the Reduce phase aggregates the intermediate key-value pairs produced by the mappers. The framework groups together all intermediate values associated with the same intermediate key and passes them to user-defined functions called **reducers**. Reducers process the grouped data and produce the final output, typically aggregating results or performing some form of computation across the dataset.
### Key Components of MapReduce
1. **Master Node (JobTracker)**: Manages the assignment of map and reduce tasks to worker nodes, monitors their progress, and handles job scheduling and coordination.
2. **Worker Nodes (TaskTrackers)**: Execute map and reduce tasks assigned by the JobTracker. They report task status and data locality information back to the JobTracker.
3. **Input Data**: Divided into manageable splits, each processed independently by map tasks.
4. **Intermediate Data**: Key-value pairs generated by map tasks and shuffled across the network to the appropriate reducers.
5. **Output Data**: Final result produced by reducers, typically stored in a distributed file system like Hadoop Distributed File System (HDFS).
### Workflow of MapReduce
1. **Splitting**: Input data is divided into splits, each processed by a map task.
2. **Mapping**: Mappers process each split independently, generating intermediate key-value pairs.
3. **Shuffling**: Intermediate data is shuffled across the network, grouped by keys, and sent to reducers.
4. **Reducing**: Reducers aggregate the shuffled data, computing the final output.
5. **Output**: Final results are written to the distributed file system.
### Advantages of MapReduce
- **Scalability**: Scales horizontally across large clusters of commodity hardware; adding nodes increases both storage and processing capacity.
This document provides an overview of the MapReduce framework in Hadoop. It begins by explaining how a MapReduce job works, with map tasks (M1, M2, M3) run in parallel during the map phase, followed by reduce tasks (R1, R2) during the reduce phase. Key-value pairs generated by the maps are shuffled and sorted before being input to the reducers. The document then discusses important classes and interfaces in the mapreduce package like InputFormat, OutputFormat, Mapper, Reducer and Job that control MapReduce jobs. It provides examples of configuration parameters and their default values.
This document provides a technical introduction to Hadoop, including:
- Hadoop has been tested on a 4000 node cluster with 32,000 cores and 16 petabytes of storage.
- Key Hadoop concepts are explained, including jobs, tasks, task attempts, mappers, reducers, and the JobTracker and TaskTrackers.
- The process of launching a MapReduce job is described, from the client submitting the job to the JobTracker distributing tasks to TaskTrackers and running the user-defined mapper and reducer classes.
This document provides a technical introduction to Hadoop, including:
- Hadoop has been tested on a 4000 node cluster with 32,000 cores and 16 petabytes of storage.
- Key Hadoop concepts are explained, including jobs, tasks, task attempts, mappers, reducers, and the JobTracker and TaskTracker processes.
- The flow of a MapReduce job is described, from the client submitting the job to the JobTracker, TaskTrackers running tasks on data splits using the mapper and reducer classes, and writing outputs.
Hadoop became the most common system for storing big data.
Around Hadoop, many supporting systems emerged to fill in the capabilities that Hadoop itself lacks.
Together they form a big ecosystem.
This presentation covers some of those systems.
Since one presentation cannot cover too many of them, I tried to focus on the most famous/popular ones and on the most interesting ones.
3. Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
• Maps input key/value pairs to a set of intermediate key/value pairs.
• Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
• The Hadoop Map-Reduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
• The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context), followed by map(Object, Object, Context) for each key/value pair in the InputSplit. Finally cleanup(Context) is called.
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Mapper.html
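The setup, map, cleanup lifecycle described on this slide can be seen in a short sketch. It is not from the original deck: the class name, the record counter, and the LongWritable/Text input types are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LifecycleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final IntWritable one = new IntWritable(1);
  private final Text word = new Text();
  private long records;

  @Override
  protected void setup(Context context) {
    records = 0;  // called once per map task, before the first map() call
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    records++;
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        word.set(token);
        context.write(word, one);  // emit one intermediate (word, 1) pair per token
      }
    }
  }

  @Override
  protected void cleanup(Context context) {
    // called once per map task, after the last map() call
    System.err.println("map task processed " + records + " records");
  }
}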
4. public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();

     public void map(Object key, Text value, Context context
                     ) throws IOException, InterruptedException {
       StringTokenizer itr = new StringTokenizer(value.toString());
       while (itr.hasMoreTokens()) {
         word.set(itr.nextToken());
         context.write(word, one);
       }
     }
   }
5. What is Writable?
• Hadoop defines its own “box” classes for strings (Text), integers (IntWritable), etc.
• All values are instances of Writable
• All keys are instances of WritableComparable
6. Writable
• A serializable object which implements a simple, efficient, serialization protocol, based on DataInput and DataOutput.
• Any key or value type in the Hadoop Map-Reduce framework implements this interface.
• Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.
• http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/io/Writable.html
8. public class MyWritable implements Writable {
     // Some data
     private int counter;
     private long timestamp;

     public void write(DataOutput out) throws IOException {
       out.writeInt(counter);
       out.writeLong(timestamp);
     }

     public void readFields(DataInput in) throws IOException {
       counter = in.readInt();
       timestamp = in.readLong();
     }

     public static MyWritable read(DataInput in) throws IOException {
       MyWritable w = new MyWritable();
       w.readFields(in);
       return w;
     }
   }
9. public class MyWritableComparable implements WritableComparable<MyWritableComparable> {
     // Some data
     private int counter;
     private long timestamp;

     public void write(DataOutput out) throws IOException {
       out.writeInt(counter);
       out.writeLong(timestamp);
     }

     public void readFields(DataInput in) throws IOException {
       counter = in.readInt();
       timestamp = in.readLong();
     }

     public int compareTo(MyWritableComparable w) {
       // order keys by the counter field
       int thisValue = this.counter;
       int thatValue = w.counter;
       return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
     }
   }
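When a custom WritableComparable is used as a map output key, it is also common to override hashCode(), since the default HashPartitioner (covered later in this deck) uses key.hashCode() to pick a reduce partition. A hedged sketch of a method that could be added inside MyWritableComparable, not part of the original slide:

@Override
public int hashCode() {
  // Use only the field that compareTo() compares, so keys that compare as equal
  // are sent by the HashPartitioner to the same reduce partition.
  return Integer.hashCode(counter);
}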
10. Getting Data To The Mapper
[Diagram: for each input file, the InputFormat produces InputSplits; a RecordReader reads each InputSplit and feeds (key, value) records to a Mapper, which emits intermediate pairs.]
11. public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
      if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
      }
      Job job = new Job(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
      FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
12. Reading Data
• Data sets are specified by InputFormats
  – Defines input data (e.g., a directory)
  – Identifies partitions of the data that form an InputSplit
  – Factory for RecordReader objects to extract (k, v) records from the input source
13. Input Format
• InputFormat describes the input-specification for a Map-Reduce job
• The Map-Reduce framework relies on the InputFormat of the job to:
  – Validate the input-specification of the job.
  – Split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
  – Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper.
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/InputFormat.html
14. FileInputFormat and Friends
• TextInputFormat
  – Treats each ‘\n’-terminated line of a file as a value
• KeyValueTextInputFormat
  – Maps ‘\n’-terminated text lines of “k SEP v”
• SequenceFileInputFormat
  – Binary file of (k, v) pairs (for passing data from the output of one MapReduce job to the input of some other MapReduce job)
• SequenceFileAsTextInputFormat
  – Same, but maps (k.toString(), v.toString())
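A brief, hedged sketch of how one of these formats is selected. It assumes the job object from the word-count driver shown earlier; the ',' separator and the Hadoop 2.x property name are stated as assumptions rather than taken from the original slides.

// Switch from the default TextInputFormat to KeyValueTextInputFormat and split each
// line into (key, value) at the first ',' instead of the default tab separator.
job.getConfiguration().set(
    "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);

// SequenceFileInputFormat is chosen the same way when the input is the binary
// output of a previous MapReduce job:
// job.setInputFormatClass(SequenceFileInputFormat.class);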
15. Filtering File Inputs
• FileInputFormat will read all files out of a specified directory and send them to the mapper
• Delegates filtering this file list to a method subclasses may override
  – e.g., Create your own “xyzFileInputFormat” to read *.xyz from directory list
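Besides subclassing, a similar effect can be achieved with a PathFilter registered on FileInputFormat. A minimal sketch follows; the class name and the *.xyz extension are illustrative, following the slide's example, and the commented driver line assumes the job object from the word-count driver.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class XyzPathFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    // keep only *.xyz files from the input directory listing
    return path.toString().endsWith(".xyz");
  }
}

// In the driver, after the input paths are added:
// FileInputFormat.setInputPathFilter(job, XyzPathFilter.class);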
16. Record Readers
• Each InputFormat provides its own RecordReader implementation
  – Provides (unused?) capability multiplexing
• LineRecordReader
  – Reads a line from a text file
• KeyValueLineRecordReader
  – Used by KeyValueTextInputFormat
17. Input Split Size
• FileInputFormat will divide large files into chunks
  – Exact size controlled by mapred.min.split.size
• RecordReaders receive file, offset, and length of chunk
• Custom InputFormat implementations may override split size
  – e.g., “NeverChunkFile”
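A hedged sketch of the “NeverChunkFile” idea using the Hadoop 2.x mapreduce API; the class name is illustrative. Overriding isSplitable makes every file a single split, so one mapper sees the whole file.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class NeverChunkFileInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split: each input file becomes exactly one InputSplit
  }
}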
19. public class ObjectPositionInputFormat extends
        FileInputFormat<Text, Point3D> {

      public RecordReader<Text, Point3D> getRecordReader(
          InputSplit input, JobConf job, Reporter reporter)
          throws IOException {
        reporter.setStatus(input.toString());
        return new ObjPosRecordReader(job, (FileSplit) input);
      }

      // InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
      // (signature shown for reference; FileInputFormat already provides getSplits)
    }
20. class ObjPosRecordReader implements RecordReader<Text, Point3D> {

      public ObjPosRecordReader(JobConf job, FileSplit split) throws IOException {
      }

      public boolean next(Text key, Point3D value) throws IOException {
        // get the next line
      }

      public Text createKey() {
      }

      public Point3D createValue() {
      }

      public long getPos() throws IOException {
      }

      public void close() throws IOException {
      }

      public float getProgress() throws IOException {
      }
    }
21. Sending Data To Reducers
• The map function emits its output through the Mapper.Context object
  – context.write(k, v) takes (k, v) elements
• Any (WritableComparable, Writable) can be used
24. Partitioner
• int getPartition(key, val, numPartitions)
– Outputs the partition number for a given key
– One partition == values sent to one Reduce task
• HashPartitioner used by default
– Uses key.hashCode() to return partition num
• Job sets Partitioner implementation
25. public class MyPartitioner extends Partitioner<IntWritable, Text> {
      @Override
      public int getPartition(IntWritable key, Text value, int numPartitions) {
        /* Pretty ugly hard-coded partitioning function. Don't do that in practice,
           it is just for the sake of understanding. */
        int nbOccurences = key.get();
        if (nbOccurences < 3)
          return 0;
        else
          return 1;
      }
    }

    job.setPartitionerClass(MyPartitioner.class);
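For this two-way split to work, the job also needs at least two reduce tasks; otherwise there is no partition 1 to send data to. A hedged one-line addition to the driver, not shown on the original slide:

job.setNumReduceTasks(2);  // one reduce task per partition returned by MyPartitioner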
26. Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
• Reduces a set of intermediate values which share a key to a smaller set of values.
• Reducer has 3 primary phases:
  – Shuffle
  – Sort
  – Reduce
• http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Reducer.html
27. public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

     private IntWritable result = new IntWritable();

     public void reduce(Text key, Iterable<IntWritable> values,
                        Context context
                        ) throws IOException, InterruptedException {
       int sum = 0;
       for (IntWritable val : values) {
         sum += val.get();
       }
       result.set(sum);
       context.write(key, result);
     }
   }
29. OutputFormat
• Analogous to InputFormat
• TextOutputFormat
  – Writes “key val\n” strings to output file
• SequenceFileOutputFormat
  – Uses a binary format to pack (k, v) pairs
• NullOutputFormat
  – Discards output
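As with input formats, the output format is chosen on the Job. A brief hedged sketch, assuming the job object from the word-count driver and an arbitrary choice of SequenceFileOutputFormat:

// Pack the (Text, IntWritable) results into a binary SequenceFile instead of plain text,
// e.g. so a follow-up MapReduce job can read them with SequenceFileInputFormat.
job.setOutputFormatClass(SequenceFileOutputFormat.class);

// Or discard the output entirely:
// job.setOutputFormatClass(NullOutputFormat.class);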
31. public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
      if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
      }
      Job job = new Job(conf, "word count");
      job.setJarByClass(WordCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
      FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
32. Job
• The job submitter's view of the Job.
• It allows the user to configure the job, submit it, control its execution, and query the state. The set methods only work until the job is submitted; afterwards they will throw an IllegalStateException.
• Normally the user creates the application, describes various facets of the job via Job, and then submits the job and monitors its progress.
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/Job.html
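waitForCompletion(true), used in the word-count driver above, submits the job and blocks while printing progress. A hedged sketch of the non-blocking alternative that the Job API also offers; the polling interval and printed messages are illustrative.

job.submit();                  // returns as soon as the job is handed to the cluster
while (!job.isComplete()) {    // poll the job state from the driver
  System.out.printf("map %.0f%%  reduce %.0f%%%n",
      job.mapProgress() * 100, job.reduceProgress() * 100);
  Thread.sleep(5000);
}
System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");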