
UNIT-III

MapReduce is a programming model for data processing. The model is simple, yet
not too simple to express useful programs in. Hadoop can run MapReduce programs written
in various languages. Most important, MapReduce programs are inherently parallel, thus
putting very large-scale data analysis into the hands of anyone with enough machines at their
disposal. MapReduce comes into its own for large datasets, so let's start by looking at one.

A Weather Dataset
For our example, we will write a program that mines weather data. Weather sensors
collecting data every hour at many locations across the globe gather a large volume of log
data, which is a good candidate for analysis with MapReduce, since it is semistructured and
record-oriented.
Data Format
The data we will use is from the National Climatic Data Center (NCDC, http://www
.ncdc.noaa.gov/). The data is stored using a line-oriented ASCII format, in which each line is
a record. The format supports a rich set of meteorological elements, many of which are
optional or with variable data lengths. For simplicity, we shall focus on the basic elements,
such as temperature, which are always present and are of fixed width.
Sample records illustrating the salient fields are shown later in this unit (in the "Map and
Reduce" section). Note that in the real file the fields are packed into one line with no
delimiters.
Data files are organized by date and weather station. There is a directory for each year
from 1901 to 2001, each containing a gzipped file for each weather station with its readings
for that year. For example, here are the first entries for 1990:
% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
Since there are tens of thousands of weather stations, the whole dataset is made up of
a large number of relatively small files. It’s generally easier and more efficient to process a
smaller number of relatively large files, so the data was preprocessed so that each year’s
readings were concatenated into a single file.
Analyzing the Data with Hadoop
To take advantage of the parallel processing that Hadoop provides, we need to express
our query as a MapReduce job. After some local, small-scale testing, we will be able to run it
on a cluster of machines.
Map and Reduce
MapReduce works by breaking the processing into two phases: the map phase and
the reduce phase. Each phase has key-value pairs as input and output, the types of which

may be chosen by the programmer. The programmer also specifies two functions: the map
function and the reduce function.
The input to our map phase is the raw NCDC data. We choose a text input format that
gives us each line in the dataset as a text value. The key is the offset of the beginning of the
line from the beginning of the file, but as we have no need for this, we ignore it.

Our map function is simple. We pull out the year and the air temperature, since these
are the only fields we are interested in. In this case, the map function is just a data preparation
phase, setting up the data in such a way that the reducer function can do its work on it:
finding the maximum temperature for each year. The map function is also a good place to
drop bad records: here we filter out temperatures that are missing, suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input data
(some unused columns have been dropped to fit the page, indicated by ellipses):
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)

The keys are the line offsets within the file, which we ignore in our map function. The map
function merely extracts the year and the air temperature, and emits them as its output
(the temperature values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before being
sent to the reduce function. This processing sorts and groups the key-value pairs by key.
So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to
do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year.

The whole data flow is illustrated in Figure 2-1 below. At the bottom of the diagram
is a Unix pipeline that mimics the whole MapReduce flow.

Figure 2-1. MapReduce logical data flow
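The figure itself is not reproduced here, but the Unix pipeline it refers to expresses the same
logical flow. Conceptually, it looks something like the following sketch, where
max_temperature_map and max_temperature_reduce stand for any two programs that implement the
map and reduce functions (the names are placeholders, not part of the original figure):

% cat sample.txt | max_temperature_map | sort | max_temperature_reduce

The sort command plays the role of the framework's sort-and-group step between the two phases.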

Java MapReduce
Having run through how the MapReduce program works, the next step is to express it
in code. We need three things: a map function, a reduce function, and some code to run
the job. The map function is represented by an implementation of the Mapper interface,
which declares a map() method. Example 1 shows the implementation of our map function.

Example 1. Mapper for maximum temperature example

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}

The Mapper interface is a generic type, with four formal type parameters that specify
the input key, input value, output key, and output value types of the map function. For the
present example, the input key is a long integer offset, the input value is a line of text, the
output key is a year, and the output value is an air temperature (an integer). Rather than use
built-in Java types, Hadoop provides its own set of basic types that are optimized for network
serialization. These are found in the org.apache.hadoop.io package.
Here we use LongWritable, which corresponds to a Java Long, Text (like Java
String), and IntWritable (like Java Integer).
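As a small illustration (not part of the original example), the Writable wrappers are
constructed from, and unwrapped back to, ordinary Java values:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

// Minimal sketch of how Hadoop's Writable types wrap plain Java values.
public class WritableTypesSketch {
  public static void main(String[] args) {
    Text year = new Text("1950");               // roughly analogous to String
    IntWritable temp = new IntWritable(22);     // roughly analogous to Integer
    LongWritable offset = new LongWritable(0L); // roughly analogous to Long
    System.out.println(year + "\t" + temp.get() + "\t" + offset.get());
  }
}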

The map() method is passed a key and a value. We convert the Text value containing the line
of input into a Java String, then use its substring() method to extract the columns we are
interested in.
The map() method also provides an instance of OutputCollector to write the output to. In this
case, we write the year as a Text object (since we are just using it as a key), and the
temperature is wrapped in an IntWritable. We write an output record only if the temperature
is present and the quality code indicates the temperature reading is OK.

The reduce function is similarly defined using a Reducer, as illustrated in Example 2.

Example 2. Reducer for maximum temperature example

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}

Again, four formal type parameters are used to specify the input and output types, this
time for the reduce function. The input types of the reduce function must match the output
types of the map function: Text and IntWritable. And in this case, the output types of the
reduce function are Text and IntWritable, for a year and its maximum temperature, which we
find by iterating through the temperatures and comparing each with a record of the highest
found so far.

The third piece of code runs the MapReduce job (see Example 3).

Example 3. Application to find the maximum temperature in the weather dataset

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}

A JobConf object forms the specification of the job. It gives you control over how the
job is run. When we run this job on a Hadoop cluster, we will package the code into a JAR
file (which Hadoop will distribute around the cluster). Rather than explicitly specify the name
of the JAR file, we can pass a class in the JobConf constructor, which Hadoop will use to
locate the relevant JAR file by looking for the JAR file containing this class.
Having constructed a JobConf object, we specify the input and output paths. An input
path is specified by calling the static addInputPath() method on FileInputFormat, and it
can be a single file, a directory (in which case, the input forms all the files in that directory),
or a file pattern. As the name suggests, addInputPath() can be called more
than once to use input from multiple paths.

The output path (of which there is only one) is specified by the static setOutputPath()
method on FileOutputFormat. It specifies a directory where the output files from the
reducer functions are written. The directory shouldn’t exist before running the job, as
Hadoop will complain and not run the job. This precaution is to prevent data loss (it can be
very annoying to accidentally overwrite the output of a long job with another).

Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods. The setOutputKeyClass() and setOutputValueClass()
methods control the output types for the map and the reduce functions, which are often the
same, as they are in our case.
If they are different, then the map output types can be set using the methods
setMapOutputKeyClass() and setMapOutputValueClass(). The input types are controlled via
the input format, which we have not explicitly set since we are using the default
TextInputFormat. After setting the classes that define the map and reduce functions, we are
ready to run the job.
The static runJob() method on JobClient submits the job and waits for it to finish,
writing information about its progress to the console.
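To make this concrete, after compiling the classes the job can be run locally with commands
along the following lines (a sketch: the class-path location build/classes and the sample
input file input/ncdc/sample.txt are placeholders that depend on how and where the code was
built):

% export HADOOP_CLASSPATH=build/classes
% hadoop MaxTemperature input/ncdc/sample.txt output

On a real cluster the application would instead be packaged into a JAR file and submitted
with the hadoop jar command, as described above.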

The output was written to the output directory, which contains one output file per
reducer. The job had a single reducer, so we find a single file, named part-00000:

% cat output/part-00000
1949 111
1950 22
This result is the same as when we went through it by hand earlier. We interpret this
as saying that the maximum temperature recorded in 1949 was 11.1°C, and in 1950 it was
2.2°C (the NCDC data records temperatures in tenths of a degree Celsius).

The new Java MapReduce API:


Release 0.20.0 of Hadoop included a new Java MapReduce API, sometimes referred
to as “Context Objects,” designed to make the API easier to evolve in the future. The new
API is type-incompatible with the old, however, so applications need to be rewritten to take
advantage of it.
There are several notable differences between the two APIs:
 The new API favours abstract classes over interfaces, since these are easier to evolve.
For example, you can add a method (with a default implementation) to an abstract class
without breaking old implementations of the class. In the new API, the Mapper and
Reducer interfaces are now abstract classes.
 The new API is in the org.apache.hadoop.mapreduce package (and subpackages). The
old API can still be found in org.apache.hadoop.mapred.
 The new API makes extensive use of context objects that allow the user code to
communicate with the MapReduce system. The MapContext, for example, essentially
unifies the role of the JobConf, the OutputCollector, and the Reporter.
 The new API supports both a “push” and a “pull” style of iteration. In both APIs, key-
value record pairs are pushed to the mapper, but in addition, the new API allows a
mapper to pull records from within the map() method. The same goes for the reducer. An
example of how the “pull” style can be useful is processing records in batches, rather
than one by one.
 Configuration has been unified. The old API has a special JobConf object for job
configuration, which is an extension of Hadoop’s vanilla Configuration object. In the
new API, this distinction is dropped, so job configuration is done through a
Configuration.
 Job control is performed through the Job class, rather than JobClient, which no longer
exists in the new API.

 Output files are named slightly differently: part-m-nnnnn for map outputs, and part-r-
nnnnn for reduce outputs (where nnnnn is an integer designating the part number,
starting from zero).

When converting your Mapper and Reducer classes to the new API, don’t forget to
change the signature of the map() and reduce() methods to the new form. Just changing your
class to extend the new Mapper or Reducer classes will not produce a compilation error or
warning, since these classes provide an identity form of the map() or reduce() method
(respectively). Your mapper or reducer code, however, will not be invoked, which can lead to
some hard-to-diagnose errors.
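To illustrate these differences, the following is a sketch of the maximum-temperature
application rewritten against the new API in org.apache.hadoop.mapreduce. The logic mirrors
Examples 1-3; treat it as an illustrative rewrite rather than as part of the original
examples (in particular, the class name NewMaxTemperature is chosen here for the sketch).

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewMaxTemperature {

  // In the new API, Mapper is an abstract class that we extend, not an interface.
  public static class NewMaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature;
      if (line.charAt(87) == '+') {
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        // The Context object replaces OutputCollector and Reporter.
        context.write(new Text(year), new IntWritable(airTemperature));
      }
    }
  }

  public static class NewMaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // The values now arrive as an Iterable rather than an Iterator.
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) {
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: NewMaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    // Job replaces JobConf and JobClient for configuration and job control.
    Job job = new Job();
    job.setJarByClass(NewMaxTemperature.class);
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(NewMaxTemperatureMapper.class);
    job.setReducerClass(NewMaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}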

Combiner
A Combiner, also known as a semi-reducer, is an optional class that operates by accepting
the inputs from the Map class and thereafter passing the output key-value pairs to the
Reducer class.

The main function of a Combiner is to summarize the map output records with the same key.
The output (key-value collection) of the combiner will be sent over the network to the actual
Reducer task as input.

Combiner class
The Combiner class is used in between the Map class and the Reduce class to reduce the
volume of data transfer between Map and Reduce. Usually, the output of the map task is
large and the data transferred to the reduce task is high.
The following MapReduce task diagram shows the COMBINER PHASE.

How a Combiner Works
Here is a brief summary of how the MapReduce Combiner works −

 A combiner does not have a predefined interface and it must implement the Reducer
interface’s reduce() method.

 A combiner operates on each map output key. It must have the same output key-value
types as the Reducer class.

 A combiner can produce summary information from a large dataset because it replaces
the original Map output.

Although the Combiner is optional, it helps segregate the data into smaller groups for the
Reduce phase, which makes the data easier to process.
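For the maximum-temperature job of Examples 1-3, the reduce function (taking a maximum) is
both commutative and associative, so the reducer class itself can double as the combiner.
With the old API, this amounts to one extra line on the JobConf from Example 3 (shown here
as an illustrative addition, not part of the original listing):

// Reuse the reducer as a combiner; valid because max() is commutative and associative.
conf.setCombinerClass(MaxTemperatureReducer.class);

The word-count walkthrough below makes the effect of a combiner more concrete.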

MapReduce Combiner Implementation

The following example provides a theoretical idea about combiners. Let us assume we have
the following input text file named input.txt for MapReduce.
What do you mean by Object
What do you know about Java
What is Java Virtual Machine
How Java enabled High Performance

The important phases of the MapReduce program with Combiner are discussed below.

Record Reader

This is the first phase of MapReduce where the Record Reader reads every line from the
input text file as text and yields output as key-value pairs.

Input − Line by line text from the input file.


Output − Forms the key-value pairs. The following is the set of expected key-value pairs.
<1, What do you mean by Object>
<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>
Map Phase

The Map phase takes input from the Record Reader, processes it, and produces the output as
another set of key-value pairs.

Input − The following key-value pairs are the input taken from the Record Reader.

<1, What do you mean by Object>
<2, What do you know about Java>
<3, What is Java Virtual Machine>
<4, How Java enabled High Performance>

The Map phase reads each key-value pair, splits the value into individual words using
StringTokenizer, and emits each word as a key with a count of 1 as its value. The
following code snippet shows the Mapper class and the map function.

public static class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}
Output − The expected output is as follows −
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

Combiner Phase

The Combiner phase takes each key-value pair from the Map phase, processes it, and
produces the output as key-value collection pairs.
Input − The following key-value pairs are the input taken from the Map phase.
<What,1> <do,1> <you,1> <mean,1> <by,1> <Object,1>
<What,1> <do,1> <you,1> <know,1> <about,1> <Java,1>
<What,1> <is,1> <Java,1> <Virtual,1> <Machine,1>
<How,1> <Java,1> <enabled,1> <High,1> <Performance,1>

The Combiner phase reads each key-value pair and, for each distinct word, collects the
values into a list keyed by that word. Usually, the code and operation of a Combiner are
similar to those of a Reducer. Following is the code snippet for the Mapper, Combiner and
Reducer class declarations.

job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountReducer.class);
job.setReducerClass(WordCountReducer.class);
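For context, a complete driver that wires these classes together might look roughly like the
following sketch, using the new API. It assumes the WordCountMapper and WordCountReducer
classes shown in this section are nested inside this driver class; the class name WordCount
and the job name are assumptions for the sketch, not taken from the original snippet.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // ... WordCountMapper and WordCountReducer as shown above ...

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: WordCount <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // the reducer doubles as the combiner
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With this configuration in place, the expected output of the Combiner phase is shown next.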

Output − The expected output is as follows −

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

Partitioner Phase

The partitioning phase takes place after the map phase and before the reduce phase. The
number of partitions is equal to the number of reducers. The data gets partitioned across
the reducers according to the partitioning function.

The difference between a partitioner and a combiner is that the partitioner divides
the data according to the number of reducers, so that all the data in a single partition is
processed by a single reducer, whereas the combiner functions similarly to the reducer and
processes the data within each partition. The combiner is an optimization of the reducer.

The default partitioning function is the hash partitioning function, where the hashing
is done on the key. However, it might be useful to partition the data according to some other
function of the key or the value.
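The default behaviour corresponds to Hadoop's HashPartitioner, which essentially computes
(key.hashCode() & Integer.MAX_VALUE) % numPartitions. A custom partitioner extends the same
Partitioner class; the following sketch (the class name FirstCharPartitioner and the
partitioning rule are purely illustrative) routes keys by their first character instead:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative custom partitioner for <Text, IntWritable> map output:
// routes each key to a partition based on its first character.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0; // empty keys all go to the first partition
    }
    return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be enabled on the job with job.setPartitionerClass(FirstCharPartitioner.class); in
practice, the default hash partitioning is sufficient for most jobs.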

Reducer Phase

The Reducer phase takes each key-value collection pair from the Combiner phase, processes
it, and passes the output as key-value pairs. Note that the Combiner functionality is the
same as the Reducer's.

Input − The following key-value pairs are the input taken from the Combiner phase.

<What,1,1,1> <do,1,1> <you,1,1> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,1,1,1>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

The Reducer phase reads each key-value pair and sums the values for each key. Following is
the code snippet for the Reducer (the same class is also used as the Combiner).

public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}

Output − The expected output from the Reducer phase is as follows –

<What,3> <do,2> <you,2> <mean,1> <by,1> <Object,1>
<know,1> <about,1> <Java,3>
<is,1> <Virtual,1> <Machine,1>
<How,1> <enabled,1> <High,1> <Performance,1>

Record Writer

This is the last phase of MapReduce where the Record Writer writes every key-value pair
from the Reducer phase and sends the output as text.

Input − Each key-value pair from the Reducer phase along with the Output format.

Output − It gives you the key-value pairs in text format. Following is the expected output.

What 3
do 2
you 2
mean 1
by 1
Object 1
know 1
about 1
Java 3
is 1
Virtual 1
Machine 1
How 1
enabled 1
High 1
Performance 1

