MapReduce Questions
The following run() fragment configures a job that reads a SequenceFile and writes its output as a block-compressed (gzip) MapFile:

@Override
public int run(String[] args) throws Exception {
    Job job = ...; // job construction elided in the original
    if (job == null) {
        return -1;
    }
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(MapFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    return job.waitForCompletion(true) ? 0 : 1;
}
https://www.javatpoint.com/mapreduce-api
https://www.dataspoof.info/post/understanding-counter-in-mapreduce-along-with-code/
https://techvidvan.com/tutorials/hadoop-counters-types-and-roles/
User-defined counters
So far, we have discussed the built-in counters. But what if we need a statistic that the existing Hadoop MapReduce counters do not provide? Hadoop offers a feature for exactly this case: user-defined counters, also called custom counters. We define them ourselves to gather whatever statistics the client's requirement calls for. In Java, a custom counter group is declared as an enum type.
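Before looking at the Hadoop program below, it helps to see the idea in isolation: an enum-backed counter is just a named tally per enum constant. The following is a minimal plain-Java sketch of that concept with no Hadoop dependencies; the class and method names are illustrative only, not Hadoop's API.

```java
import java.util.EnumMap;

public class CounterSketch {

    // The enum constants name the counters, just as a Hadoop
    // user-defined counter group does.
    enum Counter { ODD_COUNT, EVEN_COUNT }

    private final EnumMap<Counter, Long> tallies = new EnumMap<>(Counter.class);

    // Analogous to context.getCounter(...).increment(1) in Hadoop
    void increment(Counter c) {
        tallies.merge(c, 1L, Long::sum);
    }

    long get(Counter c) {
        return tallies.getOrDefault(c, 0L);
    }

    public static void main(String[] args) {
        CounterSketch counters = new CounterSketch();
        for (int n : new int[]{1, 2, 3, 4, 5, 6}) {
            if (n % 2 == 0) {
                counters.increment(Counter.EVEN_COUNT);
            } else {
                counters.increment(Counter.ODD_COUNT);
            }
        }
        System.out.println("ODD_COUNT " + counters.get(Counter.ODD_COUNT));
        System.out.println("EVEN_COUNT " + counters.get(Counter.EVEN_COUNT));
    }
}
```

In Hadoop, the framework aggregates these per-task tallies across all tasks of the job; this sketch only shows the single-process idea.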
For an input file containing three odd and three even numbers, the job reports:

ODD_COUNT 3
EVEN_COUNT 3

Now, let's look at the program to see how the counters are implemented:
Mapper Class
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// Mapper: parses one integer per input line, increments the matching
// user-defined counter, and emits the number unchanged.
public class MapperClass extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text item = new Text();
    private final IntWritable sales = new IntWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line holds a single integer
        int number = Integer.parseInt(value.toString().trim());

        if (number % 2 == 0) {
            // incrementing the even-number counter
            context.getCounter(Driver.Counter.EVEN_COUNT).increment(1);
        } else {
            // incrementing the odd-number counter
            context.getCounter(Driver.Counter.ODD_COUNT).increment(1);
        }

        item.set(value.toString().trim());
        sales.set(number);
        context.write(item, sales);
    }
}
Driver Class
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    // The enum name is the counter group; each constant is one counter.
    enum Counter {
        ODD_COUNT,
        EVEN_COUNT
    }

    public static void main(String[] args) throws Exception {
        int exitFlag = ToolRunner.run(new Driver(), args);
        System.exit(exitFlag);
    }

    @Override
    public int run(String[] args) throws Exception {
        // Use the configuration injected by ToolRunner
        Job job = Job.getInstance(getConf(), "ODD-EVEN Counter");
        job.setJarByClass(getClass());
        job.setMapperClass(MapperClass.class);
        job.setNumReduceTasks(0); // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Here, we have two classes.
The first is the Driver class, where we create an enum to implement the counters. As discussed before, the enum name (Counter) is the counter group name, and ODD_COUNT and EVEN_COUNT are the two counters we use to count odd and even numbers.
The driver also sets the job configuration: the mapper class (this job has no reducer class), the output key and value class types, the input path the job reads from, and the output path where the results are stored.
The other class is the Mapper class, where we actually update the counters for even and odd numbers.

context.getCounter(Driver.Counter.EVEN_COUNT).increment(1);

This line increments the counter for an even number; a matching line increments the counter for an odd number.
Below are the counter values we get when running against the input file mentioned above.
Output
Partial Sort:
The reducer output will be many files, each of which is sorted within itself based on the key.
Total Sort:
The reducer output will be a single file having all the output sorted based on the
key.
Secondary Sort:
In this case, we are able to control the ordering of the values in addition to the ordering of the keys.

Secondary Sort
Secondary sort is a technique that can be used to sort data on multiple fields. It relies on a composite key that contains all the values we want to sort on. In this section we will read our “donations” Sequence File and map each record to a composite key before sorting and reducing.
package : https://github.com/nicomak/[…]/secondarysort.
The MapReduce secondary sort job that is executed to get our query results uses a composite key with three attributes:
state (String) – This one will be used as our natural key (or primary key), and for partitioning
city (String) – A secondary key for sorting entries that have the same “state” natural key
total (float) – Another secondary key for further sorting when the “city” is also the same
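The three attributes above can be modeled as one composite key whose compareTo encodes the full sort order: state first, then city, then total in descending order. The following is a plain-Java sketch of that comparison logic only; in a real Hadoop job the key would additionally implement WritableComparable for serialization, and the class name here is illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CompositeKey implements Comparable<CompositeKey> {

    final String state;  // natural key, also used for partitioning
    final String city;   // secondary sort key
    final float total;   // further sorting when the city is the same

    CompositeKey(String state, String city, float total) {
        this.state = state;
        this.city = city;
        this.total = total;
    }

    @Override
    public int compareTo(CompositeKey other) {
        int cmp = state.compareTo(other.state);      // primary: state
        if (cmp != 0) return cmp;
        cmp = city.compareTo(other.city);            // secondary: city
        if (cmp != 0) return cmp;
        return Float.compare(other.total, total);    // tertiary: total, descending
    }

    @Override
    public String toString() {
        return state + "/" + city + "/" + total;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>(List.of(
                new CompositeKey("NY", "Albany", 10f),
                new CompositeKey("CA", "Fresno", 5f),
                new CompositeKey("CA", "Fresno", 20f),
                new CompositeKey("CA", "Davis", 1f)));
        Collections.sort(keys);
        keys.forEach(System.out::println);
    }
}
```

In the MapReduce job, the partitioner would look only at state, so all keys for one state reach the same reducer already ordered by city and total.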
1. Partial Sort
Definition: A partial sort produces output that is sorted only in part: each output file is sorted within itself, but there is no global order across files, and often only a portion of the data (such as the top N records) is of interest, rather than a full sort of the entire dataset.
Use Case: It is useful when you want to retrieve the top N elements based on a
key.
Example:
You have a dataset of sales records, and you want to find the top 10
salespeople.
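The top-N retrieval above does not require sorting the whole dataset: a bounded min-heap keeps only the current best N while streaming through the data. Here is a plain-Java sketch of that idea (the sales figures are made up for illustration; in MapReduce the same trick is typically applied per mapper and then merged in a single reducer):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopNSales {

    // Returns the top n values in descending order without sorting
    // the full input: a min-heap of size n evicts the smallest element.
    static List<Integer> topN(List<Integer> sales, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>();
        for (int s : sales) {
            heap.offer(s);
            if (heap.size() > n) {
                heap.poll(); // drop the current smallest of the kept values
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        List<Integer> sales = List.of(40, 10, 95, 70, 20, 85, 60);
        // Prints the three largest sales figures, largest first
        System.out.println(topN(sales, 3));
    }
}
```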
2. Total Sort
Definition: A total sort involves sorting the entire dataset based on a specified
key.
Use Case: This is applicable when you need a complete ordered view of the
data.
Example:
Sorting an entire dataset of sales records by transaction date so that the final output is one globally ordered result.
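In MapReduce, a total sort is typically achieved by range-partitioning the keys so that every key in partition i is smaller than every key in partition i+1 (Hadoop provides TotalOrderPartitioner for this); each reducer then sorts its own range, and concatenating the outputs yields one globally ordered result. A plain-Java sketch of that idea, with hard-coded split points standing in for sampled ones:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TotalSortSketch {

    // Assigns a key to a range partition: keys below the first split
    // point go to partition 0, and so on. Split points are illustrative.
    static int partition(int key, int[] splitPoints) {
        for (int i = 0; i < splitPoints.length; i++) {
            if (key < splitPoints[i]) return i;
        }
        return splitPoints.length;
    }

    static List<Integer> totalSort(List<Integer> data, int[] splitPoints) {
        int numPartitions = splitPoints.length + 1;
        List<List<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) partitions.add(new ArrayList<>());
        for (int key : data) partitions.get(partition(key, splitPoints)).add(key);

        // Each "reducer" sorts its own partition independently...
        List<Integer> out = new ArrayList<>();
        for (List<Integer> p : partitions) {
            Collections.sort(p);
            out.addAll(p); // ...and the concatenation is globally sorted
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(42, 7, 99, 15, 63, 3, 88);
        System.out.println(totalSort(data, new int[]{30, 70}));
    }
}
```

The quality of the split points decides how evenly work is spread; Hadoop samples the input to pick them.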
3. Secondary Sort
Definition: Secondary sort allows sorting by multiple criteria. After the primary
sort is applied, the secondary sort organizes the data further based on a
secondary key.
Use Case: Useful when you need to sort by one key and then by another, like
sorting by last name and then by first name.
Example:
Sorting a list of people by last name, and then by first name among people who share the same last name.
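Outside MapReduce, this last-name-then-first-name ordering maps directly onto chained comparators in plain Java; the sketch below shows the same two-level sort (the names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NameSort {

    record Person(String lastName, String firstName) { }

    // Sorts by last name first, then by first name within equal last names.
    static List<Person> sortByName(List<Person> people) {
        List<Person> sorted = new ArrayList<>(people);
        sorted.sort(Comparator.comparing(Person::lastName)
                              .thenComparing(Person::firstName));
        return sorted;
    }

    public static void main(String[] args) {
        List<Person> people = List.of(
                new Person("Smith", "John"),
                new Person("Adams", "Zoe"),
                new Person("Smith", "Alice"));
        for (Person p : sortByName(people)) {
            System.out.println(p.lastName() + ", " + p.firstName());
        }
    }
}
```

In a MapReduce secondary sort the same effect is achieved with a composite key plus a grouping comparator, since reducers receive values grouped by the natural key only.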