MapReduce Questions

1. What do you mean by side data? Mention its challenges.

Side data can be defined as extra read-only data needed by a job to process the main dataset. The challenge is to make side data available to all the map or reduce tasks in a convenient and efficient fashion.
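A common way to make small amounts of side data available is to serialize it into the job configuration (larger files are usually distributed via the distributed cache instead). Below is a minimal sketch of the configuration approach; the property name my.side.threshold and the mapper are illustrative, not from a specific library:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SideDataExample {

    public static class ThresholdMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private int threshold;

        @Override
        protected void setup(Context context) {
            // Read the side data back out of the job configuration.
            threshold = context.getConfiguration().getInt("my.side.threshold", 0);
        }
        // map() would then use 'threshold' while processing the main dataset.
    }

    static void configure(Job job) {
        // Serialize the side data into the job configuration at submit time.
        job.getConfiguration().setInt("my.side.threshold", 42);
    }
}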

2. Write down the uses of counters.

• Counters are a useful channel for gathering statistics about the job, for quality control or for application-level statistics.
• They are also useful for problem diagnosis. If you are tempted to put a log message into your map or reduce task, it is often better to see whether you can use a counter instead to record that a particular condition occurred.
• Counter values are much easier to retrieve than log output for large distributed jobs: you get a single record of the number of times the condition occurred, which is more work to obtain from a set of logfiles.
3. What is FileInputFormat?
FileInputFormat is the base class for all implementations of
InputFormat that use files as their data source. It provides two
things: a place to define which files are included as the input to a
job, and an implementation for generating splits for the input files.
The job of dividing splits into records is performed by subclasses.
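For illustration, here is how input paths are typically specified through FileInputFormat; the paths below are placeholders:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathSetup {
    static void setInputs(Job job) throws IOException {
        FileInputFormat.addInputPath(job, new Path("/data/logs"));      // add a directory
        FileInputFormat.addInputPath(job, new Path("/data/extra.txt")); // add a single file
        // setInputPaths replaces anything added so far:
        // FileInputFormat.setInputPaths(job, new Path("/data/a"), new Path("/data/b"));
    }
}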

4. Illustrate with a neat diagram the use of separators in streaming MapReduce jobs.
5. Define partition function.
The partition function operates on the intermediate key and value
types (K2 and V2), and returns the partition index. In practice, the
partition is determined solely by the key (the value is ignored):
partition: (K2, V2) → integer
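In the Java API this corresponds to the abstract Partitioner class; the default HashPartitioner implements exactly this function by hashing the key. A minimal sketch of an equivalent partitioner, with Text/IntWritable chosen purely for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// partition: (K2, V2) -> integer, here with K2 = Text and V2 = IntWritable.
public class MyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Only the key is used; masking with Integer.MAX_VALUE avoids a
        // negative index when hashCode() is negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}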
6. What is IdentityMapper?
IdentityMapper is a generic type, which allows it to work with any
key or value types, with the restriction that the map input and
output keys are of the same type, and the map input and output
values are of the same type.
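In other words, it simply writes every input record to its output unchanged. In the newer MapReduce API the base Mapper class provides the same identity behaviour by default; a minimal sketch of what that looks like:

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

// Identity mapping: Mapper<K, V, K, V> forces the input and output key types
// (and value types) to match, and each record passes through unchanged.
public class MyIdentityMapper<K, V> extends Mapper<K, V, K, V> {
    @Override
    protected void map(K key, V value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}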
7. Create a MapReduce program for sorting a Sequence File with
IntWritable keys using the default HashPartitioner.

MapReduce program for sorting a SequenceFile

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SortByTemperatureToMapFile extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // JobBuilder is a helper from the book's example code that parses the
    // input and output paths from the command line.
    Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);
    if (job == null) {
      return -1;
    }
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputFormatClass(MapFileOutputFormat.class);
    SequenceFileOutputFormat.setCompressOutput(job, true);
    SequenceFileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    // No mapper, reducer, or partitioner is set, so the identity mapper and
    // reducer and the default HashPartitioner are used: the sorted reduce
    // input becomes sorted output, partitioned by key hash.
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new SortByTemperatureToMapFile(), args));
  }
}

8. Write a note on the MapReduce Library Classes.

MapReduce Library Classes

https://www.javatpoint.com/mapreduce-api
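As an illustration of how far the library classes can go, a complete word count can be assembled from them without writing any mapper or reducer of our own. A minimal sketch using Hadoop's TokenCounterMapper and IntSumReducer library classes:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class LibraryWordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "library word count");
        job.setJarByClass(LibraryWordCount.class);
        job.setMapperClass(TokenCounterMapper.class); // tokenizes lines, emits (word, 1)
        job.setCombinerClass(IntSumReducer.class);    // sums counts locally
        job.setReducerClass(IntSumReducer.class);     // sums counts per word
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}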

9. Write a note on Map-Side Joins.

10. Explain about Input Splits and Records.

11. Discuss the Counter in MapReduce with proper example.

https://www.dataspoof.info/post/understanding-counter-in-mapreduce-along-with-code/

https://techvidvan.com/tutorials/hadoop-counters-types-and-roles/

User-defined counters
So far we have discussed the built-in counters. But what if we want statistics that are not provided by the existing Hadoop MapReduce counters? Hadoop provides an extra feature for exactly this case: MapReduce user-defined counters, also called custom counters, which we can define according to the client's custom requirements. In Java, we use the enum type to define user-defined counters.

In a Hadoop job we can define as many enums as we require. Each enum is a counter group, and each field of the enum is a counter in that particular group. This is a compile-time approach, so we can't define or change the counters at runtime; we need to specify them before the job runs.
Dynamic Counters in Hadoop
Enum-based user-defined counters are fixed at compile time, which means we cannot change them or add new ones while the job is running. What if we want to add new counters dynamically at runtime? For that, Hadoop offers dynamic counters, which are created at runtime rather than declared at compile time.
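For example, inside map() or reduce() a dynamic counter is created the moment it is first incremented, identified purely by group and counter name strings (the names below are illustrative):

// No enum declaration needed: the group "TemperatureQuality" and the
// counter "MALFORMED" are created on first use, at runtime.
context.getCounter("TemperatureQuality", "MALFORMED").increment(1);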

Implementation of Counter in MapReduce

Now, let's implement a sample program that creates two counters, ODD_COUNT and EVEN_COUNT. Suppose we have an input file with a single number on each line; how do we calculate the count of even and odd numbers?

As sample input, take a file with one number per line, containing three odd and three even numbers.

So the output should be:

ODD_COUNT 3

EVEN_COUNT 3

Now let's walk through the program to see how the counters are implemented:

Mapper Class
1import org.apache.hadoop.io.IntWritable;
2import org.apache.hadoop.io.LongWritable;
3import org.apache.hadoop.io.Text;
4import org.apache.hadoop.mapreduce.Mapper;
5
6import java.io.IOException;
7
8// Mapper
9public class MapperClass extends Mapper<LongWritable, Text, Text, IntWritable>{
10 private Text item = new Text();
11 IntWritable sales = new IntWritable();
12 public void map(LongWritable key, Text value, Context context)
13 throws IOException, InterruptedException {
14 // Splitting the line on tab
15 int number = Integer.parseInt(value.toString());
16
17
18 if(number%2==0) {
19 context.getCounter(Driver.Counter.EVEN_COUNT).increment(1);
20 }else {
21 // incrementing counter
22 context.getCounter(Driver.Counter.ODD_COUNT).increment(1);
23
24 }
25 context.write(item, sales);
26 }
27}
Driver Class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

    // The enum is the counter group; each field is a counter in that group.
    enum Counter {
        ODD_COUNT,
        EVEN_COUNT
    }

    public static void main(String[] args) throws Exception {
        int exitFlag = ToolRunner.run(new Driver(), args);
        System.exit(exitFlag);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // use the configuration supplied by ToolRunner
        Job job = Job.getInstance(conf, "ODD-EVEN Counter");
        job.setJarByClass(getClass());
        job.setMapperClass(MapperClass.class);
        job.setNumReduceTasks(0); // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}
Here, we have two classes.

One is the Driver class, where we have created one enum to implement the counters. As discussed before, the enum name Counter is the counter group name, and ODD_COUNT and EVEN_COUNT are the two counters we use to count odd and even numbers.

The Driver also sets the various configuration parameters: the mapper class (there is no reducer class, so the number of reduce tasks is set to zero), basic details like the output key and output value class types, the input path from which we read the input, and the output path where we store the output.

The other class is the Mapper class, where we actually update the counters for even and odd numbers.

context.getCounter(Driver.Counter.EVEN_COUNT).increment(1);

This line increments the counter for an even number, and there is a matching line that increments the counter for an odd number.

Here are the counter values we get when we run the job on the input file mentioned above.

Output: the job's counter summary shows ODD_COUNT=3 and EVEN_COUNT=3 in the Counter group, matching the expected output given above.

12. Explain the different ways of sorting datasets in MapReduce

Partial Sort:
The reducer output will be a lot of files, each of which is sorted within itself based on the key.

Total Sort:
The reducer output will be a single file having all the output sorted based on the key.

Secondary Sort:
In this case, we are able to control the ordering of the values along with the keys. That is, sorting can be done on two or more field values.

Secondary Sort

Secondary sort is a technique that can be used for sorting data on multiple fields. It relies on a composite key which contains all the values we want to use for sorting.

In this section we read our "donations" SequenceFile and map each donation record to a (CompositeKey, DonationWritable) pair before shuffling and reducing.

All classes used in this section can be viewed on GitHub in this package: https://github.com/nicomak/[…]/secondarysort.

The MapReduce secondary sort job which is executed to get our query results is in the OrderByCompositeKey.java (view) file from the same package.

Composite Key

Our query wants to sort the results on 3 values, so we create a WritableComparable class called CompositeKey (view), sketched below, with the following 3 attributes:

• state (String) – used as our natural key (or primary key) for partitioning
• city (String) – a secondary key for sorting keys with the same "state" natural key within a partition
• total (float) – another secondary key for further sorting when the "city" is the same
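Here is a minimal sketch of such a CompositeKey, assuming the three fields described above; the full version, along with the state-only partitioner and grouping comparator it is paired with, is in the linked GitHub package:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: natural key (state) plus secondary sort fields (city, total).
// equals()/hashCode() are omitted here for brevity.
public class CompositeKey implements WritableComparable<CompositeKey> {
    private String state;
    private String city;
    private float total;

    public CompositeKey() {} // no-arg constructor required by Hadoop

    public CompositeKey(String state, String city, float total) {
        this.state = state;
        this.city = city;
        this.total = total;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(state);
        out.writeUTF(city);
        out.writeFloat(total);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        state = in.readUTF();
        city = in.readUTF();
        total = in.readFloat();
    }

    @Override
    public int compareTo(CompositeKey o) {
        // Sort by state, then city, then total (descending here, assuming
        // we want the largest totals first).
        int cmp = state.compareTo(o.state);
        if (cmp != 0) return cmp;
        cmp = city.compareTo(o.city);
        if (cmp != 0) return cmp;
        return Float.compare(o.total, total);
    }
}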

1. Partial Sort

Definition: In a partial sort, each reducer's output file is sorted internally by key, but there is no global ordering across the output files.

Use Case: This is MapReduce's default behaviour, and it is sufficient when consumers only need per-file ordering, for example for efficient key lookups within each output partition.

Example: Sorting a dataset of sales records by key so that each output partition can be searched efficiently.

2. Total Sort

Definition: A total sort involves sorting the entire dataset based on a specified key.

Use Case: This is applicable when you need a complete ordered view of the data.

Example: Sorting a list of student records by their grades.
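In Hadoop, a total sort is typically implemented with TotalOrderPartitioner, which assigns key ranges to reducers so that every key in partition i is smaller than every key in partition i+1; the ranges come from sampling the input. A minimal sketch of the relevant job setup, assuming IntWritable keys and Text values as an illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortSetup {
    // Sample the input to build split points, then partition by key range so
    // the concatenated reducer outputs are globally sorted.
    static void configureTotalSort(Job job) throws Exception {
        job.setPartitionerClass(TotalOrderPartitioner.class);
        // Sample 10% of keys, up to 10,000 samples from at most 10 splits.
        InputSampler.Sampler<IntWritable, Text> sampler =
                new InputSampler.RandomSampler<>(0.1, 10000, 10);
        InputSampler.writePartitionFile(job, sampler);
    }
}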


3. Secondary Sort

Definition: Secondary sort allows sorting by multiple criteria. After the primary
sort is applied, the secondary sort organizes the data further based on a
secondary key.

Use Case: Useful when you need to sort by one key and then by another, like
sorting by last name and then by first name.

Example:

 Sorting a list of employees by department (primary key) and then by their


hire date (secondary key).

Diagram:

13. Explain Java counters used in MapReduce

• Dynamic counters – counters identified by group and counter name strings, created at runtime rather than declared as an enum at compile time.
• Readable counter names – enum counters can be given human-readable display names instead of the raw enum field names.
• Retrieving counters – counter values can be read back from the Job object during or after execution, as shown in the sketch below.
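As a small illustration of retrieving counters, the values can be read back from the Job object once the job has completed; this sketch reuses the Driver.Counter enum from question 11 and assumes the classes share a package:

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CounterReport {
    // After job.waitForCompletion(...), read the counter values back.
    static void report(Job job) throws Exception {
        Counters counters = job.getCounters();
        long odd = counters.findCounter(Driver.Counter.ODD_COUNT).getValue();
        long even = counters.findCounter(Driver.Counter.EVEN_COUNT).getValue();
        System.out.printf("ODD_COUNT=%d, EVEN_COUNT=%d%n", odd, even);
    }
}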
