Big Data

LAB-11-A
Timings: 11:30 am - 2:30 pm

Lab Protocols:

1. This lab will include tasks at the end. Cheating will result in a straight zero.
2. Making noise in the lab during the demonstration will result in immediate termination
of the session and the start of tasks.
3. For queries, contact me by email: [email protected]
MapReduce for the word count problem on Hadoop:
In the MapReduce word count example, we find the frequency of each word. Here, the role
of the Mapper is to emit a (word, 1) key-value pair for every word it reads, and the role of the
Reducer is to aggregate the values of each common key into that word's total count. So,
everything is represented in the form of key-value pairs.
Example:
Let’s solve a word count problem using MapReduce on Hadoop.
Step 1: Open Cloudera Quickstart VM.

Step 2: Create a .txt data file inside the /home/cloudera directory that will be passed as input to
the MapReduce program. For simplicity, we name it word_count_data.txt.
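One way to create the file is shown below; the sample sentences are only an illustration (any plain-text content works):

```shell
# Create the input file (run inside /home/cloudera); the two sample
# lines are placeholders -- substitute any text you like.
printf 'hello world\nhello hadoop\n' > word_count_data.txt
cat word_count_data.txt
```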

Step 3: Create mapper.py and reducer.py files inside /home/cloudera directory.
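The handout does not reproduce the two scripts, so here is a minimal sketch of the logic they would contain. The real mapper.py and reducer.py would loop over sys.stdin and print tab-separated lines (as Hadoop Streaming requires); the functions below capture the same logic in a testable form, and the sample input is hypothetical.

```python
def mapper(lines):
    # The Mapper's job: emit a (word, 1) pair for every word it reads.
    # In the real mapper.py this would read sys.stdin and print
    # "word\t1" lines instead of yielding tuples.
    for line in lines:
        for word in line.strip().split():
            yield (word, 1)

def reduce_sorted(pairs):
    # The Reducer's job: sum the counts for each word. It assumes the
    # pairs arrive sorted by key, which Hadoop's shuffle/sort phase
    # (or `sort -k1,1` when testing locally) guarantees.
    current, total = None, 0
    for word, count in pairs:
        if word == current:
            total += count
        else:
            if current is not None:
                yield (current, total)
            current, total = word, count
    if current is not None:
        yield (current, total)

if __name__ == "__main__":
    sample = ["hello world", "hello hadoop"]  # illustrative input only
    for word, count in reduce_sorted(sorted(mapper(sample))):
        print("%s\t%d" % (word, count))
```

Running the demo prints each distinct word with its count, one per line, exactly the shape the Hadoop job will later produce.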


Step 4: Test the MapReduce programs locally to check that everything works properly before
running them on Hadoop.
cat word_count_data.txt | python mapper.py | sort -k1,1 | python reducer.py
For the above example, the output obtained is exactly as expected.
If you see all the words correctly mapped, sorted, and reduced to their respective counts, then
your program is ready to be tested on Hadoop.
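The sort -k1,1 in the middle of the pipeline stands in for Hadoop's shuffle/sort phase: it groups identical keys (first field) together so the reducer sees each word's values contiguously. A small demonstration with hand-written mapper output (the word\t1 lines are illustrative):

```shell
# sort -k1,1 orders lines by their first field only, grouping
# repeated keys together -- the same guarantee Hadoop's shuffle
# phase gives the Reducer.
printf 'world\t1\nhello\t1\nhello\t1\n' | sort -k1,1
```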
Step 5: Configure Hadoop services and settings.
Now, we need to configure certain settings on Hadoop before we run the MapReduce program
for word count.
5a: Log in to Cloudera Quickstart.
Open a browser on the Cloudera Quickstart VM and go to quickstart.cloudera:7180/cmf/login.
Log in by entering cloudera as both the username and password.
Note: If you see the error “Unable to connect” while logging in to
quickstart.cloudera:7180/cmf/login, try restarting the CDH services.

Restart the CDH services by typing the following command:
sudo /home/cloudera/cloudera-manager --express --force

5b: Start HDFS and YARN services.


Click the dropdown arrow and choose Start option for HDFS and YARN services.

You’ll see the following if both the HDFS and YARN services started successfully.

HDFS service started successfully.


Step 6: Create a directory on HDFS
Now, we create a directory named word_count_map_reduce on HDFS where our input data
and its resulting output would be stored. Use the following command for it.
hdfs dfs -mkdir /word_count_map_reduce

Note: If the directory already exists, then either create a directory with new name or delete
the existing directory using the following command.
export HADOOP_USER_NAME=hdfs
hdfs dfs -rmr /word_count_map_reduce
List HDFS directory items using the following command.
hdfs dfs -ls /

Step 7: Move input data file to HDFS.


Copy the word_count_data.txt file to word_count_map_reduce directory on HDFS using the
following command.
hdfs dfs -put /home/cloudera/word_count_data.txt /word_count_map_reduce
Check that the file was copied successfully to the desired location.
hdfs dfs -ls /word_count_map_reduce

Step 8: Download hadoop-streaming JAR 2.7.3.


Open a browser, go to https://jar-download.com/artifacts/org.apache.hadoop/hadoop-
streaming?p=4, and download the hadoop-streaming JAR 2.7.3 file.

Once the file is downloaded, unzip it inside /home/cloudera directory.


Double-check that the JAR file was unzipped successfully and is present inside the
/home/cloudera directory.
ls

Step 9: Configure permissions to run MapReduce on Hadoop.


We’re almost ready to run our MapReduce job on Hadoop, but before that, we need to grant
read, write, and execute permissions on the Mapper and Reducer programs.
We also need to give the default user (cloudera) permission to write the output file inside
HDFS.
Run the following commands to do so:
chmod 777 mapper.py reducer.py
hdfs dfs -chown cloudera /word_count_map_reduce
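The execute bit matters because Hadoop Streaming runs the scripts directly, which also requires each script to begin with a shebang line such as #!/usr/bin/env python. A hypothetical sanity check in a scratch directory (the stub file below is for illustration only):

```shell
# Create a stub script in a temporary directory, grant rwx to
# everyone as in the step above, and confirm the execute bit took.
cd "$(mktemp -d)"
printf '#!/usr/bin/env python\n' > mapper.py
chmod 777 mapper.py
test -x mapper.py && echo "mapper.py is executable"
```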

Step 10: Run MapReduce on Hadoop.


We’re at the final step of this lab. Run the MapReduce job on Hadoop using the
following command.
hadoop jar /home/cloudera/hadoop-streaming-2.7.3.jar \
> -input /word_count_map_reduce/word_count_data.txt \
> -output /word_count_map_reduce/output \
> -mapper /home/cloudera/mapper.py \
> -reducer /home/cloudera/reducer.py
If the terminal output reports the job as completed successfully, then the MapReduce job was
executed successfully.
Step 11: Read the MapReduce output.
Now, finally, run the following command to read the MapReduce word count output for the
input data file you created.
hdfs dfs -cat /word_count_map_reduce/output/part-00000
Congratulations, the output for MapReduce on Hadoop is obtained exactly as expected. All the
words in the input data file have been mapped, sorted, and reduced to their respective counts.
