Hadoop Tutorial
Due 11:59pm January 17, 2017
General Instructions
The purpose of this tutorial is (1) to get you started with Hadoop and (2) to get you
acquainted with the code and homework submission system. Completing the tutorial is
optional, but students who hand in the results on time will earn 5 points. This tutorial is
to be completed individually.
Here you will learn how to write, compile, debug, and execute a simple Hadoop program.
The first part of the assignment serves as a tutorial, and the second part asks you to write
your own Hadoop program.
Section 1 describes the virtual machine environment. Instead of the virtual machine, you
are welcome to set up your own pseudo-distributed or fully distributed cluster if you prefer.
Any version of Hadoop that is at least 1.0 will suffice. (For an easy way to set up a
cluster, try Cloudera Manager: http://archive.cloudera.com/cm5/installer/latest/
cloudera-manager-installer.bin.) If you choose to set up your own cluster, you are
responsible for making sure the cluster is working properly. The TAs will be unable to help
you debug configuration issues in your own cluster.
Section 2 explains how to use the Eclipse environment in the virtual machine, including how
to create a project, how to run jobs, and how to debug jobs. Section 2.5 gives an end-to-end
example of creating a project, adding code, building, running, and debugging it.
Section 3 is the actual homework assignment. There are no deliverables for Sections 1 and 2.
In Section 3, you are asked to write and submit your own MapReduce job.
This assignment requires you to upload the code and hand-in the output for Section 3.
All students should submit the output via Gradescope and upload the code via snap.
Gradescope: Submit the output via Gradescope (https://gradescope.com).
Upload the code: Put all the code for a single question into a single file and upload it at
http://snap.stanford.edu/submit/. You must aggregate all the code in a single
file (one file per question), and it must be a text file.
CS246: Mining Massive Datasets - Problem Set 0 2
Questions
The supplied virtual machine image includes the following software:
• CentOS 6.4
• JDK 7 (1.7.0_67)
• Hadoop 2.5.0
• Eclipse 4.2.6 (Juno)
The virtual machine runs best with 4096MB of RAM, but has been tested to
function with 1024MB. Note that at 1024MB, while it did technically function,
it was very slow to start up.
1. Standalone (or local) mode: There are no daemons used in this mode. Hadoop
uses the local file system as a substitute for HDFS. The jobs will run as
if there were a single mapper and a single reducer.
2. Pseudo-distributed mode: All the daemons run locally on a single machine, using
the HDFS protocol, which mimics the behavior of a cluster. There can be multiple
mappers and reducers.
In this homework we will show you how to run Hadoop jobs in Standalone mode (very useful
for developing and debugging) and also in Pseudo-distributed mode (to mimic the behavior
of a cluster environment).
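Conceptually, in either mode a job's records flow through a map phase, a shuffle that groups intermediate values by key, and a reduce phase. The following plain-Java sketch simulates that dataflow for word counting (no Hadoop dependencies; the class and method names are illustrative, and a modern JDK is assumed for brevity):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {
    // Map phase: emit a (word, 1) pair for every token in every input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle + reduce: group the pairs by key and sum the values per key.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("to be or", "not to be");
        System.out.println(reduce(map(input)));  // {be=2, not=1, or=1, to=2}
    }
}
```

In standalone mode this whole pipeline runs inside a single JVM, much like the single-mapper, single-reducer behavior described above; in pseudo-distributed mode the map and reduce phases run as separate daemon-managed tasks.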
(There is a plugin for Eclipse that makes it simple to create a new Hadoop project and
execute Hadoop jobs, but the plugin is only well maintained for Hadoop 1.0.4, which
is a rather old version of Hadoop. There is a project at https://github.com/winghc/
hadoop2x-eclipse-plugin that is working to update the plugin for Hadoop 2.0. You can
try it out if you like, but your mileage may vary.)
To create a project:
1. Open Eclipse. If you just launched the VM, you may have to close the Firefox window
to find the Eclipse icon on the desktop.
2. Right-click on the training node in the Package Explorer and select Copy. See Figure
1.
3. Right-click on the training node in the Package Explorer and select Paste. See Figure 2.
4. In the pop-up dialog, enter the new project name in the Project Name field and click
OK. See Figure 3.
5. Modify or replace the stub classes found in the src directory as needed.
Once you’ve created your project and written the source code, to run the project in stand-
alone mode, do the following:
1. Right-click on the project and select Run As → Run Configurations. See Figure 4.
2. In the pop-up dialog, select the Java Application node and click the New launch con-
figuration button in the upper left corner. See Figure 5.
3. Enter a name in the Name field and the name of the main class in the Main class field.
See Figure 6.
4. Switch to the Arguments tab and input the required arguments. Click Apply. See
Figure 7. To run the job immediately, click on the Run button. Otherwise click Close
and complete the following step.
5. Right-click on the project and select Run As → Java Application. See Figure 8.
6. In the pop-up dialog select the main class from the selection list and click OK. See
Figure 9.
After you have set up the run configuration the first time, you can skip steps 1 and
2 above in subsequent runs, unless you need to change the arguments. You can also
create more than one launch configuration if you’d like, such as one for each set of
common arguments.
Once you’ve created your project and written the source code, to run the project in pseudo-
distributed mode, do the following:
1. Export the project as a JAR file: right-click on the project and select Export.
2. In the pop-up dialog, expand the Java node and select JAR file. See Figure 11. Click
Next >.
3. Enter a path in the JAR file field and click Finish. See Figure 12.
To debug an issue with a job, the easiest approach is to run the job in stand-alone mode
and use a debugger. To debug your job, do the following steps:
1. Right-click on the project and select Debug As → Java Application. See Figure 13.
2. In the pop-up dialog select the main class from the selection list and click OK. See
Figure 14.
You can use the Eclipse debugging features to debug your job execution. See the additional
Eclipse tutorials at the end of section 2.6 for help using the Eclipse debugger.
When running your job in pseudo-distributed mode, the output from the job is logged in the
task attempts’ log files, which can be accessed most easily by pointing a web browser to port
8088 of the server, which will be localhost in the VM. From the ResourceManager web page
served there, you can drill down into the failing job, the failing task, the failed attempt,
and finally the log files. Note
that the logs for stdout and stderr are separated, which can be useful when trying to isolate
specific debugging print statements.
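For example, routing routine progress messages to stdout and diagnostics to stderr keeps them in separate log files, which makes each kind easier to find. A minimal illustration (the messages here are made up):

```java
public class LogStreams {
    public static void main(String[] args) {
        // Ends up in the stdout log of the task attempt.
        System.out.println("processed record 42");
        // Ends up in the separate stderr log, away from routine output.
        System.err.println("unexpected value in field 3, skipping record");
    }
}
```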
In this section you will create a new Eclipse Hadoop project, compile it, and execute it. The
program will count the frequency of all the words in a given large text file. In your virtual
machine, Hadoop, the Java environment, and Eclipse have already been pre-installed.
• Open Eclipse. If you just launched the VM, you may have to close the Firefox window
to find the Eclipse icon on the desktop.
• Right-click on the training node in the Package Explorer and select Copy. See Figure
15.
• Right-click on the training node in the Package Explorer and select Paste. See Figure
16.
• In the pop-up dialog, enter the new project name in the Project Name field and click
OK. See Figure 17.
• Create a new package by right-clicking on the src node and selecting New → Package.
Enter edu.stanford.cs246.wordcount in the Name field and click Finish. See Figure
19.
• Create a new class in that package called WordCount by right-clicking on the
edu.stanford.cs246.wordcount node and selecting New → Class. See Figure 20.
• In the pop-up dialog, enter WordCount as the Name. See Figure 21.
• In the Superclass field, enter Configured and click the Browse button. From the pop-up
window select Configured − org.apache.hadoop.conf and click the OK button. See
Figure 22.
• In the Interfaces section, click the Add button. From the pop-up window select Tool −
org.apache.hadoop.util and click the OK button. See Figure 23.
• Check the boxes for public static void main(String args[]) and Inherited abstract meth-
ods and click the Finish button. See Figure 24.
• You will now have a rough skeleton of a Java file as in Figure 25. You can now add
code to this class to implement your Hadoop job.
• Rather than implement a job from scratch, copy the contents from http://snap.
stanford.edu/class/cs246-data-2014/WordCount.java and paste it into the WordCount.java
file. See Figure 26. The code in WordCount.java calculates the frequency of each word
in a given dataset.
• Right-click on the project and select Run As → Run Configurations. See Figure 27.
• In the pop-up dialog, select the Java Application node and click the New launch con-
figuration button in the upper left corner. See Figure 28.
• Enter a name in the Name field and WordCount in the Main class field. See Figure 29.
• Switch to the Arguments tab and put pg100.txt output in the Program arguments
field. See Figure 30. Click Apply and Close.
• Right-click on the project and select Run As → Java Application. See Figure 31.
You will see the command output in the console window, and if the job succeeds,
you’ll find the results in the ~/workspace/WordCount/output directory. If the job
fails complaining that it cannot find the input file, make sure that the pg100.txt file
is located in the ~/workspace/WordCount directory.
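By default, Hadoop’s TextOutputFormat writes each record as the key and value separated by a tab, so the part-r-00000 file in the output directory contains lines like word<TAB>count. A small sketch of parsing one such line (the parse helper is illustrative, not part of Hadoop):

```java
public class OutputLine {
    // Split a default TextOutputFormat line ("word<TAB>count") into its parts.
    static String[] parse(String line) {
        int tab = line.indexOf('\t');
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parse("hadoop\t17");
        System.out.println(kv[0] + " occurred " + kv[1] + " times");
        // hadoop occurred 17 times
    }
}
```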
• To run the job in pseudo-distributed mode, first export the project as a JAR file:
right-click on the project and select Export. In the pop-up dialog, expand the Java
node and select JAR file. See Figure 34. Click Next >.
• Enter /home/cloudera/wordcount.jar in the JAR file field and click Finish. See
Figure 35.
If you see an error dialog warning that the project compiled with warnings, you can
simply click OK.
• Open a terminal in your VM, navigate to the folder /home/cloudera, and run the
following commands:
hadoop fs -put workspace/WordCount/pg100.txt
hadoop jar wordcount.jar edu.stanford.cs246.wordcount.WordCount pg100.txt
output
• To view the job’s logs, open the browser in the VM and point it to http://localhost:
8088 as in Figure 38
• Click on the link for the completed job. See Figure 39.
• Click the link for the map tasks. See Figure 40.
• Click the link for the first attempt. See Figure 41.
• Click the link for the full logs. See Figure 42.
If you’d rather use your own development environment instead of working in the IDE, follow
these steps:
1. Make sure that you have an entry for localhost.localdomain in your /etc/hosts
file.
2. Install a copy of Hadoop locally. The easiest way to do that is to simply download
the archive from http://archive.cloudera.com/cdh5/cdh/5/hadoop-latest.tar.
gz and unpack it.
3. In the unpacked archive, you’ll find a etc/hadoop directory. In that directory, open
the core-site.xml file and modify it as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.56.101:8020</value>
  </property>
</configuration>
4. Next, open the yarn-site.xml file in the same directory and modify it as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.56.101</value>
  </property>
</configuration>
You can now run the Hadoop binaries located in the bin directory in the archive, and
they will connect to the cluster running in your virtual machine.
• Write a Hadoop MapReduce program that outputs the number of words that start
with each letter: for every letter, count the total number of words that start with that
letter. In your implementation, ignore letter case, i.e., treat all words as lower case.
You can ignore all non-alphabetic characters.
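As a rough starting point, the per-word logic can be sketched in plain Java, independent of the Hadoop plumbing. The class and method names below are illustrative, not required by the assignment, and the sketch reflects one reasonable reading of "ignore all non-alphabetic characters": words that do not start with a letter are skipped.

```java
import java.util.Map;
import java.util.TreeMap;

public class LetterCountSketch {
    // Return the lowercase first letter of a word, or '\0' if the word
    // does not start with an alphabetic character (such words are skipped).
    static char firstLetter(String word) {
        if (word.isEmpty()) return '\0';
        char c = Character.toLowerCase(word.charAt(0));
        return (c >= 'a' && c <= 'z') ? c : '\0';
    }

    public static void main(String[] args) {
        Map<Character, Integer> counts = new TreeMap<>();
        for (String w : "Apple and 7up bring Zest".split("\\s+")) {
            char letter = firstLetter(w);
            if (letter != '\0') counts.merge(letter, 1, Integer::sum);
        }
        System.out.println(counts);  // {a=2, b=1, z=1}
    }
}
```

In your actual MapReduce job, the mapper would emit (letter, 1) pairs using this per-word logic, and the reducer would sum the counts per letter, just as in the WordCount example.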
What to hand-in: Submit the printout of the output file to Gradescope (https://gradescope.com),
and upload the source code at http://snap.stanford.edu/submit/.