Labs, Lecture 2
Lab: Running a MapReduce Job
Files and Directories Used in this Exercise
Files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
wc.jar: The compiled, assembled WordCount program.
In this lab you will compile Java files, create a JAR, and run MapReduce jobs.
1. In a terminal window, change to the exercise source directory and list its contents:
$ cd ~/workspace/wordcount/src
$ ls
$ ls stubs
The stubs directory contains the following Java files:
WordCount.java: A simple MapReduce driver class.
WordMapper.java: A mapper class for the job.
SumReducer.java: A reducer class for the job.
Examine these files if you wish, but do not change them. Remain in this
directory while you execute the following commands.
2. Examine the classpath that Hadoop is configured to use:
$ hadoop classpath
This lists the locations where the Hadoop core API classes are installed.
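3. Compile the three Java classes, putting the Hadoop core API classes on the classpath (a typical invocation; the exact flags may vary with your environment):
$ javac -classpath `hadoop classpath` stubs/*.java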
Note: in the command above, the quotes around hadoop classpath are
backquotes. This runs the hadoop classpath command and uses its
output as part of the javac command.
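4. Collect your compiled Java files into a JAR file (one straightforward way to do this; the name wc.jar matches the JAR used in the next step):
$ jar cvf wc.jar stubs/*.class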
5. Submit a MapReduce job to Hadoop using your JAR file to count the occurrences
of each word in Shakespeare:
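$ hadoop jar wc.jar stubs.WordCount shakespeare wordcounts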
This hadoop jar command names the JAR file to use (wc.jar), the class
whose main method should be invoked (stubs.WordCount), and the HDFS
input and output directories to use for the MapReduce job.
3
Your job reads all the files in your HDFS shakespeare directory, and places its
output in a new HDFS directory called wordcounts.
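6. Try running the same job again, using the same command (and thus the same output directory):
$ hadoop jar wc.jar stubs.WordCount shakespeare wordcounts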
Your job halts right away with an exception, because Hadoop automatically fails any job that tries to write its output into an existing directory. This is by design; since the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you from accidentally overwriting previously existing files.
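7. Review the results of your job:
$ hadoop fs -ls wordcounts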
This lists the output files for your job. (Your job ran with only one Reducer, so
there should be one file, named part-r-00000, along with a _SUCCESS file
and a _logs directory.)
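8. View the contents of the output file for your job:
$ hadoop fs -cat wordcounts/part-r-00000 | less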
You can page through a few screens to see words and their frequencies in the
works of Shakespeare. (The spacebar will scroll the output by one screen; the
letter 'q' will quit the less utility.) Note that you could have specified
wordcounts/* just as well in this command.
Wildcards in HDFS file paths
Take care when using wildcards (e.g. *) in HDFS file paths: because of how Linux works, the shell will attempt to expand the wildcard before invoking hadoop, and will then pass incorrect references to local files instead of HDFS files. You can prevent this by enclosing the wildcarded HDFS path in single quotes, e.g. hadoop fs -cat 'wordcounts/*'
When the job completes, inspect the contents of the pwords HDFS directory.
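For example:
$ hadoop fs -ls pwords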
Lab: Stopping a MapReduce Job
1. Start another word count job like you did in the previous section, specifying a new (not yet existing) output directory:
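$ hadoop jar wc.jar stubs.WordCount shakespeare count2
(The output directory name count2 is just an example; any directory that does not already exist will do.)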
2. While this job is running, open another terminal window and enter:
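$ hadoop job -list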
This lists the job ids of all running jobs. A job id looks something like:
job_200902131742_0002
3. Copy the job id, and then kill the running job by entering:
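$ hadoop job -kill <job_id>
(Substitute the job id you copied in place of <job_id>.)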
The JobTracker kills the job, and the program running in the original terminal
completes.