Homework Labs With Professor Notes
Table of Contents
Getting Started
Copyright 2010-2014 Cloudera, Inc. All rights reserved.
Not to be reproduced without prior written consent.
1. The VM is set to automatically log in as the user training. Should you log out at any time, you can log back in as the user training with the password training.
/user/training/shakespeare
The dollar sign ($) at the beginning of each line indicates the Linux shell prompt. The actual prompt will include additional information (e.g., [training@localhost workspace]$), but this is omitted from these instructions for brevity.

The backslash (\) at the end of the first line signifies that the command is not completed, and continues on the next line. You can enter the code exactly as shown (on two lines), or you can enter it on a single line. If you do the latter, you should not type in the backslash.
3. Although many students are comfortable using UNIX text editors like vi or emacs, some might prefer a graphical text editor. To invoke the graphical editor from the command line, type gedit followed by the path of the file you wish to edit. Appending & to the command allows you to type additional commands while the editor is still open. Here is an example of how to edit a file named myfile.txt:
$ gedit myfile.txt &
In this lab you will begin to get acquainted with the Hadoop tools. You will manipulate files in HDFS, the Hadoop Distributed File System.
Hadoop

Hadoop is already installed, configured, and running on your virtual machine. Most of your interaction with the system will be through a command-line wrapper called hadoop. If you run this program with no arguments, it prints a help message. To try this, run the following command in a terminal window:
$ hadoop
The hadoop command is subdivided into several subsystems. For example, there is a subsystem for working with files in HDFS and another for launching and managing MapReduce processing jobs.
Relative paths
If you pass any relative (non-absolute) paths to FsShell commands (or use relative
paths in MapReduce programs), they are considered relative to your home directory.
6. We will also need a sample web server log file, which we will put into HDFS for use in future labs. This file is currently compressed using GZip. Rather than extract the file to the local disk and then upload it, we will extract and upload in one step. First, create a directory in HDFS in which to store it:
$ hadoop fs -mkdir weblog
7. Now, extract and upload the file in one step. The -c option to gunzip uncompresses to standard output, and the dash (-) in the hadoop fs -put command takes whatever is being sent to its standard input and places that data in HDFS.
$ gunzip -c access_log.gz \
| hadoop fs -put - weblog/access_log
8. Run the hadoop fs -ls command to verify that the log file is in your HDFS home directory.
9. The access log file is quite large (around 500 MB). Create a smaller version of this file, consisting only of its first 5000 lines, and store the smaller version in HDFS. You can use the smaller version for testing in subsequent labs.
$ hadoop fs -mkdir testlog
$ gunzip -c access_log.gz | head -n 5000 \
| hadoop fs -put - testlog/test_access_log
3. Enter:
$ hadoop fs -cat shakespeare/histories | tail -n 50
This prints the last 50 lines of Henry IV, Part 1 to your terminal. This command is handy for viewing the output of MapReduce programs. Very often, an individual output file of a MapReduce program is very large, making it inconvenient to view the entire file in the terminal. For this reason, it's often a good idea to pipe the output of the fs -cat command into head, tail, more, or less.
4. To download a file to work with on the local filesystem, use the fs -get command. This command takes two arguments: an HDFS path and a local path. It copies the HDFS contents into the local filesystem:
$ hadoop fs -get shakespeare/poems ~/shakepoems.txt
$ less ~/shakepoems.txt
Other Commands
There are several other operations available with the hadoop fs command to perform most common filesystem manipulations: mv, cp, mkdir, etc.
1. Enter:
$ hadoop fs
This displays a brief usage report of the commands available within FsShell. Try playing around with a few of these commands if you like.
In this lab you will compile Java files, create a JAR, and run MapReduce jobs.

In addition to manipulating files in HDFS, the wrapper program hadoop is used to launch MapReduce jobs. The code for a job is contained in a compiled JAR file. Hadoop loads the JAR into HDFS and distributes it to the worker nodes, where the individual tasks of the MapReduce job are executed.

One simple example of a MapReduce job is to count the number of occurrences of each word in a file or set of files. In this lab you will compile and submit a MapReduce job to count the number of occurrences of every word in the works of Shakespeare.
Your job reads all the files in your HDFS shakespeare directory, and places its output in a new HDFS directory called wordcounts.

6. Try running this same command again without any change:
$ hadoop jar wc.jar solution.WordCount \
shakespeare wordcounts
Your job halts right away with an exception, because Hadoop automatically fails if your job tries to write its output into an existing directory. This is by design; since the result of a MapReduce job may be expensive to reproduce, Hadoop prevents you from accidentally overwriting previously existing files.
7. Review the result of your MapReduce job:
$ hadoop fs -ls wordcounts
This lists the output files for your job. (Your job ran with only one Reducer, so there should be one file, named part-r-00000, along with a _SUCCESS file and a _logs directory.)
8. View the contents of the output for your job:
$ hadoop fs -cat wordcounts/part-r-00000 | less
You can page through a few screens to see words and their frequencies in the works of Shakespeare. (The spacebar will scroll the output by one screen; the letter 'q' will quit the less utility.) Note that you could have specified wordcounts/* just as well in this command.
1. Start another word count job like you did in the previous section:
$ hadoop jar wc.jar solution.WordCount shakespeare \
count2
2. While this job is running, open another terminal window and enter:
$ mapred job -list
This lists the job ids of all running jobs. A job id looks something like:

job_200902131742_0002
3. Copy the job id, and then kill the running job by entering:
$ mapred job -kill jobid
The JobTracker kills the job, and the program running in the original terminal completes.
In this lab, you will write a MapReduce job that reads any text input and computes the average length of all words that start with each character. For any text input, the job should report the average length of words that begin with a, b, and so forth. For example, for input:
No now is definitely not the time
The output would be:

N	2.0
d	10.0
i	2.0
n	3.0
t	3.5
(For the initial solution, your program should be case-sensitive as shown in this example.)
The Algorithm

The algorithm for this program is a simple one-pass MapReduce program:
The Mapper

The Mapper receives a line of text for each input value. (Ignore the input key.) For each word in the line, emit the first letter of the word as a key, and the length of the word as a value. For example, for input value:
No now is definitely not the time
Your Mapper should emit:

N	2
n	3
i	2
d	10
n	3
t	3
t	4
The Reducer

Thanks to the shuffle and sort phase built in to MapReduce, the Reducer receives the keys in sorted order, and all the values for one key are grouped together. So, for the Mapper output above, the Reducer receives this:
N	(2)
d	(10)
i	(2)
n	(3,3)
t	(3,4)

The Reducer output (the average of each key's list of values) would be:

N	2.0
d	10.0
i	2.0
n	3.0
t	3.5
You may wish to refer back to the wordcount example (in the wordcount project in Eclipse or in ~/workspace/wordcount) as a starting point for your Java code. Here are a few details to help you begin your Java programming:
3. Define the driver

This class should configure and submit your basic job. Among the basic steps here, configure the job with the Mapper class and the Reducer class you will write, and the data types of the intermediate and final keys.
4. Define the Mapper

Note these simple string operations in Java:
str.substring(0, 1)
str.length()
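If you would like a starting point, here is a minimal sketch of one possible Mapper and Reducer for this lab. The class and package names are placeholders (your stub files may use different ones), and the tokenizing logic is deliberately simple:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (first letter, word length) for every word in the input line.
public class LetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word.substring(0, 1)),
                      new IntWritable(word.length()));
      }
    }
  }
}

// Averages the word lengths received for each starting letter.
class AverageReducer extends Reducer<Text, IntWritable, Text, DoubleWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0, count = 0;
    for (IntWritable v : values) {
      sum += v.get();
      count++;
    }
    context.write(key, new DoubleWritable((double) sum / count));
  }
}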
19
1. Verify that your Java code does not have any compiler errors or warnings. The Eclipse software in your VM is pre-configured to compile code automatically without performing any explicit steps. Compile errors and warnings appear as red and yellow icons to the left of the code.
2. In the Package Explorer, open the Eclipse project for the current lab (i.e., averagewordlength). Right-click the default package under the src entry and select Export.
3. Select Java > JAR file from the Export dialog box, then click Next.
4. Specify a location for the JAR file. You can place your JAR files wherever you like, e.g.:
Note: For more information about using Eclipse, see the Eclipse Reference in Homework_EclipseRef.docx.
(Sample output excerpt: each line of the result pairs a starting character with the average length of words beginning with it; values in the full Shakespeare run include 1.02, 1.0588235294117647, 1.0, 1.5, 3.891394576646375, 5.139302507836991, and 6.629694233531706.)
This example uses the entire Shakespeare dataset for your input; you can also try it with just one of the files in the dataset, or with your own test data.
In this lab, you will analyze a log file from a web server to count the number of hits made from each unique IP address.

Your task is to count the number of hits made from each IP address in the sample (anonymized) web server log file that you uploaded to the /user/training/weblog directory in HDFS when you completed the Using HDFS lab.
In the log_file_analysis directory, you will find stubs for the Mapper and Driver.
1. Using the stub files in the log_file_analysis project directory, write Mapper and Driver code to count the number of hits made from each IP address in the access log file. Your final result should be a file in HDFS containing each IP address, and the count of log hits from that address. (A hedged Mapper sketch follows these steps.)
Note: The Reducer for this lab performs the exact same function as the one in the WordCount program you ran earlier. You can reuse that code or you can write your own if you prefer.
2. Build your application jar file following the steps in the previous lab.
3. Test your code using the sample log data in the /user/training/weblog directory.
Note: You may wish to test your code against the smaller version of the access log you created in a prior lab (located in the /user/training/testlog HDFS directory) before you run your code against the full log, which can be quite time consuming.
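The following is a minimal sketch of one possible Mapper for step 1. It assumes the IP address is the first whitespace-delimited field of each log line, and the class name is a placeholder for whatever your stub uses; the summing Reducer is the same as in WordCount:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (ip_address, 1) for each log line; a WordCount-style summing
// Reducer then produces the per-IP hit counts.
public class LogFileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private Text ip = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().trim();
    if (!line.isEmpty()) {
      ip.set(line.split("\\s+")[0]);   // assumes the IP is the first field
      context.write(ip, ONE);
    }
  }
}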
In this lab you will repeat the same task as in the previous lab: writing a program to calculate average word lengths for letters. However, you will write this as a streaming program using a scripting language of your choice rather than using Java.
Your virtual machine has Perl, Python, PHP, and Ruby installed, so you can choose any of these (or even shell scripting) to develop a Streaming solution.
For your Hadoop Streaming program you will not use Eclipse. Launch a text editor to write your Mapper script and your Reducer script. Here are some notes about solving the problem in Hadoop Streaming:
1. The Mapper Script

The Mapper will receive lines of text on stdin. Find the words in the lines to produce the intermediate output, and emit intermediate (key, value) pairs by writing strings of the form:
key <tab> value <newline>
These strings should be written to stdout.
2. The Reducer Script

For the reducer, multiple values with the same key are sent to your script on stdin as successive lines of input. Each line contains a key, a tab, a value, and a newline. All lines with the same key are sent one after another, possibly followed by lines with a different key, until the reducing input is complete. For example, the reduce script may receive the following:
t	3.5
t	5.0
Observe that the reducer receives a key with each input line, and must notice when the key changes on a subsequent line (or when the input is finished) to know when the values for a given key have been exhausted. This is different than the Java version you worked on in the previous lab.
3. Run the streaming program:
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce/\
contrib/streaming/hadoop-streaming*.jar \
-input inputDir -output outputDir \
-file pathToMapScript -file pathToReduceScript \
-mapper mapBasename -reducer reduceBasename
(Remember, you may need to delete any previous output before running your program by issuing: hadoop fs -rm -r dataToDelete.)
4. Review the output in the HDFS directory you specified (outputDir).
Professor's Note: The Perl example is in:

~/workspace/wordcount/perl_solution
Professor's Note: Solution in Python
You can find a working solution to this lab written in Python in the directory ~/workspace/averagewordlength/python_sample_solution.

To run the solution, change directory to ~/workspace/averagewordlength and run this command:
$ hadoop jar /usr/lib/hadoop-0.20-mapreduce\
/contrib/streaming/hadoop-streaming*.jar \
-input shakespeare -output avgwordstreaming \
-file python_sample_solution/mapper.py \
-file python_sample_solution/reducer.py \
-mapper mapper.py -reducer reducer.py
In this Exercise, you will write Unit Tests for the WordCount code.
1. Launch Eclipse (if necessary) and expand the mrunit folder.
2. Examine the TestWordCount.java file in the mrunit project stubs package. Notice that three tests have been created, one each for the Mapper, Reducer, and the entire MapReduce flow. Currently, all three tests simply fail.
3. Run the tests by right-clicking on TestWordCount.java in the Package Explorer panel and choosing Run As > JUnit Test.
4. Observe the failure. Results in the JUnit tab (next to the Package Explorer tab) should indicate that three tests ran with three failures.
5. Now implement the three tests. (A sketch of one possible Mapper test appears after these steps.)
6. Run the tests again. Results in the JUnit tab should indicate that three tests ran with no failures.
7. When you are done, close the JUnit tab.
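As a reference for step 5, here is a minimal sketch of what the Mapper test might look like with MRUnit. It assumes the Mapper under test is named WordMapper (emitting each word with a count of 1) and that the MRUnit driver classes from the org.apache.hadoop.mrunit.mapreduce package are on the classpath; adjust the names to match your stubs:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TestWordCount {
  private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    // Wire the Mapper under test into an MRUnit driver.
    mapDriver = MapDriver.newMapDriver(new WordMapper());
  }

  @Test
  public void testMapper() throws IOException {
    // One input record in, three expected (word, 1) pairs out.
    mapDriver.withInput(new LongWritable(1), new Text("cat cat dog"));
    mapDriver.withOutput(new Text("cat"), new IntWritable(1));
    mapDriver.withOutput(new Text("cat"), new IntWritable(1));
    mapDriver.withOutput(new Text("dog"), new IntWritable(1));
    mapDriver.runTest();
  }
}

The Reducer test and the full MapReduce flow test follow the same pattern with ReduceDriver and MapReduceDriver.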
2. Modify the AvgWordLength driver to use ToolRunner. Refer to the slides for details. (A sketch of the ToolRunner pattern follows step 3.)
a. Implement the run method
b. Modify main to call run
3. Jar your solution and test it before continuing; it should continue to function exactly as it did before. Refer to the Writing a Java MapReduce Program lab for how to assemble and test if you need a reminder.
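A minimal sketch of the ToolRunner pattern is shown below; the class name AvgWordLength matches the lab, but the Mapper and Reducer class names are placeholders for whatever your project actually uses:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvgWordLength extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() picks up any -D options passed on the command line.
    Job job = new Job(getConf(), "Average Word Length");
    job.setJarByClass(AvgWordLength.class);
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(LetterMapper.class);        // placeholder names
    job.setReducerClass(AverageReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new AvgWordLength(), args);
    System.exit(exitCode);
  }
}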
5. Modify the driver's run method to set a Boolean configuration parameter called caseSensitive. (Hint: Use the Configuration.setBoolean method; see the sketch after the hint below.)
6. Test your code twice, once passing false and once passing true. When set to true, your final output should have both upper and lower case letters; when false, it should have only lower case letters.
Hint: Remember to rebuild your Jar file to test changes to your code.
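One hedged sketch of how the two sides could fit together, using the parameter name caseSensitive from step 5 (everything else here is illustrative and should be adapted to your own classes):

// In the driver's run() method, before submitting the job, something like:
//   getConf().setBoolean("caseSensitive", Boolean.parseBoolean(args[2]));

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LetterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private boolean caseSensitive;

  @Override
  protected void setup(Context context) {
    // Defaults to false if the parameter was not set on the command line.
    caseSensitive = context.getConfiguration().getBoolean("caseSensitive", false);
  }

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = caseSensitive ? value.toString() : value.toString().toLowerCase();
    for (String word : line.split("\\W+")) {
      if (word.length() > 0) {
        context.write(new Text(word.substring(0, 1)), new IntWritable(word.length()));
      }
    }
  }
}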
In this lab, you will add a Combiner to the WordCount program to reduce the amount of intermediate data sent from the Mapper to the Reducer.
Because summing is associative and commutative, the same class can be used for both the Reducer and the Combiner.
Implement a Combiner
1. Copy WordMapper.java and SumReducer.java from the wordcount project to the combiner project.
2. Modify the WordCountDriver.java code to add a Combiner for the WordCount program. (See the sketch after step 3.)
3. Assemble and test your solution. (The output should remain identical to the WordCount application without a combiner.)
In this lab, you will practice running a job locally for debugging and testing purposes.
In the Using ToolRunner and Passing Parameters lab, you modified the Average Word Length program to use ToolRunner. This makes it simple to set job configuration properties on the command line.
5. On the Main tab, confirm that the Project and Main class are set correctly for your project, e.g. Project: toolrunner and Main class: stubs.AvgWordLength
6. Select the Arguments tab and enter the input and output folders. (These are local, not HDFS, folders, and are relative to the run configuration's working folder, which by default is the project folder in the Eclipse workspace: e.g. ~/workspace/toolrunner.)
7. Click the Run button. The program will run locally with the output displayed in the Eclipse console window.
8. Review the job output in the local output folder you specified.
Note: You can re-run any previous configurations using the Run or Debug history buttons on the Eclipse tool bar.
4. In the task summary, click map to view the map tasks.
5. In the list of tasks, click on the map task to view the details of that task.
6. Under Task Logs, click All. The logs should include both INFO and DEBUG messages.
Hints

1. You should use a Map-only MapReduce job, by setting the number of Reducers to 0 in the driver code.
2. For input data, use the Web access log file that you uploaded to the HDFS /user/training/weblog directory in the Using HDFS lab.
Note: Test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory.
3. Use a counter group such as ImageCounter, with names gif, jpeg and other.
4. In your driver code, retrieve the values of the counters after the job has completed and report them using System.out.println. (A sketch covering hints 1, 3, and 4 follows these hints.)
5. The output folder on HDFS will contain Mapper output files which are empty, because the Mappers did not write any data.
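The following sketch illustrates hints 1, 3, and 4 under some assumptions: the counter group is named ImageCounter as suggested, the class name is a placeholder, and the file-type test is a simple substring check on the log line rather than a careful parse of the request field:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Counts the image type of each requested file; emits no output records.
public class ImageCountMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString().toLowerCase();
    if (line.contains(".gif")) {
      context.getCounter("ImageCounter", "gif").increment(1);
    } else if (line.contains(".jpg") || line.contains(".jpeg")) {
      context.getCounter("ImageCounter", "jpeg").increment(1);
    } else {
      context.getCounter("ImageCounter", "other").increment(1);
    }
    // No output is written, so the job's output files will be empty.
  }
}

// In the driver:
//   job.setNumReduceTasks(0);                               // Map-only job (hint 1)
//   boolean success = job.waitForCompletion(true);
//   long gifs = job.getCounters()
//                  .findCounter("ImageCounter", "gif").getValue();
//   System.out.println("gif = " + gifs);                    // likewise for jpeg and other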
In this Exercise, you will write a MapReduce job with multiple Reducers, and create a Partitioner to determine which Reducer each piece of Mapper output is sent to.
The Problem

In the More Practice with Writing MapReduce Java Programs lab you did previously, you built the code in the log_file_analysis project. That program counted the number of hits for each different IP address in a web log file. The final output was a file containing a list of IP addresses, and the number of hits from that address.
This time, you will perform a similar task, but the final output should consist of 12 files, one for each month of the year: January, February, and so on. Each file will contain a list of IP addresses, and the number of hits from that address in that month.
We will accomplish this by having 12 Reducers, each of which is responsible for processing the data for a particular month. Reducer 0 processes January hits, Reducer 1 processes February hits, and so on.
Note: We are actually breaking the standard MapReduce paradigm here, which says that all the values from a particular key will go to the same Reducer. In this example, which is a very common pattern when analyzing log files, values from the same key (the IP address) will go to multiple Reducers, based on the month portion of the line.
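A minimal sketch of what such a Partitioner might look like, assuming the Mapper's output value is a Text containing the three-letter month abbreviation (your actual key/value layout may differ, and the class name is a placeholder):

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each record to the Reducer for its month:
// partition 0 = Jan, 1 = Feb, ..., 11 = Dec.
public class MonthPartitioner extends Partitioner<Text, Text> {
  private static final List<String> MONTHS = Arrays.asList(
      "Jan", "Feb", "Mar", "Apr", "May", "Jun",
      "Jul", "Aug", "Sep", "Oct", "Nov", "Dec");

  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    int month = MONTHS.indexOf(value.toString());
    return (month < 0 ? 0 : month) % numReduceTasks;   // guard against bad input
  }
}

// In the driver:
//   job.setNumReduceTasks(12);
//   job.setPartitionerClass(MonthPartitioner.class);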
You may wish to test your code against the smaller version of the access log in the /user/training/testlog directory before you run your code against the full log in the /user/training/weblog directory. However, note that the test data may not include all months, so some result files will be empty.
In this lab, you will create a custom WritableComparable type that holds two strings. Test the new type by creating a simple program that reads a list of names (first and last) and counts the number of occurrences of each name.
The mapper should accept lines in the form:
lastname firstname other data
The goal is to count the number of times a lastname/firstname pair occurs within the dataset. For example, for input:
Smith Joe 1963-08-12 Poughkeepsie, NY
Smith Joe 1832-01-20 Sacramento, CA
Murphy Alice 2004-06-02 Berlin, MA
We want to output:

(Smith,Joe)	2
(Murphy,Alice)	1
Note: You will use your custom WritableComparable type in a future lab, so make sure it
is working with the test job now.
StringPairWritable
You need to implement a WritableComparable object that holds the two strings. The stub provides an empty constructor for serialization, a standard constructor that will be given two strings, a toString method, and the generated hashCode and equals methods. You will need to implement the readFields, write, and compareTo methods required by WritableComparables.

Note that Eclipse automatically generated the hashCode and equals methods in the stub file. You can generate these two methods in Eclipse by right-clicking in the source code and choosing Source > Generate hashCode() and equals().
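Here is one hedged sketch of how those three methods might be implemented, assuming the class stores its two strings in fields named left and right (your stub's field names may differ):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class StringPairWritable implements WritableComparable<StringPairWritable> {
  private String left;    // e.g. last name
  private String right;   // e.g. first name

  public StringPairWritable() {}                     // required for serialization

  public StringPairWritable(String left, String right) {
    this.left = left;
    this.right = right;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    // Serialize both strings; readFields must read them back in the same order.
    out.writeUTF(left);
    out.writeUTF(right);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    left = in.readUTF();
    right = in.readUTF();
  }

  @Override
  public int compareTo(StringPairWritable other) {
    // Sort by the first string, then by the second.
    int cmp = left.compareTo(other.left);
    return (cmp != 0) ? cmp : right.compareTo(other.right);
  }

  @Override
  public String toString() {
    return "(" + left + "," + right + ")";
  }
  // hashCode() and equals() are generated by Eclipse, as described above.
}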
In this lab you will practice reading and writing uncompressed and compressed SequenceFiles.

First, you will develop a MapReduce application to convert text data to a SequenceFile. Then you will modify the application to compress the SequenceFile using Snappy file compression.
When creating the SequenceFile, use the full access log file for input data. (You uploaded the access log file to the HDFS /user/training/weblog directory when you performed the Using HDFS lab.)
After you have created the compressed SequenceFile, you will write a second MapReduce application to read the compressed SequenceFile and write a text file that contains the original log file text.
5. Verify that the number of files created by the job is equivalent to the number of blocks required to store the uncompressed SequenceFile.
7. Compile the code and run your modified MapReduce job. For the MapReduce output, specify the compressedsf directory.
8. Examine the first portion of the output SequenceFile. Notice the differences between the uncompressed and compressed SequenceFiles: you cannot read the log file text in the compressed file.
9. Compare the file sizes of the uncompressed and compressed SequenceFiles in the uncompressedsf and compressedsf directories. The compressed SequenceFiles should be smaller.
In this lab, you will write a MapReduce job that produces an inverted index.
For this lab you will use an alternate input, provided in the file invertedIndexInput.tgz. When decompressed, this archive contains a directory of files; each is a Shakespeare play formatted as follows:
0	HAMLET
1
2
3	DRAMATIS PERSONAE
4
5
6	CLAUDIUS
7
8	HAMLET
9
10	POLONIUS
...
Each line contains:

Line number
separator: a tab character
value: the line of text

This format can be read directly using the KeyValueTextInputFormat class provided in the Hadoop API. This input format presents each line as one record to your Mapper, with the part before the tab character as the key, and the part after the tab as the value.
Given a body of text in this form, your indexer should produce an index of all the words in the text. For each word, the index should have a list of all the locations where the word appears.
For example, for the word honeysuckle your output should look like this:

honeysuckle	2kinghenryiv@1038,midsummernightsdream@2175,...
The index should contain such an entry for every word in the text.
Hints

You may like to complete this lab without reading any further, or you may find the following hints about the algorithm helpful.
The Mapper

Your Mapper should take as input a key and a line of words, and emit as intermediate values each word as key, and the key as value.
For example, the line of input from the file hamlet:

282	Have heaven and earth together

produces intermediate output:
Have	hamlet@282
heaven	hamlet@282
and	hamlet@282
earth	hamlet@282
together	hamlet@282
The Reducer

Your Reducer simply aggregates the values presented to it for the same key into one value. Use a separator like "," between the values listed.
In this lab, you will write an application that counts the number of times words appear next to each other.
Test your application using the files in the shakespeare folder you previously copied into HDFS in the Using HDFS lab.
Note that this implementation is a specialization of Word Co-Occurrence as we describe it in the notes; in this case we are only interested in pairs of words which appear directly next to each other.
1. Change directories to the word_co-occurrence directory within the labs directory.
2. Complete the Driver and Mapper stub files; you can use the standard SumReducer from the WordCount project as your Reducer. Your Mapper's intermediate output should be in the form of a Text object as the key, and an IntWritable as the value; the key will be word1,word2, and the value will be 1.
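One hedged sketch of such a Mapper (the class name is a placeholder and the tokenization is deliberately simple):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits ("word1,word2", 1) for every adjacent pair of words in the line.
public class CoOccurrenceMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] words = value.toString().toLowerCase().split("\\W+");
    for (int i = 0; i < words.length - 1; i++) {
      if (words[i].length() > 0 && words[i + 1].length() > 0) {
        context.write(new Text(words[i] + "," + words[i + 1]), ONE);
      }
    }
  }
}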
Extra Credit
If you have extra time, please complete these additional challenges:
In this lab, you will run a MapReduce job in different ways to see the effects of various components in a secondary sort program.

The program accepts lines in the form:
lastname firstname birthdate
The goal is to identify the youngest person with each last name. For example, for input:
Murphy Joanne 1963-08-12
Murphy Douglas 1832-01-20
Murphy Alice 2004-06-02
We want to write out:
Murphy Alice 2004-06-02
All the code is provided to do this. Following the steps below you are going to progressively add each component to the job to accomplish the final goal.
10. Re-run the job, adding a second parameter to set the partitioner class to use:
-Dmapreduce.partitioner.class=example.NameYearPartitioner
11. Review the output again, this time noting that all records with the same last name have been partitioned to the same reducer. However, they are still being sorted into the default sort order (name, year ascending). We want it sorted by name ascending/year descending.
6. Revise the previous command and import the customers table into HDFS.
7. Revise the previous command and import the products table into HDFS.
8. Revise the previous command and import the orders table into HDFS.
9. Next, you will import the order_details table into HDFS. The command is slightly different because this table only holds references to records in the orders and products tables, and lacks a primary key of its own. Consequently, you will need to specify the --split-by option and instruct Sqoop to divide the import work among map tasks based on values in the order_id field. An alternative is to use the -m 1 option to force Sqoop to import all the data with a single task, but this would significantly reduce performance.
$ sqoop import \
--connect jdbc:mysql://localhost/dualcore \
--username training --password training \
--fields-terminated-by '\t' \
--warehouse-dir /dualcore \
--table order_details \
--split-by=order_id
Background Information

Dualcore has recently started using online advertisements to attract new customers to its e-commerce site.
Each of the two ad networks they use provides data about the ads they've placed. This includes the site where the ad was placed, the date when it was placed, what keywords triggered its display, whether the user clicked the ad, and the per-click cost.
Unfortunately, the data from each network is in a different format. Each file also contains some invalid records. Before we can analyze the data, we must first correct these problems by using Pig to:
Reorder fields
Correct inconsistencies
5. Load the first two columns' data from the sample file as character data, and then dump that data:
grunt> first_2_columns = LOAD 'sample1.txt' AS
(keyword:chararray, campaign_id:chararray);
grunt> DUMP first_2_columns;
6. Use the DESCRIBE command in Pig to review the schema of first_2_columns:
grunt> DESCRIBE first_2_columns;
The schema appears in the Grunt shell. Use the DESCRIBE command while performing these labs any time you would like to review schema definitions.
7. See what happens if you run the DESCRIBE command on data. Recall that when you loaded data, you did not define a schema.
grunt> DESCRIBE data;
8. End your Grunt shell session:
grunt> QUIT;
Field         Data Type  Description                      Example
keyword       chararray  Keyword that triggered ad        tablet
campaign_id   chararray  Uniquely identifies the ad       A3
date          chararray  Date of ad display               05/29/2013
time          chararray  Time of ad display               15:49:21
display_site  chararray  Domain where ad shown            www.example.com
was_clicked   int        Whether ad was clicked           1
cpc           int        Cost per click, in cents         106
country       chararray  Name of country in which ad ran  USA
placement     chararray  Where on page was ad displayed   TOP
2. Once you have edited the LOAD statement, try it out by running your script in local mode:
$ pig -x local first_etl.pig
Make sure the output looks correct (i.e., that you have the fields in the expected order and the values appear similar in format to that shown in the table above) before you continue with the next step.
3. Make each of the following changes, running your script in local mode after each one to verify that your change is correct:
a. Update your script to filter out all records where the country field does not contain USA.
b. We need to store the fields in a different order than we received them. Use a FOREACH GENERATE statement to create a new relation containing the fields in the same order as shown in the following table (the country field is not included since all records now have the same value):
Index  Field         Description
0      campaign_id   Uniquely identifies the ad
1      date          Date of ad display
2      time          Time of ad display
3      keyword       Keyword that triggered ad
4      display_site  Domain where ad shown
5      placement     Where on page was ad displayed
6      was_clicked   Whether ad was clicked
7      cpc           Cost per click, in cents
c. Update your script to convert the keyword field to uppercase and to remove any leading or trailing whitespace (Hint: You can nest calls to the two built-in functions inside the FOREACH GENERATE statement from the last statement).
4. Add the complete data file to HDFS:
$ hadoop fs -put $ADIR/data/ad_data1.txt /dualcore
5. Edit first_etl.pig and change the path in the LOAD statement to match the path of the file you just added to HDFS (/dualcore/ad_data1.txt).
6. Next, replace DUMP with a STORE statement that will write the output of your processing as tab-delimited records to the /dualcore/ad_data1 directory.
7. Run this script in Pig's MapReduce mode to analyze the entire file in HDFS:
$ pig first_etl.pig
If your script fails, check your code carefully, fix the error, and then try running it again. Don't forget that you must remove output in HDFS from a previous run before you execute the script again.
8. Check the first 20 output records that your script wrote to HDFS and ensure they look correct (you can ignore the message "cat: Unable to write to output stream"; this simply happens because you are writing more data with the fs -cat command than you are reading with the head command):
$ hadoop fs -cat /dualcore/ad_data1/part* | head -20
a. Are the fields in the correct order?
b. Are all the keywords now in uppercase?
Field         Data Type  Description                     Example
campaign_id   chararray  Uniquely identifies the ad      A3
date          chararray  Date of ad display              05/29/2013
time          chararray  Time of ad display              15:49:21
display_site  chararray  Domain where ad shown           www.example.com
placement     chararray  Where on page was ad displayed  TOP
was_clicked   int        Whether ad was clicked          Y
cpc           int        Cost per click, in cents        106
keyword       chararray  Keyword that triggered ad       tablet
3. Once you have edited the LOAD statement, use the DESCRIBE keyword and then run your script in local mode to check that the schema matches the table above:
$ pig -x local second_etl.pig
4. Replace DESCRIBE with a DUMP statement and then make each of the following changes to second_etl.pig, running this script in local mode after each change to verify what you've done before you continue with the next step:
d. This ad network sometimes logs a given record twice. Add a statement to the second_etl.pig file so that you remove any duplicate records. If you have done this correctly, you should only see one record where the display_site field has a value of siliconwire.example.com.
e. As before, you need to store the fields in a different order than you received them. Use a FOREACH GENERATE statement to create a new relation containing the fields in the same order you used to write the output from the first ad network (shown again in the table below) and also use the UPPER and TRIM functions to correct the keyword field as you did earlier:
Index  Field         Description
0      campaign_id   Uniquely identifies the ad
1      date          Date of ad display
2      time          Time of ad display
3      keyword       Keyword that triggered ad
4      display_site  Domain where ad shown
5      placement     Where on page was ad displayed
6      was_clicked   Whether ad was clicked
7      cpc           Cost per click, in cents
f. The date field in this data set is in the format MM-DD-YYYY, while the data you previously wrote is in the format MM/DD/YYYY. Edit the FOREACH GENERATE statement to call the REPLACE(date, '-', '/') function to correct this.
5. Once you are sure the script works locally, add the full data set to HDFS:
$ hadoop fs -put $ADIR/data/ad_data2.txt /dualcore
6. Edit the script to have it LOAD the file you just added to HDFS, and then replace the DUMP statement with a STORE statement to write your output as tab-delimited records to the /dualcore/ad_data2 directory.
7. Run your script against the data you added to HDFS:
$ pig second_etl.pig
8. Check the first 15 output records written in HDFS by your script:
$ hadoop fs -cat /dualcore/ad_data2/part* | head -15
a. Do you see any duplicate records?
b. Are the fields in the correct order?
c. Are all the keywords in uppercase?
3. Open the low_cost_sites.pig file in your editor, and then make the following changes:
a. Modify the LOAD statement to read the sample data in the test_ad_data.txt file.
b. Add a line that creates a new relation to include only records where was_clicked has a value of 1.
c. Group this filtered relation by the display_site field.
d. Create a new relation that includes two fields: the display_site and the total cost of all clicks on that site.
e. Sort that new relation by cost (in ascending order).
f. Display just the first three records to the screen.
4. Once you have made these changes, try running your script against the sample data:
$ pig -x local low_cost_sites.pig
5. In the LOAD statement, replace the test_ad_data.txt file with a file glob (pattern) that will load both the /dualcore/ad_data1 and /dualcore/ad_data2 directories (and does not load any other data, such as the text files from the previous lab).
6. Once you have made these changes, try running your script against the data in HDFS:
$ pig low_cost_sites.pig
Question: Which three sites have the lowest overall cost?
3. Once you have made these changes, try running your script against the data in HDFS:
$ pig project_next_campaign_cost.pig
Question: What is the maximum you expect this campaign might cost?
Professor's Note: You can compare your solution to the one in the bonus_02/sample_solution/ subdirectory.
3. Once you have made these changes, try running your script against the data in HDFS:
$ pig count_orders_by_period.pig
Question: Does the data suggest that the advertising campaign we started in May led to a substantial increase in orders?
5. Once you have made these changes, try running your script against the data in HDFS:
$ pig count_tablet_orders_by_period.pig
Question: Does the data show an increase in sales of the advertised product corresponding to the month in which Dualcore's campaign was active?
Since we are considering the total sales price of orders in addition to the number of orders a customer has placed, not every customer with at least five orders during 2012 will qualify. In fact, only about one percent of the customers will be eligible for membership in one of these three groups.
During this lab, you will write the code needed to filter the list of orders based on date, group them by customer ID, count the number of orders per customer, and then filter this to exclude any customer who did not have at least five orders.
You will then join this information with the order details and products data sets in order to calculate the total sales of those orders for each customer, split them into the groups based on the criteria described above, and then write the data for each group (customer ID and total sales) into a separate directory in HDFS.
Background Information
Dualcore outsources its call center operations and costs have recently risen due to an increase in the volume of calls handled by these agents.
Unfortunately, Dualcore does not have access to the call center's database, but they are provided with recordings of these calls stored in MP3 format.
By using Pig's STREAM keyword to invoke a provided Python script, you can extract the category and timestamp from the files, and then analyze that data to learn what is causing the recent increase in calls.
To solve this problem, Dualcore will open a new distribution center to improve shipping times. The ZIP codes for the three proposed sites are 02118, 63139, and 78237.
You will look up the latitude and longitude of these ZIP codes, as well as the ZIP codes of customers who have recently ordered, using a supplied data set.
Once you have the coordinates, you will use the HaversineDistInMiles UDF distributed with DataFu to determine how far each customer is from the three data centers.
You will then calculate the average distance for all customers to each of these data centers in order to propose the one that will benefit the most customers.
1. Add the tab-delimited file mapping ZIP codes to latitude/longitude points to HDFS:
$ hadoop fs -mkdir /dualcore/distribution
$ hadoop fs -put $ADIR/data/latlon.tsv \
/dualcore/distribution
2. A script (create_cust_location_data.pig) has been provided to find the ZIP codes for customers who placed orders during the period of the ad campaign. It also excludes the ones who are already close to the current facility, as well as customers in the remote states of Alaska and Hawaii (where orders are shipped by airplane).
The Pig Latin code joins these customers' ZIP codes with the latitude/longitude data set uploaded in the previous step, then writes those three columns (ZIP code, latitude, and longitude) as the result. Examine the script to see how it works, and then run it to create the customer location data in HDFS:
$ pig create_cust_location_data.pig
3. You will use the HaversineDistInMiles function to calculate the distance from each customer to each of the three proposed warehouse locations. This function requires us to supply the latitude and longitude of both the customer and the warehouse. While the script you just executed created the latitude and longitude for each customer, you must create a data set containing the ZIP code, latitude, and longitude for these warehouses. Do this by running the following UNIX command:
$ egrep '^02118|^63139|^78237' \
$ADIR/data/latlon.tsv > warehouses.tsv
4. Next, add this file to HDFS:
$ hadoop fs -put warehouses.tsv /dualcore/distribution
5. Edit the calc_average_distances.pig file. The UDF is already registered and an alias for this function named DIST is defined at the top of the script, just before the two data sets you will use are loaded. You need to complete the rest of this script:
a. Create a record for every combination of customer and proposed distribution center location.
b. Use the function to calculate the distance from the customer to the warehouse.
c. Calculate the average distance for all customers to each warehouse.
d. Display the result to the screen.
6. After you have finished implementing the Pig Latin code described above, run the script:
$ pig calc_average_distances.pig
Question: Which of these three proposed ZIP codes has the lowest average mileage to Dualcore's customers?
Hive Prompt
To make it easier to copy queries and paste them into your terminal window, we
do not show the hive> prompt in subsequent steps. Steps prefixed with
$ should be executed on the UNIX command line; the rest should be run in Hive
unless otherwise noted.
3. Make the query results easier to read by setting the property that will make Hive show column headers:
set hive.cli.print.header=true;
4. All you know about the winner is that her name is Bridget and she lives in Kansas City. Use Hive's LIKE operator to do a wildcard search for names such as "Bridget", "Bridgette" or "Bridgitte". Remember to filter on the customer's city.
Question: Which customer did your query identify as the winner of the $5,000 prize?
Before Dualcore can authorize the accounting department to pay the $5,000 prize, you must ensure that Bridget is eligible. Since this query involves joining data from several tables, it's a perfect case for running it as a Hive script.
1. Study the HiveQL code for the query to learn how it works:
$ cat verify_tablet_order.hql
2. Execute the HiveQL script using the hive command's -f option:
$ hive -f verify_tablet_order.hql
Question: Did Bridget order the advertised tablet in May?
1. Start the Firefox Web browser by clicking the orange and blue icon near the top of the VM window, just to the right of the System menu. Once Firefox starts, type http://localhost:8888/ into the address bar, and then hit the enter key.
2. After a few seconds, you should see Hue's login screen. Enter training in both the username and password fields, and then click the Sign In button. If prompted to remember the password, decline by hitting the ESC key so you can practice this step again later if you choose.
Although several Hue applications are available through the icons at the top of the page, the Beeswax query editor is shown by default.
3. Select default from the database list on the left side of the page.
4. Write a query in the text area that will count the number of records in the customers table, and then click the Execute button.
Question: How many customers does Dualcore serve?
5. Click the Query Editor link in the upper left corner, and then write and run a query to find the ten states with the most customers.
Question: Which state has the most customers?
Which top three products has Dualcore sold more of than any other? Hint: Remember that if you use a GROUP BY clause in Hive, you must group by all fields listed in the SELECT clause that are not part of an aggregate function.
What was Dualcore's gross profit (sales price minus cost) in May, 2013?
The results of the above queries are shown in cents. Rewrite the gross profit query to format the value in dollars and cents (e.g., $2000000.00). To do this, you can divide the profit by 100 and format the result using the PRINTF function and the format string "$%.2f".
Professor's Note: There are several ways you could write each query, and you can find one solution for each problem in the bonus_01/sample_solution/ directory.
3. Start Hive:
$ hive
4. It is always a good idea to validate data after adding it. Execute the Hive query shown below to count the number of suppliers in Texas:
SELECT COUNT(*) FROM suppliers WHERE state='TX';
The query should show that nine records match.
2. Run the following Hive query to verify that you have created the table correctly:
SELECT job_title, COUNT(*) AS num
FROM employees
GROUP BY job_title
ORDER BY num DESC
LIMIT 3;
It should show that Sales Associate, Cashier, and Assistant Manager are the three most common job titles at Dualcore.
2. Show the table description and verify that its fields have the correct order, names, and types:
DESCRIBE ratings;
3. Next, open a separate terminal window (File -> Open Terminal) so you can run the following shell command. This will populate the table directly by using the hadoop fs command to copy product ratings data from 2012 to that directory in HDFS:
$ hadoop fs -put $ADIR/data/ratings_2012.txt \
/user/hive/warehouse/ratings
Leave the window open afterwards so that you can easily switch between Hive and the command prompt.
4. Next, verify that Hive can read the data we just added. Run the following query in Hive to count the number of records in this table (the result should be 464):
SELECT COUNT(*) FROM ratings;
5. Another way to load data into a Hive table is through the LOAD DATA command. The next few commands will lead you through the process of copying a local file to HDFS and loading it into Hive. First, copy the 2013 ratings data to HDFS:
$ hadoop fs -put $ADIR/data/ratings_2013.txt /dualcore
6. Verify that the file is there:
$ hadoop fs -ls /dualcore/ratings_2013.txt
7. Use the LOAD DATA statement in Hive to load that file into the ratings table:
LOAD DATA INPATH '/dualcore/ratings_2013.txt' INTO
TABLE ratings;
8. The LOAD DATA INPATH command moves the file to the table's directory. Verify that the file is no longer present in the original directory:
$ hadoop fs -ls /dualcore/ratings_2013.txt
9. Verify that the file is shown alongside the 2012 ratings data in the table's directory:
$ hadoop fs -ls /user/hive/warehouse/ratings
10. Finally, count the records in the ratings table to ensure that all 21,997 are available:
SELECT COUNT(*) FROM ratings;
Background Information

Customer ratings and feedback are great sources of information for both customers and retailers like Dualcore.
However, customer comments are typically free-form text and must be handled differently. Fortunately, Hive provides extensive support for text processing.
2. Start Hive and use the DESCRIBE command to remind yourself of the table's structure.
3. We want to find the product that customers like most, but must guard against being misled by products that have few ratings assigned. Run the following query to find the product with the highest average among all those with at least 50 ratings:
SELECT prod_id, FORMAT_NUMBER(avg_rating, 2) AS
avg_rating
FROM (SELECT prod_id, AVG(rating) AS avg_rating,
COUNT(*) AS num
FROM ratings
GROUP BY prod_id) rated
WHERE num >= 50
ORDER BY avg_rating DESC
LIMIT 1;
4. Rewrite, and then execute, the query above to find the product with the lowest average among products with at least 50 ratings. You should see that the result is product ID 1274673 with an average rating of 1.10.
4. We can infer that customers are complaining about the price of this item, but the comment alone doesn't provide enough detail. One of the words (red) in that comment was also found in the list of trigrams from the earlier query.
Write and execute a query that will find all distinct comments containing the word red that are associated with product ID 1274673.
5. The previous step should have displayed two comments:
Why does the red one cost ten times more than the others?
The second comment implies that this product is overpriced relative to similar products. Write and run a query that will display the record for product ID 1274673 in the products table.
6. Your query should have shown that the product was a 16GB USB Flash Drive (Red) from the Orion brand. Next, run this query to identify similar products:
SELECT *
FROM products
WHERE name LIKE '%16 GB USB Flash Drive%'
AND brand='Orion';
The query results show that there are three almost identical products, but the product with the negative reviews (the red one) costs about ten times as much as the others, just as some of the comments said.
Based on the cost and price columns, it appears that doing text processing on the product ratings has helped Dualcore uncover a pricing error.
3. Populate the table by adding the log file to the table's directory in HDFS:
$ hadoop fs -put $ADIR/data/access.log
/dualcore/web_logs
4. Start the Hive shell in another terminal window.
5. Verify that the data is loaded correctly by running this query to show the top three items users searched for on Dualcore's Web site:
SELECT term, COUNT(term) AS num FROM
(SELECT LOWER(REGEXP_EXTRACT(request,
'/search\\?phrase=(\\S+)', 1)) AS term
FROM web_logs
WHERE request REGEXP '/search\\?phrase=') terms
GROUP BY term
ORDER BY num DESC
LIMIT 3;
You should see that it returns tablet (303), ram (153) and wifi (148).
Note: The REGEXP operator, which is available in some SQL dialects, is similar to LIKE, but uses regular expressions for more powerful pattern matching. The REGEXP operator is synonymous with the RLIKE operator.
Step  Request
1     /cart/checkout/step1-viewcart
2     /cart/checkout/step2-shippingcost
3     /cart/checkout/step3-payment
4     /cart/checkout/step4-receipt
1. Run the following query in Hive to show the number of requests for each step of the checkout process:
SELECT COUNT(*), request
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY request;
The results of this query highlight a major problem. About one out of every three customers abandons their cart after the second step. This might mean millions of dollars in lost revenue, so let's see if we can determine the cause.
2. The log file's cookie field stores a value that uniquely identifies each user session. Since not all sessions involve checkouts at all, create a new table containing the session ID and number of checkout steps completed for just those sessions that do:
CREATE TABLE checkout_sessions AS
SELECT cookie, ip_address, COUNT(request) AS
steps_completed
FROM web_logs
WHERE request REGEXP '/cart/checkout/step\\d.+'
GROUP BY cookie, ip_address;
3. Run this query to show the number of people who abandoned their cart after each step:
SELECT steps_completed, COUNT(cookie) AS num
FROM checkout_sessions
GROUP BY steps_completed;
You should see that most customers who abandoned their order did so after the second step, which is when they first learn how much it will cost to ship their order.
1. Write a HiveQL statement to create a table called cart_items with two fields, cookie and prod_id, based on data selected from the web_logs table. Keep the following in mind when writing your statement:
a. The prod_id field should contain only the seven-digit product ID (Hint: Use the REGEXP_EXTRACT function).
b. Add a WHERE clause with REGEXP using the same regular expression as above so that you only include records where customers are adding items to the cart.
Professor's Note: If you need a hint on how to write the statement, look at the file: sample_solution/create_cart_items.hql
2. Execute the HiveQL statement you just wrote.
3. Verify the contents of the new table by running this query:
SELECT COUNT(DISTINCT cookie) FROM cart_items WHERE
prod_id=1273905;
Professor's Note: If this doesn't return 47, then compare your statement to the file sample_solution/create_cart_items.hql. Make the necessary corrections, and then re-run your statement (after dropping the cart_items table).
Because we will do more analysis later, we'll also include total selling price and total wholesale cost, in addition to the total shipping weight, for all items in the cart.
1. Run the following HiveQL to create a table called cart_orders with the information:
CREATE TABLE cart_orders AS
SELECT z.cookie, steps_completed, zipcode,
SUM(shipping_wt) as total_weight,
SUM(price) AS total_price,
SUM(cost) AS total_cost
FROM cart_zipcodes z
JOIN cart_items i
ON (z.cookie = i.cookie)
JOIN products p
ON (i.prod_id = p.prod_id)
GROUP BY z.cookie, zipcode, steps_completed;
1. Before you can use a UDF, you must add it to Hive's classpath. Run the following command in Hive to do that:
ADD JAR geolocation_udf.jar;
2. Next, you must register the function with Hive and provide the name of the UDF class as well as the alias you want to use for the function. Run the Hive command below to associate our UDF with the alias CALC_SHIPPING_COST:
CREATE TEMPORARY FUNCTION CALC_SHIPPING_COST AS
'com.cloudera.hive.udf.UDFCalcShippingCost';
3. Now create a new table called cart_shipping that will contain the session ID, number of steps completed, total retail price, total wholesale cost, and the estimated shipping cost for each order based on data from the cart_orders table:
CREATE TABLE cart_shipping AS
SELECT cookie, steps_completed, total_price,
total_cost,
CALC_SHIPPING_COST(zipcode, total_weight) AS
shipping_cost
FROM cart_orders;
4. Finally, verify your table by running the following query to check a record:
SELECT * FROM cart_shipping WHERE
cookie='100002920697';
This should show that session as having two completed steps, a total retail price of $263.77, a total wholesale cost of $236.98, and a shipping cost of $9.09.
Note: The total_price, total_cost, and shipping_cost columns in the cart_shipping table contain the number of cents as integers. Be sure to divide results containing monetary amounts by 100 to get dollars and cents.
Step #1: Start the Impala Shell and Refresh the Cache
1. Issue the following commands to start Impala, then change to the directory for this lab:
$ sudo service impala-server start
$ sudo service impala-state-store start
$ cd $ADIR/exercises/interactive
2. First, start the Impala shell:
$ impala-shell
3. Since you created tables and modified data in Hive, Impala's cache of the metastore is outdated. You must refresh it before continuing by entering the following command in the Impala shell:

REFRESH;
cart_shipping (sample rows; monetary values in cents)

cookie          total_price  total_cost  shipping_cost
100054318085    6899         6292        425
100060397203    19218        17520       552
100062224714    7609         7155        556
100064732105    53137        50685       839
100107017704    44928        44200       720
...             ...          ...         ...
You should see that abandoned carts mean that Dualcore is potentially losing out on more than $2 million in revenue! Clearly it's worth the effort to do further analysis.
Note: The total_price, total_cost, and shipping_cost columns in the cart_shipping table contain the number of cents as integers. Be sure to divide results containing monetary amounts by 100 to get dollars and cents.
2. The number returned by the previous query is revenue, but what counts is profit. We calculate gross profit by subtracting the cost from the price. Write and execute a query similar to the one above, but which reports the total lost profit from abandoned carts.
Professor's Note: If you need a hint on how to write this query, you can check the file: sample_solution/abandoned_checkout_profit.sql
After running your query, you should see that Dualcore is potentially losing $111,058.90 in profit due to customers not completing the checkout process.
3. How does this compare to the amount of profit Dualcore receives from customers who do complete the checkout process? Modify your previous query to consider only those records where steps_completed = 4, and then execute it in the Impala shell.
Professor's Note: Check sample_solution/completed_checkout_profit.sql for a hint.
The result should show that Dualcore earns a total of $177,932.93 on completed orders, so abandoned carts represent a substantial proportion of additional profits.
4. The previous two queries show the total profit for abandoned and completed orders, but these aren't directly comparable because there were different numbers of each. It might be the case that one is much more profitable than the other on a per-order basis. Write and execute a query that will calculate the average profit based on the number of steps completed during the checkout process.
Professor's Note: If you need help writing this query, check the file: sample_solution/checkout_profit_by_step.sql
You should observe that carts abandoned after step two represent an even higher average profit per order than completed orders.
cart_shipping (sample rows; monetary values in cents)

cookie          total_price  total_cost  shipping_cost
100054318085    6899         6292        425
100060397203    19218        17520       552
100062224714    7609         7155        556
100064732105    53137        50685       839
100107017704    44928        44200       720
...             ...          ...         ...
You will see that the shipping cost of abandoned orders was almost 10% higher than for completed purchases. Offering free shipping, at least for some orders, might actually bring in more money than passing on the cost and risking abandoned orders.
2. Run the following query to determine the average profit per order over the entire month for the data you are analyzing in the log file. This will help you to determine whether Dualcore could absorb the cost of offering free shipping:
SELECT AVG(price - cost) AS profit
FROM products p
JOIN order_details d
ON (d.prod_id = p.prod_id)
JOIN orders o
ON (d.order_id = o.order_id)
WHERE YEAR(order_date) = 2013
AND MONTH(order_date) = 05;
(Diagram: the products table (prod_id, price, cost) joins to order_details (order_id, prod_id), which joins to orders (order_id, order_date), with sample rows from May and June 2013, as in the query above.)
You should see that the average profit for all orders during May was $7.80. An earlier query you ran showed that the average shipping cost was $8.83 for completed orders and $9.66 for abandoned orders, so clearly Dualcore would lose money by offering free shipping on all orders. However, it might still be worthwhile to offer free shipping on orders over a certain amount.
3. Run the following query, which is a slightly revised version of the previous one, to determine whether offering free shipping only on orders of $10 or more would be a good idea:
SELECT AVG(price - cost) AS profit
FROM products p
JOIN order_details d
ON (d.prod_id = p.prod_id)
JOIN orders o
ON (d.order_id = o.order_id)
WHERE YEAR(order_date) = 2013
AND MONTH(order_date) = 05
AND PRICE >= 1000;
You should see that the average profit on orders of $10 or more was $9.09, so absorbing the cost of shipping would leave very little profit.
4. Repeat the previous query, modifying it slightly each time to find the average profit on orders of at least $50, $100, and $500. You should see that there is a huge spike in the amount of profit for orders of $500 or more (Dualcore makes $111.05 on average for these orders).
5. How much does shipping cost on average for orders totaling $500 or more? Write and run a query to find out.
Professor's Note: The file sample_solution/avg_shipping_cost_50000.sql contains the solution.
You should see that the average shipping cost is $12.28, which happens to be about 11% of the profit brought in on those orders.
6. Since Dualcore won't know in advance who will abandon their cart, they would have to absorb the $12.28 average cost on all orders of at least $500. Would the extra money they might bring in from abandoned carts offset the added cost of free shipping for customers who would have completed their purchases anyway? Run the following query to see the total profit on completed purchases:
SELECT SUM(total_price - total_cost) AS total_profit
FROM cart_shipping
WHERE total_price >= 50000
AND steps_completed = 4;
After running this query, you should see that the total profit for completed orders is $107,582.97.
7. Now, run the following query to find the potential profit, after subtracting shipping costs, if all customers completed the checkout process:
SELECT gross_profit - total_shipping_cost AS
potential_profit
FROM (SELECT
SUM(total_price - total_cost) AS
gross_profit,
SUM(shipping_cost) AS total_shipping_cost
FROM cart_shipping
WHERE total_price >= 50000) large_orders;
Since the result of $120,355.26 is greater than the current profit of $107,582.97 that Dualcore currently earns from completed orders, it appears that they could earn nearly $13,000 more by offering free shipping for all orders of at least $500.
Congratulations! Your hard work analyzing a variety of data with Hadoop's tools has helped make Dualcore more profitable than ever.