Hadoop Admin 171103e Exercise Manual
Table of Contents
General Notes
Hands-On Exercise: Configuring Networking
Hands-On Exercise: Installing Cloudera Manager Server
Hands-On Exercise: Creating a Hadoop Cluster
Hands-On Exercise: Working with HDFS
Hands-On Exercise: Running YARN Applications
Hands-On Exercise: Explore Hadoop Configurations and Daemon Logs
Hands-On Exercise: Using Flume to Put Data into HDFS
Hands-On Exercise: Importing Data with Sqoop
Hands-On Exercise: Querying HDFS with Hive and Impala
Hands-On Exercise: Using Hue to Control Hadoop User Access
Hands-On Exercise: Configuring HDFS High Availability
Hands-On Exercise: Using the Fair Scheduler
Hands-On Exercise: Breaking the Cluster
Hands-On Exercise: Verifying the Cluster’s Self-Healing Features
Hands-On Exercise: Taking HDFS Snapshots
Hands-On Exercise: Configuring Email Alerts
Troubleshooting Challenge: Heap O’ Trouble
Appendix: Continuing Exercises After Class
General Notes
• Should you need superuser access, use sudo as the training user. The training
user has unlimited, passwordless sudo privileges, so you will not be prompted for a
password.
• The cluster machines have been configured so that you can make SSH connections to
them from the ConnectToCluster machine without entering a password.
• The hostnames of the five cluster machines are elephant, tiger, horse, monkey,
and lion.
Bonus Exercises
There are additional challenges for some of the exercises. If you finish the main
exercise, please attempt the additional steps.
Notational Convention
In some command-line steps in the exercises, you will see lines like this:
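For illustration only (this particular command is just an example), such a step might look like:
$ hdfs dfs -put shakespeare.txt \
/tmp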
The backslash at the end of the first line signifies that the command continues on the
next line. You can enter the command exactly as shown (on two lines), or you can enter
it on a single line. If you enter it on a single line, do not type the backslash.
TIP: If you are using a Mac and you want to copy and paste text from a document
on your laptop to a terminal window in the desktop through the Browser, try this
approach:
Catching Up Using the reset_cluster.sh Script
You can run the reset_cluster.sh script to reset your cluster and catch up to a specific
exercise. Use the script if, for example:
• While attempting one of the exercises, you misconfigure your machines so badly that
attempting to do the next exercise is no longer possible.
• You have successfully completed several exercises, but then you receive an
emergency call from work and you must miss some time in class. When you return,
you realize that you have missed two or three exercises. But you want to do the
same exercise everyone else is doing so that you can benefit from the post-exercise
discussions.
The script is destructive: any work that you have done is overwritten when the script
runs. If you want to save files before running the script, you can copy the files you want
to save to a subdirectory under /home/training.
Before you attempt to run the script, verify that networking among the five hosts in
your cluster is working. If networking has not been configured correctly, you can rerun
the CM_config_hosts.sh script to reset the networking configuration prior to
running the reset_cluster.sh script.
Run the script on elephant only. You do not need to change to a directory to run the
script; it is in your shell’s executable PATH.
The script starts by prompting you to enter the number of an exercise. Specify the
exercise you want to perform after the script has run. Then confirm that you want to
reset the cluster (thus overwriting any work you have done).
The script will further prompt you to specify whether you want to run only the steps
that simulate the previous exercise, or to completely uninstall and reinstall the
cluster and then catch up to the specified exercise. Note that choosing to simulate only
the previous exercise does not provide as strong an assurance of a properly configured
cluster as a full reset does; it is, however, faster.
After you have responded to the initial prompts, the script begins by cleaning up your
cluster—terminating Hadoop processes, removing Hadoop software, deleting Hadoop-
related files, and reverting other changes you might have made to the hosts in your
cluster. Once you have answered the initial prompts, you do not need to answer any
further questions that appear in the terminal (the script answers them itself). Please
note that as this system cleanup phase is running, you will see errors such
as “unrecognized service” and “No packages marked for removal.” These errors are
expected. They occur because the script attempts to remove anything pertinent that
might be on your cluster. The number of error messages that appear during this phase
of script execution depends on the state the cluster is in when you start the script.
Next, the script simulates steps for each exercise up to the one you want to perform.
Script completion time varies from five minutes to almost an hour depending on how
many exercise steps need to be simulated by the script.
Exercise Instructions
1. Connect to the ConnectToCluster machine.
2. Start an SSH session to connect to elephant, logging in as the training user.
$ connect_to_elephant.sh
• When prompted to confirm that you want to continue connecting, type yes and
then press the Enter key.
3. Start SSH sessions to connect to the other four cluster machines, logging in as the
training user.
$ connect_to_tiger.sh
• When the script prompts you to confirm that you want to continue connecting,
type yes and then press the Enter key. (You will also need to do this when you
connect to horse, monkey, and lion.)
$ connect_to_horse.sh
$ connect_to_monkey.sh
$ connect_to_lion.sh
4. Verify that you can communicate with all the hosts in your cluster from elephant
by using the hostnames.
On elephant:
$ ping elephant
$ ping tiger
$ ping horse
$ ping monkey
$ ping lion
5. Verify that passwordless SSH works by running the ip addr command on tiger,
horse, monkey, and lion from a session on elephant.
On elephant:
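The individual commands for this step are not reproduced in this copy of the manual. A
minimal sketch, assuming passwordless SSH is configured as described in the General
Notes, might be:
$ ssh tiger ip addr
$ ssh horse ip addr
$ ssh monkey ip addr
$ ssh lion ip addr
Each command should print the remote host’s network interfaces without prompting for a
password.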
Start the SOCKS5 proxy server.
On elephant:
$ start_SOCKS5_proxy.sh
When the script prompts you to confirm that you want to continue connecting,
enter yes and then press the Enter key.
8. Minimize the terminal window in which you started the SOCKS5 proxy server.
Important: Do not close this terminal window or exit the SSH session running
in it while you are working on exercises. If you accidentally terminate the
proxy server, restart it by opening a new terminal window and rerunning the
start_SOCKS5_proxy.sh script.
If, when attempting to start the proxy again, you see an error indicating that “one or
more SOCKS5 proxies are already running,” note the process ID(s) returned in the
error message and issue the following command to kill the stale process:
$ kill pid
where pid is the actual process ID. Then run the start script again.
$ start_SOCKS5_proxy.sh
b. Browse to https://university.cloudera.com/user/learning/
enrollments
d. Select the course title, then click to download the Exercise Manual under
Materials. This will save the Exercise Manual PDF file in the Downloads folder
in the training user’s home directory.
$ evince &
b. In Evince, select File > Open, browse to the Downloads directory, and open the
exercise manual PDF file.
$ java -version
$ echo $JAVA_HOME
$ env | grep PATH
2. Verify Python is installed. It is a requirement for Hue, which you will install later in
the course.
On lion:
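The exact command is not shown in this copy; one quick check, for example, is to print
the installed Python version:
$ python --version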
Note that in a true production deployment you would also modify the
/etc/my.cnf MySQL configurations and move the InnoDB log files to a backup
location.
4. Verify the MySQL JDBC Connector is installed. Sqoop (a part of CDH that you will
install in this course) does not ship with a JDBC connector, but does require one.
On lion:
$ ls -l /usr/share/java
In the MySQL shell, run:
mysql> show databases;
mysql> exit;
The show databases command should show that the six databases (amon,
cmserver, hue, metastore, oozie, and rman) were created.
Note: In a true production deployment, you would also regularly backup your
database.
6. Make your MySQL installation secure and set the root user password.
On lion:
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none)
OK, successfully used password, moving on..
[...]
Set root password? [Y/n] Y
New password: training
Re-enter new password: training
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] Y
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
$ cat /etc/yum.repos.d/cloudera-cm5.repo
$ ls ~/software/cloudera-cm5/RPMS/x86_64
Note that these two locations exist on each of the five machines in your
environment and have also been made available on HTTP ports 8050 and 8000
respectively via a Linux service. This setup is specific to the course environment
and is not required for Cloudera Manager or CDH installations; however, it is
common to configure a similar setup in secure environments where Cloudera
Manager does not have internet access.
If you wanted to install from the online repository, you would create a reference to
the Cloudera repository in your /etc/yum.repos.d directory.
$ sudo /usr/share/cmf/schema/scm_prepare_database.sh \
mysql cmserver cmserveruser password
After running the command above you should see the message, “All done, your SCM
database is configured correctly!”
$ curl lion:8000
$ curl lion:8050/RPMS/x86_64/
Each command should show hyperlinks (<a href="…"> code) to software
repositories. If either command did not successfully contact the web server, discuss
with the instructor before continuing.
$ top
Cloudera Manager Server runs as a Java process that you can view using the
top Linux utility. Notice that CPU usage stays at 90% or higher for a few
seconds while the server starts. Once the CPU usage drops, the Cloudera Manager
browser interface becomes available.
On lion:
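The ps command itself is not reproduced in this copy; one way to inspect the server
process, for example, is:
$ ps -ef | grep cloudera-scm-server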
The results of the ps command above show that Cloudera Manager Server is using
the JDBC MySQL connector to connect to MySQL. It also shows logging configuration
and other details.
You will then be prompted to choose which CDH services you want to add in the cluster
and to which machines you would like to add each service.
At the end of this exercise, you will have Hadoop daemons deployed across your cluster
as depicted here (daemons added in this exercise shown in blue).
In subsequent Exercises, you will add more Hadoop services to your cluster.
Because you have only five hosts to work with, you must run multiple Hadoop daemons
on all the hosts except lion, which will be used for Cloudera Manager services
only. For example, the NameNode, a DataNode and a NodeManager will all run on
elephant. Having a very limited number of hosts is not unusual when deploying
Hadoop for a proof-of-concept project. Please follow the best practices in the “Planning
Your Hadoop Cluster” chapter when sizing and deploying your production clusters.
After completing the installation steps, you will review a Cloudera Manager Agent log
file and review processes running on a machine in the cluster. Finally, there is a section
at the end of this exercise that provides a brief tour of the Cloudera Manager Admin UI.
IMPORTANT: This exercise builds on the previous one. If you were unable to
complete the previous exercise or think you may have made a mistake, run the
following command and follow the prompts to prepare for this exercise before
continuing.
You must run the command on elephant. (NOTE: this script takes several minutes to
run, and takes increasingly longer the further into the exercises you are.)
$ ~/training_materials/admin/scripts/reset_cluster.sh
4. The “Which edition do you want to deploy?” page should appear, with a green box
highlighting a product edition column.
Select the “Cloudera Enterprise Data Hub Edition Trial.”
Click Continue.
5. The “Thank you for choosing Cloudera Manager and CDH” page appears.
Click Continue.
Click Continue.
In the “Choose Method” section of the page, where “Use Parcels (Recommended)” is
selected, click the More Options button.
In the “Remote Parcel Repository URLs” area, click on each of the minus ( - ) icons
to remove all the current repository references.
Once the existing entries are all removed, click on the plus ( + ) icon to add a new
Remote Parcel Repository URL.
IMPORTANT: carefully verify that the details in your browser match the settings
shown in the screenshot above before proceeding. If an incorrect repository is
chosen, it can take several minutes to reset your environment correctly.
Click Continue.
10. The “Cluster Installation – Provide SSH login credentials” page appears.
Keep “Login To All Hosts As” set to root.
For “Authentication Method,” choose “All hosts accept same private key.”
Click the Browse button. In the “Places” column choose training. Then, in the
“Name” area, right-click (or on a Mac Ctrl-click or two finger-tap on the trackpad),
and select Show Hidden Files. Finally, click into the .ssh directory, select the
id_rsa file and click Open.
Leave the passphrase fields blank and click Continue.
When prompted to continue with no passphrase click OK.
13. The “Cluster Installation – Inspect hosts for correctness” page appears.
After a moment, output from the Host Inspector appears.
Click Finish.
14. The “Cluster Setup—Choose the CDH 5 services that you want to install on your
cluster” page appears.
Click Custom Services.
A table appears with a list of CDH service types.
15. Select the HDFS and YARN (MR2 Included) service types.
You will add additional services in later exercises.
Click Continue.
Role: Node(s)
Host Monitor: lion
Reports Manager: lion
Event Server: lion
Alert Publisher: lion
YARN (MR2 Included)
ResourceManager: horse
JobHistoryServer: monkey
NodeManager: same as DataNode (elephant, tiger, horse, monkey, but not lion)
To assign a role, click the fields with one or more hostnames in them. For example,
the field under SecondaryNameNode might initially have the value elephant. To
change the role assignment for the SecondaryNameNode to tiger, click the field
under SecondaryNameNode. A list of hosts appears. Select tiger, and then click
OK. The field under SecondaryNameNode now has the value tiger in it.
When you have finished assigning roles, compare your role assignments to the role
assignments in the screen shot below.
Verify that your role assignments are correct.
When you are certain that you have the correct role assignments, click Continue.
Click Continue.
You have already added the YARN (MR2 Included) and HDFS services, yet the only
service registered with init scripts at the operating system level is the cloudera-
scm-agent service.
In Cloudera Manager-managed clusters, the Cloudera Manager Agent (the
cloudera-scm-agent service) on each cluster host manages starting and stopping
the deployed Hadoop daemons.
$ sudo jps
The Hadoop daemons run as Java processes. You should see the NodeManager,
NameNode, and DataNode processes running.
Examine the details of one of the running CDH daemons.
$ cd ~/training_materials/admin/data
$ gunzip shakespeare.txt.gz
$ hdfs dfs -put shakespeare.txt /tmp
26. Click on the drop-down menu to the right of Cluster 1 to see many of the actions
you can perform on the cluster, such as Add Service, Stop, and Restart.
27. Choose Hosts > All Hosts to view the current status of each managed host in the
cluster. Clicking on the > icon in the Roles column for a host will reveal which roles
are deployed on that host.
In the exercises that follow, you will discover many other areas of Cloudera
Manager that will prove useful for administering your Hadoop cluster(s).
$ ~/training_materials/admin/scripts/reset_cluster.sh
3. Similarly, use the search box on the HDFS configuration page and search for “block
size” to discover the HDFS Block Size setting, which defaults to 128 MB.
$ cd ~/training_materials/admin/data
$ gunzip access_log.gz
$ hdfs dfs -put access_log weblog
The put command uploaded the file to HDFS. Since the file is about 504 MB, HDFS will
have split it into multiple blocks. Let’s explore how the file was stored.
7. Run the hdfs dfs -ls command to review the file’s permissions in HDFS.
On elephant:
a. From the Cloudera Manager Clusters menu, choose HDFS for your cluster.
b. Click on the NameNode Web UI link. This will open the NameNode Web UI at
http://elephant:50070.
c. Choose Utilities > Browse the file system then navigate into the
/user/training/weblog directory.
d. Notice that the permissions shown for the access_log file are identical to the
permissions you observed when you ran the hdfs dfs -ls command.
e. Now select the access_log file in the NameNode Web UI file browser to bring
up the “File information – access_log” window.
Notice that the Block Information drop-down list has four entries: Block 0,
Block 1, Block 2, and Block 3. This makes sense because, as you discovered, the
HDFS block size on your cluster is 128 MB and the extracted access_log file is
about 504 MB.
Blocks 0, 1, and 2 all show a size of 134217728 (or 128MB) in accordance with
the size specified in the HDFS block size setting you observed earlier in this
exercise. Block 3 is smaller than the others as you can observe in the “Size”
details if you choose it.
Notice also that each block is available on three different hosts in the cluster.
This is what you would expect since three is the current (and default)
replication factor in HDFS. Also, notice that each block may be replicated to
different data nodes than the other blocks that make up the file.
f. Choose Block 0 and write down the value that appears in the Size field. If Block
0 is not replicated on elephant, then choose another block that is replicated on
elephant.
________________________________________________________________________________
You will need this value when you examine the Linux file that contains this
block.
g. Select the value of the “Block ID” field and copy it (Edit menu > Copy). You will
need this value for the next step in this exercise.
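The find command is not reproduced in this copy of the manual. A sketch, assuming the
DataNode data directory is at the Cloudera Manager default location /dfs/dn on
elephant, might be:
$ sudo find /dfs/dn -name "*BLKID*"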
where BLKID is the actual Block ID you copied from the NameNode Web UI. Be
sure to keep the * characters on each side of the BLKID in your find command.
c. Verify that two files with the Block ID you copied appear in the find command
output—one file with an extension, .meta, and another file without this
extension. The .meta file is a metadata file and contains checksums for
sections of the block.
d. Verify in the results of the find command output that the size of the file
containing the HDFS block is exactly the size that was reported in the
NameNode Web UI.
10. Start any Linux editor with sudo; open the file containing the HDFS block. Verify
that the first few lines of the file match the first chunk of the access_log file
content.
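For example (vi is shown only as an illustration; any editor or pager started with sudo
will work):
$ sudo vi /path/to/block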
Note: You must start your editor with sudo because you are logged in to Linux as
the training user, and this user does not have privileges to access the Linux file
that contains the HDFS block.
Note: Replace /path/to/block in the command above with the actual path to
the block as shown in the results of the find command you ran in the previous step.
You can review the access_log file content on HDFS as follows:
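The commands are not reproduced in this copy. A sketch (the line count of 5 and the
block path are illustrative) might be:
$ sudo head -5 /path/to/block
$ hdfs dfs -cat weblog/access_log | head -5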
The results returned by the last two commands should match exactly.
IMPORTANT: This exercise builds on the previous one. If you were unable to complete
the previous exercise or think you may have made a mistake, run the following
command and follow the prompts to prepare for this exercise before continuing:
$ ~/training_materials/admin/scripts/reset_cluster.sh
1. Since the code for the application we want to execute is in a Java Archive (JAR)
file, we’ll use the hadoop jar command to submit it to the cluster. Like many
MapReduce programs, WordCount accepts two additional arguments: the HDFS
directory path containing input and the HDFS directory path into which output
should be placed. Therefore, we can run the WordCount program with the following
command.
On elephant:
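The command is not reproduced in this copy of the manual. A sketch that uses the
WordCount example shipped with the CDH parcel (the JAR path is an assumption about the
classroom environment) might be:
$ hadoop jar \
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
wordcount /tmp/shakespeare.txt counts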
3. This directory should show all the data output for the job. Job output will include
a _SUCCESS flag and one file created by each Reducer that ran. You can view the
output by using the hdfs dfs -cat command.
On elephant:
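For example (output file names such as part-r-00000 may vary with the number of
reducers that ran):
$ hdfs dfs -ls counts
$ hdfs dfs -cat counts/part-r-00000 | head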
5. Access the HistoryServer Web UI to discover where the Application Master ran.
From the drop-down menu for the “word count” application, choose Application
Details.
This action will open a page in the HistoryServer Web UI with details about the job.
7. Locate where the mapper task ran and view the log.
From the HistoryServer Web UI’s Job menu choose Map tasks.
From the Map Tasks table, click on the link in the Name column for the task.
The Attempts table displays. Notice the “Node” column shows you where the map
task attempt ran.
Click the logs link and review the contents of the mapper task log. When done, click
the browser back button to return to the previous page.
8. Locate where the reduce tasks ran and view the logs.
From the HistoryServer Web UI’s Job menu choose Reduce tasks.
From the Reduce Tasks table, click on the link in the Name column for one of the
tasks.
The Attempts table displays. Notice the “Node” column shows you where this
reducer task ran.
Click the logs link and review the contents of the log. Observe the amount of output
in the Reducer task log. When done, click the browser back button to return to the
previous page.
9. Determine the log level for Reducer tasks for the word count job.
Expand the Job menu and choose Configuration.
Twenty entries from the job configuration that were in effect when the word
count job ran appear.
In the Search field, enter log.level.
Locate the mapreduce.reduce.log.level property. Its value should be INFO.
Note: INFO is the default value for the “JobHistory Server Logging Threshold,” which
can be found on the Cloudera Manager YARN Configuration page for your cluster.
Note: You must delete the counts directory before running the WordCount
program a second time because MapReduce will not run if you specify an output
path which already exists.
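The commands referenced here are not reproduced in this copy. A sketch, using the same
example JAR path assumed earlier, might be:
$ hdfs dfs -rm -r counts
$ hadoop jar \
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
wordcount -D mapreduce.reduce.log.level=DEBUG \
/tmp/shakespeare.txt counts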
MapReduce programs coded to take advantage of the Hadoop ToolRunner allow
you to pass several types of arguments to Hadoop, including run-time configuration
parameters. The hadoop jar command shown above sets the log level for reduce
tasks to DEBUG.
Note: The -D option, as used in the hadoop jar command above, allows you to
override a default property setting by specifying the property and the value you
want to assign.
When your job is running, look for a line in standard output like the following:
11. After the job completes, locate and view one of the reducer logs.
From Cloudera Manager’s YARN Applications page for your cluster, locate the entry
for the application that you just ran.
Click on the ID link, and use the information available under the Job menu’s
“Configuration” and “Reduce tasks” links to verify the following:
• The Reducer task’s logs for this job contain DEBUG log records and the logs are
larger than the number of records written to the Reducer task’s logs during the
previous WordCount job execution.
12. Verify the results of the word count job were written to HDFS using any of the
following three methods.
Option 1: In Cloudera Manager, browse to the HDFS page for your cluster, then
choose File Browser. Drill down into /user/training/counts.
Option 2: Access the HDFS NameNode Web UI at http://elephant:50070.
Choose Utilities > Browse the file system, and navigate to the
/user/training/counts directory.
Option 3: on any machine in the cluster that has the DataNode installed
(elephant, tiger, monkey, or horse) run the following command in a terminal:
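The command is not reproduced here; for example, a command such as the following would
list the output directory:
$ hdfs dfs -ls /user/training/counts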
14. Select the Add Service option from the drop-down menu for Cluster 1.
17. Progress messages appear on the “Add Spark Service to Cluster 1” page.
When the adding of the service has completed, click Continue.
23. Notice that the icons indicating “Stale Configuration – restart needed” and “Stale
Configuration – client configuration redeployment needed” are displayed just to the
right of the “Actions” menu.
25. In the “Restart Stale Services” page, ensure “Re-deploy client configuration” is
selected, and click Restart Now.
When the restart has completed, click Finish.
Note: you may notice a new configuration warning that appears next to “Hosts” on
the Cloudera Manager home page. If you look into it, Cloudera Manager indicates
that memory is overcommitted on four of the hosts. These configuration issues,
along with the other configuration warnings that appeared after the initial cluster
installation, would need to be addressed in a true production cluster, however they
can be safely ignored in the classroom environment.
26. Start the Spark shell and connect to the yarn-client spark context on monkey.
Recall that the Spark Gateway service was installed on monkey so the Spark shell
should be run from monkey.
On monkey:
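The exact invocation is not shown in this copy; with Spark on YARN in client mode it
might look like this:
$ spark-shell --master yarn-client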
The Scala Spark shell will launch. You should eventually see the message, “SQL
context available as sqlContext.” If necessary, press the Enter key to see the
scala> prompt.
27. Type in the commands below to run a word count application using the
shakespeare.txt file that is already in HDFS.
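The commands themselves are not reproduced in this copy. A sketch of a Spark word count
in the Scala shell (the exact transformations used in the course may differ) is:
scala> val counts = sc.textFile("/tmp/shakespeare.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortByKey()
scala> counts.take(10).foreach(println)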
This accomplishes something very similar to the job you ran in the MapReduce
exercise, but this time the computational framework being used is Spark.
29. In Cloudera Manager, go to the YARN Applications page for your cluster.
You will see a “Spark shell” application.
32. In the Completed Jobs area, click on the “sortByKey…” link for the first job that ran
(Job Id 0).
Notice that this first job consisted of two stages.
33. Click on the “map at… “ link for the first stage (Stage 0).
34. Click on the Executors tab to see a summary of all the executors used by the Spark
application.
Copy the application ID for the Spark application returned by the command above.
Now run this command (where appId is the actual application ID).
On elephant:
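The command is not reproduced here; a sketch using the YARN log-aggregation CLI might
be:
$ yarn logs -applicationId appId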
Scroll through the logs returned by the shell. Notice that the logs for all the
containers that ran the Spark executors are included in the results.
These Spark application logs are stored in HDFS under /user/spark/applicationHistory.
$ ~/training_materials/admin/scripts/reset_cluster.sh
1. Go to the directory that contains the Hadoop configuration for Cloudera Manager-
managed daemons running on elephant and then view the contents.
On elephant:
$ cd /var/run/cloudera-scm-agent/process
$ sudo tree
Notice how there are separate directories for each role instance running on
elephant: DataNode, NameNode, and NodeManager. Notice also that some files,
such as hdfs-site.xml and core-site.xml, exist in more than one daemon’s
configuration directory.
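The commands referenced in the next paragraph are not reproduced in this copy. A sketch
that simply lists each role’s configuration directory (the exact commands in the manual
may differ) might be:
$ sudo ls nn-hdfs-NAMENODE
$ sudo ls nn-hdfs-DATANODE
$ sudo ls nn-yarn-NODEMANAGER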
In the commands above, replace each nn with the actual numbers generated
by Cloudera Manager. In the third command above, choose the nn-yarn-
NODEMANAGER directory with the highest nn value. There are two directories for
yarn-NODEMANAGER because you made a settings change in the previous exercise
and the agent retained a copy of the previous revision.
4. Return to the elephant terminal window and list the contents of the
/var/run/cloudera-scm-agent/process directory.
$ sudo ls -l /var/run/cloudera-scm-agent/process
Notice now how there are now two directories each for NameNode, DataNode, and
NodeManager. The old settings have been retained, however the new settings have
also been deployed and will now be used. The directory with the higher number in
the name is the newer one.
5. Find the difference between the old NameNode core-site.xml file and the new
one.
On elephant:
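The diff command is not reproduced here. A sketch, run from the
/var/run/cloudera-scm-agent/process directory with the older directory’s number as the
first nn and the newer directory’s number as the second, might be:
$ sudo diff nn-hdfs-NAMENODE/core-site.xml nn-hdfs-NAMENODE/core-site.xml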
The nn values above should be replaced with the actual numbers with which the
configuration directories are named.
You should see that the fs.trash.interval property value change has been
deployed to the new NameNode configuration file.
7. View the NameNode log file using the NameNode Web UI.
From the Cloudera Manager HDFS page, click the NameNode Web UI link.
From the NameNode Web UI, select Utilities > Logs.
The list of directories and files in the /var/log/hadoop-hdfs directory on
elephant appears.
Open the NameNode log file and review the file.
9. Review the NameNode daemon’s standard error and standard output logs using
Cloudera Manager.
Return to the Processes page for elephant.
Click the Stdout link for the NameNode instance. The standard output log appears.
Review the file, then return to the Processes page.
Click the Stderr link for the NameNode instance. The standard error log appears.
Review the file.
Note: if you want to locate these log files on disk, they can be found on elephant
in the /var/log/hadoop-hdfs and /var/run/cloudera-scm-agent/
process/nn-hdfs-NAMENODE/logs directories.
10. Using Cloudera Manager, review recent entries in the SecondaryNameNode logs.
To find the log, go to the HDFS Instances page for your cluster, then locate the
SecondaryNameNode role type and click the tiger link in the Host column.
In the Status page for the tiger host, scroll down to the Roles area and click Role
Log File in the SecondaryNameNode column.
11. Access the ResourceManager log file using the ResourceManager Web UI.
Navigate to the ResourceManager Web UI (from the Cloudera Manager YARN
page’s Web UI menu or by specifying the URL http://horse:8088 in your
browser).
Choose Nodes from the Cluster menu on the left side of the page.
Click the horse:8042 link to be taken to the NodeManager Web UI.
Expand the Tools menu on the left side of the page.
Click Local logs.
Finally, click the entry for the ResourceManager log file and review the file.
IMPORTANT: This exercise builds on the previous one. If you were unable to complete
the previous exercise or think you may have made a mistake, run the following
command and follow the prompts to prepare for this exercise before continuing:
$ ~/training_materials/admin/scripts/reset_cluster.sh
Agent Name: tail1
Configuration File:
tail1.sources = src1
tail1.channels = ch1
tail1.sinks = sink1
tail1.sources.src1.type = exec
tail1.sources.src1.command = tail -F /tmp/access_log
tail1.sources.src1.channels = ch1
tail1.channels.ch1.type = memory
tail1.channels.ch1.capacity = 500
tail1.sinks.sink1.type = avro
tail1.sinks.sink1.hostname = horse
tail1.sinks.sink1.port = 6000
tail1.sinks.sink1.batch-size = 1
tail1.sinks.sink1.channel = ch1
Tip: Ensure that each property defined in the configuration file text box is on a
single and separate line.
Click Save Changes.
Agent Name: collector1
Configuration File:
collector1.sources = src1
collector1.channels = ch1
collector1.sinks = sink1
collector1.sources.src1.type = avro
collector1.sources.src1.bind = horse
collector1.sources.src1.port = 6000
collector1.sources.src1.channels = ch1
collector1.channels.ch1.type = memory
collector1.channels.ch1.capacity = 500
collector1.sinks.sink1.type = hdfs
collector1.sinks.sink1.hdfs.path = hdfs://elephant/user/flume/collector1
collector1.sinks.sink1.hdfs.filePrefix = access_log
collector1.sinks.sink1.channel = ch1
$ accesslog-gen.sh /tmp/access_log
$ ls -l /tmp/access*
-rw-rw-r-- 1 training training 498 Nov 15 15:12 /tmp/access_log
-rw-rw-r-- 1 training training 997 Nov 15 15:12 /tmp/access_log.0
-rw-rw-r-- 1 training training 1005 Nov 15 15:11 /tmp/access_log.1
In Cloudera Manager browse to the HDFS page for your cluster and click on File
Browser.
Drill down into /user/flume/collector1. You should see many
access_log files.
11. Edit the collector1 agent configuration settings on horse by adding these three
additional name-value pairs:
collector1.sinks.sink1.hdfs.rollSize = 2048
collector1.sinks.sink1.hdfs.rollCount = 100
collector1.sinks.sink1.hdfs.rollInterval = 60
12. From the Flume Status page, click on the “Stale Configuration – refresh
needed” icon and follow the prompts to refresh the cluster.
Click Search.
Browse through the logged actions from both Flume agents.
Cleaning Up
15. Stop the log generator by hitting Ctrl+C in the first terminal window.
16. Stop both Flume agents from the Flume Instances page in Cloudera Manager.
17. Remove the generated access log files from the /tmp directory to clear up space on
your virtual machine.
On elephant:
$ rm -rf /tmp/access_log*
$ ~/training_materials/admin/scripts/reset_cluster.sh
1. Log on to MySQL.
On elephant:
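The command is not reproduced in this copy; one way to log in, using the training MySQL
account that also appears in the Sqoop commands later in this exercise, might be:
$ mysql --user=training --password=training movielens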
4. Exit MySQL.
On elephant:
mysql> quit;
$ sudo ln -s /usr/share/java/mysql-connector-java.jar \
/opt/cloudera/parcels/CDH/lib/sqoop/lib/
Now run the command below to confirm the symlink was properly created.
On elephant:
$ readlink /opt/cloudera/parcels/CDH/lib/\
sqoop/lib/mysql-connector-java.jar
If the symlink was properly defined, the command should return the
/usr/share/java/mysql-connector-java.jar path.
$ sqoop help
You can safely ignore the warning that Accumulo does not exist since this course
does not use Accumulo.
$ sqoop list-databases \
--connect jdbc:mysql://localhost \
--username training --password training
$ sqoop list-tables \
--connect jdbc:mysql://localhost/movielens \
--username training --password training
$ sqoop import \
--connect jdbc:mysql://localhost/movielens \
--table movie --fields-terminated-by '\t' \
--username training --password training
Notice how the INFO messages that appear show that a MapReduce job consisting
of four map tasks was completed.
12. Import the movierating table into Hadoop using the command in step 10 as an
example.
Verify that the movierating table was imported by using the command in step 11
above as an example, or by using the Cloudera Manager HDFS page’s File Browser.
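For reference, an import command modeled on the step-10 example might look like this:
$ sqoop import \
--connect jdbc:mysql://localhost/movielens \
--table movierating --fields-terminated-by '\t' \
--username training --password training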
13. Optionally observe the results in Cloudera Manager’s YARN Applications page.
Navigate to the YARN Applications page.
Notice the last two YARN applications that ran (movie.jar and
movierating.jar).
Explore the job details for either or both of these jobs.
Note:
This exercise uses the MovieLens data set, or subsets thereof. This
data is freely available for academic purposes, and is used and
distributed by Cloudera with the express permission of the UMN
GroupLens Research Group. If you would like to use this data for
your own research purposes, you are free to do so, as long as you
cite the GroupLens Research Group in any resulting publications. If
you would like to use this data for commercial purposes, you must
obtain explicit permission. You may find the full dataset, as well
as detailed license terms, at [http://www.grouplens.org/
node/73].
IMPORTANT: This exercise builds on the previous one. If you were unable to complete
the previous exercise or think you may have made a mistake, run the following
command and follow the prompts to prepare for this exercise before continuing:
$ ~/training_materials/admin/scripts/reset_cluster.sh
1. From the Cloudera Manager Home page, select the Add Service menu option from
the drop-down menu to the right of Cluster 1.
The Add Service wizard appears.
8. From the Cloudera Manager Home page, select the Add Service option for
Cluster 1.
The Add Service wizard appears.
10. Select the row containing the hdfs, yarn, and zookeeper services (but not the
Spark service), then click Continue.
The Customize Role Assignments page appears.
Click Test Connection and verify that connection to the MySQL database is
successful (indicated by the “Continue” button turning blue and becoming active).
Click Continue.
18. Verify that you can run a Hive command from the Beeline shell.
On elephant:
$ beeline -u jdbc:hive2://elephant:10000/default \
-n training
If you see a warning about hbase-prefix-tree, you can safely ignore it.
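For example, a minimal check from the Beeline prompt:
> SHOW TABLES;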
No tables should appear, because you haven’t defined any Hive tables yet, but you
should not see any error messages.
> !quit
20. From the Cloudera Manager Home page, select the Add Service option for
Cluster 1.
The Add Service wizard appears.
After adding Impala, on the Cloudera Manager home page you will notice that the
HDFS service has stale configurations as indicated by the icon that appears.
Click on the “Stale Configuration – Restart needed” icon. The “Stale
Configurations” page appears.
Click Restart Stale Services.
Check the box to “Re-deploy client configuration”, then click Restart Now.
When the action completes click Finish.
28. Review the movierating table data imported into HDFS during the Sqoop
exercise.
On elephant:
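The command is not reproduced here; a sketch (Sqoop output file names such as
part-m-00000 are typical but may differ) might be:
$ hdfs dfs -ls movierating
$ hdfs dfs -cat movierating/part-m-00000 | head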
31. Verify that you created the movierating table in the Hive metastore.
On elephant:
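The commands are not reproduced in this copy; a sketch that reconnects with Beeline and
lists the tables might be:
$ beeline -u jdbc:hive2://elephant:10000/default -n training
> SHOW TABLES;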
32. Run a simple Hive test query that counts the number of rows in the
movierating table.
On elephant:
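The statements are not reproduced here. A sketch, assuming you are still in the Beeline
session from the previous step, might be:
> SELECT COUNT(*) FROM movierating;
> SET hive.execution.engine;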
The command above should show that the current execution engine is MapReduce
(mr).
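A sketch of the statement that switches engines for the current session might be:
> SET hive.execution.engine=spark;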
The above command changes the execution engine for Hive to Spark.
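You would then re-run the count query to compare execution times, for example:
> SELECT COUNT(*) FROM movierating;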
> !quit
$ impala-shell
36. Since you defined a new table after starting the Impala server on horse, you must
now refresh that server’s copy of the Hive metadata.
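The statement is not reproduced here; in the impala-shell, the metadata refresh might
be:
> INVALIDATE METADATA;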
37. In Impala, run the same query against the movierating table that you ran in Hive.
Compare the amount of time it took to run the query in Impala to the amount of
time it took in Hive on MapReduce and Hive on Spark.
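For example:
> SELECT COUNT(*) FROM movierating;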
> quit;
Users will be able to access their environments by using a Web browser, eliminating the
need for administrators to install Hadoop client environments on the analysts’ systems.
At the end of this exercise, you should have daemons deployed on your five hosts as
follows (new daemons shown in blue):
The Hue server will be deployed on monkey. The HttpFS and Oozie servers on monkey
will support several Hue applications.
IMPORTANT: This exercise builds on the previous one. If you were unable to complete
the previous exercise or think you may have made a mistake, run the following
command and follow the prompts to prepare for this exercise before continuing:
$ ~/training_materials/admin/scripts/reset_cluster.sh
4. The hdfs Role Instances page reappears. The HttpFS(monkey) role instance
now appears in the list of role instances.
Notice that the status for this role instance is Stopped.
6. To verify HttpFS operation, run the HttpFS LISTSTATUS operation to examine the
content in the /user/training directory in HDFS.
On elephant:
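The command is not reproduced in this copy. A sketch, assuming the HttpFS default port
of 14000 on monkey, might be:
$ curl -s "http://monkey:14000/webhdfs/v1/user/training?op=LISTSTATUS&user.name=training" \
| python -m json.tool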
Note: The HttpFS REST API returns JSON objects. Piping the JSON objects to
python -m json.tool makes the objects easier to read in standard output.
10. Select the row containing the hdfs, yarn, and zookeeper services, then click
Continue.
The Customize Role Assignments page appears.
12. The Database Setup page appears. Specify the settings as shown here:
• Database Host Name: lion
• Database Type: MySQL
• Database Name: oozie
• Username: oozieuser
• Password: password
Click “Test Connection” and verify the connection is successful. Click Continue.
17. From the Cloudera Manager Home page, select the Add Service option for
Cluster 1.
The Add Service wizard appears.
19. Select the row containing the hdfs, hive, impala, oozie, yarn, and zookeeper
services, then click Continue.
The Customize Role Assignments page appears.
21. The Database Setup page appears. Fill in the details as shown here:
• Database Host Name: lion
• Database Type: MySQL
• Database Name: hue
• Username: hueuser
• Password: password
22. Click Test Connection and verify the connection is successful. Click Continue.
26. Submit a Hadoop WordCount job so that there will be a MapReduce job entry that
you can browse in Hue after you start the Hue UI.
On elephant:
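The command is not reproduced here. A sketch, reusing the example JAR path assumed
earlier and an illustrative output directory name (counts2 is hypothetical), might be:
$ hadoop jar \
/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
wordcount /tmp/shakespeare.txt counts2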
Type in admin as the user, with the password training, then click Create
Account.
31. If you completed the “Querying HDFS with Hive and Impala” exercise, start the Hive
Query Editor by selecting Query Editors > Hive.
Enter the following query to verify that Hive is working in Hue:
SHOW TABLES;
Click the Execute icon (play button). The result of the query should be the
movierating table.
Enter another query to count the number of records in the movierating table:
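For example:
SELECT COUNT(*) FROM movierating;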
Give it some time to complete. The UI will first show the Log tab contents, then it
should eventually show the Results tab. The query should run successfully.
32. If you completed the “Querying HDFS with Hive and Impala” exercise, start the
Impala Query Editor by selecting Query Editors > Impala.
Enter the following query to verify that Impala is working in Hue:
SHOW TABLES;
Click the Execute icon. The result of the query should show the movierating
table.
Enter another query to count the number of records in the movierating table:
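For example:
SELECT COUNT(*) FROM movierating;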
38. Click the entry for the part-r-00000 file—the output file from the WordCount
job.
A read-only editor showing the contents of the part-r-00000 file appears.
40. Access the Hue User Admin tool by selecting the Administration menu (cog and
wheels icon) and then choosing Manage Users.
The User Admin screen appears where you can define Hue users and groups and set
permissions.
Notice the automatically created entry for the admin user.
You will create another user and a group in the next task.
42. Click the About Hue icon (to the left of the Home icon).
The Quick Start Wizard’s Check Configuration tab shows “All OK. Configuration
check passed.”
Click into the Configuration tab (to the right of the About Hue link) to examine
Hue’s configuration.
Click the Server Logs tab to examine Hue’s logs.
43. Verify that you are still logged in to Hue as the admin user.
44. Activate the Hue User Management tool by selecting Administration > Manage
Users.
46. Click Add group and name the new group analysts.
Configure the permissions by selecting the ones listed below:
• about.access
• beeswax.access
• filebrowser.access
• help.access
• impala.access
• jobbrowser.access
• metastore.write
• metastore.access
• pig.access
Click Add group.
49. Sign out of the Hue UI (using the arrow icon in the top right of the screen).
50. Log back in to the Hue UI as user fred with password training.
Verify that in the session for fred, only the Hue applications configured for the
analysts group appear. For example, the Administration menu does not allow
fred to manage users. Fred also has no access to the Workflows, Search, and
Security menus that are available to the admin user.
IMPORTANT: This exercise builds on the previous one. If you were unable to complete
the previous exercise or think you may have made a mistake, run the following
command and follow the prompts to prepare for this exercise before continuing:
$ ~/training_materials/admin/scripts/reset_cluster.sh
3. Using steps like the ones you followed to stop the Hue service, stop the Oozie
service.
Verify that the only services that are still up and running on your cluster are the Hosts,
HDFS, Spark, YARN (MR2 Included), ZooKeeper, and Cloudera Manager Service
services. All these services should have good health.
• NameNode Hosts
◦ elephant (Current)
◦ tiger
• JournalNode Hosts
◦ elephant, horse, tiger
Click Continue.
Click Continue.
The “Enable High Availability Command” page appears. The messages shown in
the screen shot below appear as Cloudera Manager enables HDFS high availability.
Note: Formatting the name directories of the current NameNode will fail. As
described in the Cloudera Manager interface, this is expected.
When the process of enabling high availability has finished, click Continue.
An informational message appears informing you of post-setup steps regarding the
Hive Metastore. You will not perform the post-setup steps because you will not be
using Hive for any remaining exercises.
Click Finish.
12. The HDFS service’s Status page appears. Click into the Instances page.
Observe that the HDFS service now comprises the following role instances:
• Balancer on horse
• DataNodes on tiger, elephant, monkey, and horse
• Failover Controllers on tiger and elephant
• An HttpFS server on monkey
• JournalNodes on elephant, tiger, and horse
• The active NameNode on elephant
• The standby NameNode on tiger
• No SecondaryNameNode
13. From the HDFS Instances page, click on Federation and High Availability and
observe that the state of one of the NameNodes is active and the state of the other
NameNode is standby.
14. Click the Instances link again to return to the previous page view.
15. Select the check box to the left of the entry for the active NameNode.
17. Wait for the restart operation to complete. When it has successfully completed click
Close.
Verify that the states of the NameNodes have changed—the NameNode that was
originally active (elephant) is now the standby, and the NameNode that was
originally the standby (tiger) is now active. If the Cloudera Manager UI does not
immediately reflect this change, give it a few seconds and it will.
18. Go to Diagnostics > Events and notice the many recent event entries related to the
restarting of the NameNode.
19. Back in the HDFS Instances tab, restart the NameNode that is currently the active
NameNode.
After the restart has completed, verify that the states of the NameNodes have again
changed.
$ ~/training_materials/admin/scripts/reset_cluster.sh
On elephant:
$ cd ~/training_materials/admin/scripts
$ cat pools.sh
The script starts or stops a job in the pool you specify. It takes two parameters:
the pool name and the action to take (start or stop). Each job it runs is relatively
long-running and consists of 10 mappers and 10 reducers.
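For example, to start a job in a pool and later stop it (the pool name pool1 is
illustrative, and this sketch assumes the scripts directory is on your PATH as noted in
the General Notes):
$ pools.sh pool1 start
$ pools.sh pool1 stop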
Note: Remember, we use the terms pool and queue interchangeably.
The table displays the pools in the cluster. The pools you submitted jobs to should
have pending containers and allocated containers.
The table also shows the amount of memory and vcores that have been allocated to
each pool.
8. Back in Cloudera Manager, observe the resource allocation effect of starting a new
job in a new pool.
Occasionally refresh the YARN Resource Pools page.
Some pools may be initially over their fair share because the first jobs to run will
take all available cluster resources.
However, over time, notice that the jobs running over fair share begin to shed
resources, which are reallocated to other pools to approach fair share allocation for
all pools.
Tip: Mouse over any one of the “Per Pool” charts and then click the double-arrow
icon to expand the chart size.
Wait a minute or two, then observe the results in the charts on the Dynamic
Resource Pools page.
11. Observe the effect of the new Dynamic Resource Pool on the resource allocations
within the cluster.
In the YARN Applications page, check how many jobs are still running and in
which pools they are running.
Use the pools.sh script to start or stop jobs so that there is one job running in
each of the four pools.
Return to the YARN Resource Pools page to observe the effect of the pool settings
you defined.
As you continue to observe the “Per Pool Shares – Fair Share Memory” chart, you
should soon see that pool2 is given a greater share of resources.
$ ~/training_materials/admin/scripts/reset_cluster.sh
Only if you do not see the access_log file in HDFS, place it there now.
On elephant:
$ cd ~/training_materials/admin/data
$ hdfs dfs -mkdir weblog
$ gunzip -c access_log.gz \
| hdfs dfs -put - weblog/access_log
We will revisit this block when the NameNode recognizes that one of the DataNodes
is a “dead node” (after 10 minutes).
4. Visit the NameNode Web UI again and click on Datanodes. Refresh the browser
several times and notice that the “Last contact” value for the elephant DataNode
stays the same, while the “Last contact” values for the other DataNodes continue to
update with more recent timestamps.
5. Run the HDFS file system consistency check to see that the NameNode
currently thinks there are no problems.
On elephant:
6. Wait for at least ten minutes to pass before starting the next exercise.
(optional) in any terminal:
$ ~/training_materials/admin/scripts/reset_cluster.sh
1. In the NameNode Web UI, click on Datanodes and confirm that you now have one
“dead node.”
2. View the location of the block from the access_log file you investigated in the
previous exercise. Notice that Hadoop has automatically re-replicated the data to
another host to retain three-fold replication.
3. In Cloudera Manager, navigate to the HDFS Charts Library page and click on the
Blocks and Files chart filter.
Scroll down and view the charts which show evidence of what occurred. For
example, view the “Under-replicated Blocks” chart, the “Pending Replication
Blocks” chart, and the “Scheduled Replication Blocks” chart.
Note the spike in activity that occurred after the DataNode went down.
Click Search.
Scroll down to the log entries that occurred just over 10 minutes after the DataNode
was stopped. Notice the log messages related to blocks being replicated. You can also
find these log entries by searching for “blockstatechange”.
5. Run the hdfs fsck command again to observe that the filesystem is still healthy.
On elephant:
6. Run the hdfs dfsadmin -report command to see that one dead DataNode is
now reported.
On elephant:
7. Use Cloudera Manager to restart the DataNode on elephant, bringing your cluster
back to full strength.
8. Run the hdfs fsck command again to observe the temporary over replication of
blocks.
On elephant:
Note that the over replication situation will resolve itself (if it has not already) now
that the previously unavailable DataNode is once again running.
If the command above did not show any over replicated blocks, go to Diagnostics >
Logs in Cloudera Manager and search the HDFS source for “replica”. You should
find evidence of the temporary over-replication in the log entries corresponding to
the time range just after the DataNode was started again.
$ ~/training_materials/admin/scripts/reset_cluster.sh
2. Take a snapshot.
Still in the Cloudera Manager File Browser at /user/training, click Take
Snapshot. Give it the name snap1 and click OK.
After the snapshot completes click Close.
The snapshot section should now show your “snap1” listing.
3. Delete data from /user/training then restore data from the snapshot.
Now let’s see what happens if we delete some data.
On elephant:
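The commands are not reproduced here. A sketch (the -skipTrash flag is an assumption;
it permanently removes the directory rather than moving it to the trash) might be:
$ hdfs dfs -rm -r -skipTrash weblog
$ hdfs dfs -ls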
The second command should show that the weblog directory is now gone.
However, your weblog data is still available, which you can see by running the
commands here:
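For example, the snapshot contents remain readable under the .snapshot directory:
$ hdfs dfs -ls /user/training/.snapshot/snap1
$ hdfs dfs -ls /user/training/.snapshot/snap1/weblog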
Restore a copy of the weblog directory to the original location and then verify it is
back in place.
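A sketch of the restore and verification, copying the data back out of the snapshot
directory, might be:
$ hdfs dfs -cp /user/training/.snapshot/snap1/weblog /user/training/weblog
$ hdfs dfs -ls weblog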
$ ~/training_materials/admin/scripts/reset_cluster.sh
1. Configure Cloudera Manager to send email alerts using the email server on lion.
In Cloudera Manager, choose Clusters > Cloudera Management Service.
Click on Configuration and then choose the Alert Publisher filter.
Confirm the Alerts: Enable Email Alerts property is checked, then set the following
properties:
• Alerts: Mail Server Username: training
• Alerts: Mail Server Password: training
• Alerts: Mail Message Recipients: training@localhost
• Alerts: Mail Message Format: text
Save the changes.
After you are done reading the email, type q and again hit the Enter key to exit the
mail client.
$ ~/training_materials/admin/scripts/reset_cluster.sh
$ cd ~/training_materials/admin/data
$ gunzip shakespeare.txt.gz
$ hdfs dfs -put shakespeare.txt /tmp
On elephant:
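The command is not reproduced here; for example:
$ hdfs dfs -ls weblog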
The command above should confirm that the access_log file exists.
Only if access_log was not found, run these commands to place the file in HDFS.
On elephant:
$ cd ~/training_materials/admin/data
$ hdfs dfs -mkdir weblog
$ gunzip -c access_log.gz \
| hdfs dfs -put - weblog/access_log
$ cd ~/training_materials/admin/java
$ hadoop jar EvilJobs.jar HeapOfTrouble \
/tmp/shakespeare.txt heapOfTrouble
• What is there that is different in the environment that was not there before the
problem started occurring?
• If a specific job seems to be the cause of the problem, locate the task logs for the job,
including the ApplicationMaster logs, and review them. Does anything stand out?
• If it seems like a Hadoop MapReduce job is the cause of the problem, is it possible to
get the source code for the job?
• Does searching the Web for the error provide any useful hints?
Post-Exercise Discussion
After some time has passed, your instructor will ask you to stop troubleshooting and
will lead the class in a discussion of troubleshooting techniques.
Exercise Environment
After class has finished, you can continue working on the exercises or practice the skills
you have learned.
The cloud-based exercise environment you were provided as part of the class
will continue to be available after the class is over. You can continue to use that
environment for up to ten hours within 30 days after the class is complete. If you
require additional flexibility, please contact Cloudera.
Note: Be sure to bookmark or make a note of the Access URL you used during class; you
will need it to access the environment after class.
2. Click on the play icon (triangle icon) at the top of the browser page that shows the
thumbnail views of your machines. Give the machines one or two minutes to start.
4. Run these commands in the terminal the first time to get the start-cluster.sh
script:
$ wget http://tiny.cloudera.com/start-cluster.sh
$ chmod +x start-cluster.sh
5. Run the start-cluster.sh script in the terminal every time you restart your
cluster:
$ ./start-cluster.sh
Note: The above will not work if Cloudera Manager has not yet been installed. If it
has not been, perform the hands-on exercise titled “Installing Cloudera Manager
Server” before doing this step.
9. Wait for any commands that are still running to complete, as shown in the All
Recent Commands page. The cluster should return to good health after a minute or
two.