Running a Pig Program on the CDH Single-Node Cluster on an AWS EC2 Instance
IMPORTANT INSTRUCTIONS
● Ensure that you have installed the WinSCP tool on your Windows machine.
● The following notations have been used for the commands in this document:
[ec2-user@ip-10-0-0-14 ~]$ hadoop command
Output of the command
As shown above, the command to be run is written in bold, and its output is written in
italics. The prompt [ec2-user@ip-10-0-0-14 ~] indicates the user as which the command
is to be executed.
● Whenever you want to access a file, whether to execute the code or to push it onto the
HDFS, you need to either be in the file’s directory or specify the relative path to the
file.
● Be careful with the spaces in the commands.
● If a series of commands is given in a particular order, make sure that you run them in
the same order.
NOTE: Before starting with the document below, it is necessary that you create the EC2
instance with Cloudera installed on your machine. Go through Video 1 before getting
started with this document.
In this document, we are running the wordcount program on the ‘dropbox-policy.txt’ file
using PIG.
Steps to Run a PIG Program on the CDH Single-Node Cluster on an
AWS EC2 Instance
1. Start CDH EC2 instance from the AWS console and wait until the instance state changes
to ‘running’.
3. Go to your browser and open Cloudera Manager. To access the Cloudera Manager page,
enter your public IP address followed by ‘:7180’, as shown below:
<your public IP>:7180
Username: admin
Password: admin
4. After logging in to the Cloudera Manager, click on ‘Cloudera Management Service’
followed by ‘Restart’.
5. After the restart, click on “Close”.
6. Wait until all the services turn green, as shown in the image below.
7. Whenever a MapReduce or PIG program is to be run on the AWS EC2 instance, we need
to visit the YARN Configuration page and edit the following properties:
● yarn.scheduler.maximum-allocation-mb
● yarn.nodemanager.resource.memory-mb
c. Now we have to increase the memory allocation. To do so, enter each of the following
properties in the ‘Search’ field and increase its value to the number mentioned
below:
i. yarn.scheduler.maximum-allocation-mb
The default value is 1 GB. Change it to 8 GB and then click on ‘Save
Changes’.
ii. yarn.nodemanager.resource.memory-mb
The default value is 1 GB. Change it to 10 GB and then click on ‘Save
Changes’.
d. Restart the YARN service by clicking on the ‘Stale Configuration: Restart needed’
icon, as shown in the image below:
g. It generally takes a few minutes to restart the service. Wait until both the
green ticks appear. Now, click on ‘Finish’.
h. Wait until the YARN service turns green.
8. Next, we have to connect to the EC2 instance using PuTTY. Once connected, we log
in as the ec2-user. We then need to switch to the root user using the ‘sudo -i’
command.
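For example, the switch from the ec2-user to the root user looks like this (the IP address in your prompt will differ):
[ec2-user@ip-10-0-0-105 ~]$ sudo -i
[root@ip-10-0-0-105 ~]#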
9. PIG is by default installed in Cloudera. To verify, type ‘pig’ as shown below:
[root@ip-10-0-0-105 ~]# pig
10. Now, the grunt shell opens in the terminal, where you can run PIG code.
Use the ‘quit’ command to exit from the grunt shell.
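For example:
grunt> quit
[root@ip-10-0-0-105 ~]#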
11. Now, we have to run a program as per the instructions in the video.
First, we have to create a directory named ‘Pig’. We will use the ‘mkdir’ command to
do so:
[root@ip-10-0-0-105 ~]# mkdir Pig
13. Now, we need to copy the downloaded files (‘count-words.pig’ and ‘dropbox-policy.txt’)
from our local machine to the EC2 instance.
a. Mac/Linux users:
Use the following command to copy a file from your local system to the EC2
instance.
scp -i <path of .pem file> <path of the file on your local system>
ec2-user@<public IP>:<destination path on the EC2 instance>
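For example, assuming the private key is ‘mykey.pem’, the public IP is 54.12.34.56 (both hypothetical values), and the downloaded files are in the current directory:
scp -i mykey.pem count-words.pig dropbox-policy.txt ec2-user@54.12.34.56:/home/ec2-user/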
b. Windows users:
Move the downloaded files to a folder named ‘Pig_Data’ on the desktop of
your Windows machine. This is not strictly necessary; you only need to
remember where you have stored the downloaded files.
WinSCP is a tool to transfer a file from a Windows machine to a Linux
machine (EC2 instance).
i. Open WinSCP.
ii. Enter the public IP of your EC2 instance in the ‘Host name’ field and the following in the ‘User name’ field:
Username: ec2-user
iii. Then, click on ‘Advanced’.
iv. After clicking on ‘Authentication’, enter the path of your PPK file.
v. Click ‘OK’ followed by ‘Login’. Click ‘Yes’ on the pop-up that appears.
vi. Now, the following screen appears.
vii. Browse to the folder where you have stored the downloaded files. In
our case, it was a folder named ‘Pig_Data’ on the desktop.
14. Now, go back to the EC2 instance. We need to copy the files ‘count-words.pig’ and
‘dropbox-policy.txt’ from /home/ec2-user/ to /root/Pig/ using the following
commands:
[root@ip-10-0-0-105 ~]# cp /home/ec2-user/count-words.pig /root/Pig/
[root@ip-10-0-0-105 ~]# cp /home/ec2-user/dropbox-policy.txt /root/Pig/
15. Verify whether the files have been copied to the Pig directory or not. To do so,
change the working directory to ‘Pig’ using the ‘cd’ command followed by ‘ls’, and
check whether the files ‘count-words.pig’ and ‘dropbox-policy.txt’ exist.
[root@ip-10-0-0-105 ~]# cd Pig
[root@ip-10-0-0-105 Pig]# ls
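If the files have been copied correctly, both of them appear in the listing:
count-words.pig  dropbox-policy.txt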
16. Come out of the Pig directory using the ‘cd’ command.
[root@ip-10-0-0-105 Pig]# cd
[root@ip-10-0-0-105 ~]#
Creating a Directory Inside the HDFS and Changing its Owner
17. The commands used below demonstrate how to create a directory in the HDFS.
Note: A directory under /user in the HDFS can be created only by the hdfs user. So now, switch
to the hdfs user. Note that there is a space between ‘-’ and ‘hdfs’ in the command
used below:
[root@ip-10-0-0-105 ~]# su - hdfs
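If it does not already exist, the directory used in the rest of this document, /user/root, can be created as the hdfs user with the following command:
[hdfs@ip-10-0-0-105 ~]$ hadoop fs -mkdir /user/root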
18. You can verify that the directory has been created by listing the /user/ directory, as shown below:
[hdfs@ip-10-0-0-105 ~]$ hadoop fs -ls /user/
Found 6 items
As seen in the listing, the owner of the newly created directory is ‘hdfs’. To send
a file from any other user into an HDFS directory, the owner of that directory should be changed
to the user sending the file. For example, if you have to send a file from the root user
to a directory inside the HDFS, the owner of that particular directory should be
changed to root.
19. To change the owner of the directory created from hdfs to root, run the following
command:
[hdfs@ip-10-0-0-105 ~]$ hadoop fs -chown root /user/root
20. You can verify whether or not the owner has been changed using the command
shown below:
[hdfs@ip-10-0-0-105 ~]$ hadoop fs -ls /user/
Found 6 items
drwxrwxrwx - mapred hadoop 0 2018-02-18 07:16 /user/history
drwxrwxr-t - hive hive 0 2018-02-18 07:17 /user/hive
drwxrwxr-x - hue hue 0 2018-02-18 07:18 /user/hue
drwxrwxr-x - oozie oozie 0 2018-02-18 07:18 /user/oozie
drwxr-xr-x - root supergroup 0 2018-02-18 09:49 /user/root
drwxr-x--x - spark spark 0 2018-02-18 07:17 /user/spark
You can see that the owner has changed from hdfs to root.
21. Now, use the ‘exit’ command to switch back from the hdfs user to the root user.
[hdfs@ip-10-0-0-105 ~]$ exit;
logout
[root@ip-10-0-0-105 ~]#
22. Now, use the ‘put’ command to send the ‘dropbox-policy.txt’ file into the hdfs
/user/root directory.
Syntax: hadoop fs -put <source> <destination>
[root@ip-10-0-0-105 ~]# hadoop fs -put /root/Pig/dropbox-policy.txt /user/root
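You can confirm that the file has reached the HDFS by listing the destination directory:
[root@ip-10-0-0-105 ~]# hadoop fs -ls /user/root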
25. Now, we need to edit the ‘count-words.pig’ file using the vi editor. We need to change the LOAD and
STORE paths so that they point to the correct locations in the HDFS.
After making the changes, press Esc to switch to command mode and then use :wq!
to save and exit the vi editor.
26. Verify the changes using the ‘cat’ command.
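For reference, the contents of ‘count-words.pig’ after editing should resemble a standard PIG word-count script with HDFS paths. The sketch below is only illustrative; the exact script provided with the course may differ, and the output directory name (/user/root/wordcount_output) is an assumption:
-- Load the text file from HDFS, one line per record
lines = LOAD '/user/root/dropbox-policy.txt' AS (line:chararray);
-- Split each line into words, one word per record
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together
grouped = GROUP words BY word;
-- Count the occurrences of each word
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;
-- Write the results back to HDFS (output directory name is an assumption)
STORE counts INTO '/user/root/wordcount_output';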
27. Now, we will run the code using the ‘pig count-words.pig’ command and check the
output at the location specified in the STORE statement in the script.
[root@ip-10-0-0-105 Pig]# pig count-words.pig
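Once the job completes, the word counts can be read from the output directory named in the STORE statement. For example, assuming the output directory used in the sketch above:
[root@ip-10-0-0-105 Pig]# hadoop fs -cat /user/root/wordcount_output/part*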
28. To verify whether the code has run successfully or not, go to the Resource Manager.
Access the Resource Manager using your public IP followed by ‘:8088’, as shown
below:
<Public IP>:8088
Username: admin
Password: admin
Now, check whether the PIG Program shows ‘SUCCEEDED’ in the FinalStatus column
or not.