HADOOP LAB
2018-2019
Name:
Roll No:
Section:
Contents
Introduction
HDFS Commands
Word Count
INTRODUCTION
Hadoop is an open-source framework that allows storing and processing big data in a distributed
environment across clusters of computers using simple programming models. It is designed to scale up
from single servers to thousands of machines, each offering local computation and storage.
Due to the advent of new technologies, devices, and communication means like social
networking sites, the amount of data produced by mankind is growing rapidly every year. The amount of
data produced from the beginning of time till 2003 was 5 billion gigabytes; piled up in the form of disks,
it would fill an entire football field. The same amount was created every two days in 2011, and every
ten minutes in 2013, and this rate is still growing enormously. Though all this information is meaningful
and can be useful when processed, most of it is being neglected. The major challenges associated with
such big data include:
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
To meet the above challenges, organizations normally take the help of enterprise servers.
Traditional Approach
In this approach, an enterprise will have a computer to store and process big data. Here data is
stored in an RDBMS like Oracle Database, MS SQL Server or DB2, and sophisticated software can be
written to interact with the database, process the required data and present it to the users for analysis.
Limitation
This approach works well where the volume of data is small enough to be accommodated by standard
database servers, or up to the limit of the processor that is processing the data. But when it comes to
dealing with huge amounts of data, it is a tedious task to process them through a single traditional
database server.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into
small parts, assigns those parts to many computers connected over the network, and collects the
results to form the final result dataset.
Hadoop
Doug Cutting and his team used the solution provided by Google and started an open-source project
called HADOOP in 2005. Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes. In short, the Hadoop framework is capable of developing
applications that run on clusters of computers and perform complete statistical analysis of huge
amounts of data.
A Hadoop framework application works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from a single server to
thousands of machines, each offering local computation and storage.
Hadoop MapReduce: This is a system for parallel processing of large data sets.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process big
amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop programs
perform:
The Map Task: This is the first task, which takes input data and converts it into a set of data,
where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines those data
tuples into a smaller set of tuples. The reduce task is always performed after the map task.
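For example, given an input line "the cat sat on the mat", the map task emits the key/value pairs
(the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1); after the framework groups the pairs by key, the
reduce task receives (the, [1, 1]), (cat, [1]), and so on, and sums each list to produce (the, 2), (cat, 1),
(sat, 1), (on, 1), (mat, 1).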
Typically both the input and the output are stored in a file system. The framework takes care of
scheduling tasks, monitoring them and re-executing the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per
cluster node. The master is responsible for resource management, tracking resource
consumption/availability, and scheduling the jobs' component tasks on the slaves, monitoring them and
re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master and
provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the
JobTracker goes down, all running jobs are halted.
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a
distributed file system that is designed to run on large clusters (thousands of computers) of small
computer machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where master consists of a single NameNode that manages the
file system metadata and one or more slave DataNodes that store the actual data.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of
DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The DataNodes take
care of read and write operations on the file system. They also take care of block creation, deletion
and replication based on instructions given by the NameNode.
HDFS provides a shell like any other file system and a list of commands are available to interact with
the file system. These shell commands will be covered in a separate chapter along with appropriate
examples.
Job execution proceeds in stages.
Stage 1
A user/application submits a job to the Hadoop job client, specifying the locations of the input and
output files in HDFS, the jar file containing the map and reduce implementations, and the job
configuration.
Stage 2
The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker,
which then assumes the responsibility of distributing the software/configuration to the slaves,
scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
Stage 3
The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the
output of the reduce function is stored in the output files on the file system.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient,
and it automatically distributes the data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA),
rather Hadoop library itself has been designed to detect and handle failures at the application
layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to operate
without interruption.
Another big advantage of Hadoop is that apart from being open source, it is compatible with all
platforms since it is Java based.
HADOOP INSTALLATION - UBUNTU
Download Hadoop
$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz
Unzip it
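For example, assuming the archive was downloaded to the current directory:
$ tar xzf hadoop-3.0.0.tar.gz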
Hadoop Configuration
Make a directory called hadoop under /usr/local and move the contents of the folder 'hadoop-3.0.0' to
this directory
$ sudo mkdir -p /usr/local/hadoop
$ cd hadoop-3.0.0/
$ sudo mv * /usr/local/hadoop
$ sudo chown -R hduser:hadoop /usr/local/hadoop
The following files have to be modified to complete the Hadoop setup:
1. ~/.bashrc
2. hadoop-env.sh
3. core-site.xml
4. hdfs-site.xml
5. yarn-site.xml
~/.bashrc
If you don't know the path where Java is installed, first run the following command to locate it:
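For example, on Ubuntu the following command resolves the symlink and prints the full path of the
java binary, from which the installation directory (e.g. /usr/lib/jvm/java-8-openjdk-amd64) can be read:
$ readlink -f /usr/bin/java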
Now open ~/.bashrc:
$ sudo nano ~/.bashrc
Note: I have used the 'nano' editor; you can use a different one. No issues.
Now once the file is opened, append the following code at the end of the file.
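The exact content depends on your setup; a typical set of lines for this layout (Hadoop under
/usr/local/hadoop, OpenJDK 8) is shown below; adjust the paths to match your system:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"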
Then reload .bashrc so that the changes take effect:
$ source ~/.bashrc
hadoop-env.sh
We need to tell Hadoop the path where Java is installed. That's what we will do in this file: specify the
path in the JAVA_HOME variable.
$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Now, the first variable in the file will be JAVA_HOME; change its value to
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
core-site.xml
The core-site.xml file contains configuration properties that Hadoop uses when starting up. Create a
directory that Hadoop can use for temporary data and open the file:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following content to the file:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the
FileSystem implementation. The uri’s scheme determines the config property (fs.SCHEME.impl) naming the
FileSystem implementation class. The uri’s authority is used to determine the host, port, etc. for a
filesystem.</description>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file specifies the directories that will be used on this host as the:
1. Name Node
2. Data Node
Make the directories:
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following content to the file:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is
created. The default is used if replication is not specified at create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml
$ sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
Just like the other two, add the following content between the configuration tags.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Format the Hadoop file system (this needs to be done only once, before starting Hadoop for the first time):
$ hadoop namenode -format
Start Hadoop daemons
Now that the Hadoop installation is complete and the name node is formatted, we can start Hadoop
from the following directory.
$ cd /usr/local/hadoop/sbin
$ start-all.sh
Just check if all daemons are properly started using the following command:
$ jps
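If everything started correctly, the output lists the running Hadoop daemons, similar to the following
(process IDs will differ):
<pid> NameNode
<pid> DataNode
<pid> SecondaryNameNode
<pid> ResourceManager
<pid> NodeManager
<pid> Jps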
Stop Hadoop daemons
$ stop-all.sh
Appreciate yourself because you've done it. You have completed all the Hadoop installation steps and
Hadoop is now ready to run its first program.
Let's run a MapReduce job on our freshly set-up Hadoop cluster:
$ cd /usr/local/hadoop
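For example, the examples JAR bundled with the distribution can be used to run a small pi-estimation
job (the JAR path below matches the Hadoop 3.0.0 layout; adjust it for other versions):
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar pi 2 5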
HADOOP INSTALLATION - CLOUDERA QUICKSTART VM
Select the QuickStart VM with CDH 5.7 and download the VirtualBox install.
Sign in with your Cloudera credentials and then download the VM image. The download is around 5 GB.
Unzip the folder to a suitable path and make sure the .ovf file is available.
Open VirtualBox and import the VM. Ignore any pre-installed VMs you may see; your installation
will be blank.
Hit Start to boot your Hadoop virtual machine. It will take a few minutes to load CentOS 6.7.
HDFS COMMANDS
ls
Lists the contents of a directory.
Example:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 5 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28 02:57 cc
drwxr-xr-x - cloudera cloudera 0 2018-08-23 08:49 dh
drwxr-xr-x - cloudera cloudera 0 2018-08-27 03:19 out
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-28 06:45 poems
appendToFile
Appends the contents of one or more local files to the given file in HDFS. The destination file will be
created if it does not exist.
Example:
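A typical invocation (the file names here are only illustrative) appends the local file sample.txt to the
HDFS file samplefile.txt:
$ hadoop fs -appendToFile sample.txt samplefile.txt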
cat
Copies the contents of the specified file to standard output.
Example:
Found 5 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28 02:57 cc
-rw-r--r-- 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 25 2018-08-30 03:51 examplehdfs
-rw-r--r-- 1 cloudera cloudera 0 2018-08-30 04:25 samplefile.txt
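For instance, to print the contents of example.txt from the listing above to the terminal:
$ hadoop fs -cat example.txt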
mkdir
It creates a new directory.
Example:
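A typical invocation (the directory name newdir matches the listing that follows) is:
$ hadoop fs -mkdir newdir
$ hadoop fs -ls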
Found 6 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28 02:57 cc
-rw-r--r-- 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 25 2018-08-30 03:51 examplehdfs
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir
-rw-r--r-- 1 cloudera cloudera 0 2018-08-30 04:25 newfile.txt
cp command
Used for copying files from one directory to another within HDFS.
Example:
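A typical invocation (file names illustrative, consistent with the listing that follows) copies file1 into
newdir and then lists the destination directory:
$ hadoop fs -cp file1 newdir
$ hadoop fs -ls newdir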
Found 1 items
-rw-r--r-- 1 cloudera cloudera 9 2018-08-03 02:48 newdir/file1
du
Displays the sizes of files and directories contained in the given directory, or the size of a file if it is
just a file.
Example:
[cloudera@quickstart ~]$ hadoop fs -du exmpldir
0 0 exmpldir/file1
38 38 exmpldir/file2
put
Copies a single src file, or multiple src files, from the local file system to the Hadoop distributed file system.
Example:
[cloudera@quickstart ~]$ cat>exmple
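The local file exmple created above can then be copied into HDFS; a typical invocation (the HDFS file
name is illustrative) followed by a listing is:
$ hadoop fs -put exmple outputfile
$ hadoop fs -ls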
Found 5 items
-rw-r--r-- 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 30 2018-08-30 05:01 outputfile
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir
-rw-r--r-- 1 cloudera cloudera 0 2018-08-30 04:25 newfile.txt
copyFromLocal
copyFromLocal is similar to the put command, except that the source is restricted to a local file reference.
Example:
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal poems copyfile
[cloudera@quickstart ~]$ hadoop fs -ls
Found 5 items
-rw-r--r-- 1 cloudera cloudera 52338 2018-08-31 02:08 GSOD
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28
-rw-r--r-- 1 cloudera cloudera 1092 2018-08-23 02:45 cm_api.sh
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-31 08:21 copyfile
get command
Copies a single src file, or multiple src files, from the Hadoop file system to the local file system.
Example:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 6 items
-rw-r--r-- 1 cloudera cloudera 52338 2018-08-31 02:08 GSOD
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-31 08:21 copyfile
-rwx--x-wx 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 30 2018-08-30 05:01 file1
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir
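A typical invocation (the destination is illustrative) copies the HDFS file copyfile from the listing above
into the current local directory:
$ hadoop fs -get copyfile .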
Example:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 6 items
-rw-r--r-- 1 cloudera cloudera 52338 2018-08-31 02:08 GSOD
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-31 08:21 copyfile
-rwx--x-wx 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 30 2018-08-30 05:01 inp
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir
getfacl
Displays the Access Control Lists (ACLs) of files and directories. If a directory has a
default ACL, then getfacl also displays the default ACL.
Example:
[cloudera@quickstart ~]$ hadoop fs -getfacl exampl.txt
# file: exampl.txt
# owner: cloudera
# group: cloudera
user::rwx
group::r-x
other::r-x
moveFromLocal
Same as -put, except that the source is deleted after it's copied.
Example:
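A typical invocation (file names illustrative) moves the local file notes.txt into HDFS and deletes the
local copy:
$ hadoop fs -moveFromLocal notes.txt notes.txt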
mv command
Move files that match the specified file pattern to a destination. When moving
multiple files, the destination must be a directory.
Example:
[cloudera@quickstart ~]$ hadoop fs -mv abc.txt newdir
[cloudera@quickstart ~]$ hadoop fs -ls newdir
Found 1 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 newdir/abc.txt
rm command
Used for removing a file from HDFS. The command -rm -r can be used for a recursive delete.
The -rmdir command can be used to delete empty directories.
Options:
-skipTrash option bypasses trash, if enabled, and immediately deletes <src>
-f If the file does not exist, do not display a diagnostic message or modify the exit
status to reflect an error.
-[rR] Recursively deletes directories
Example:
[cloudera@quickstart ~]$ hadoop fs -rm sample.txt
test command
This command can be used to test whether an HDFS file exists, has zero length, or is a directory.
Options:
-d return 0 if <path> is a directory.
-e return 0 if <path> exists.
-f return 0 if <path> is a file.
-s return 0 if file <path> is greater than zero bytes in size.
-z return 0 if file <path> is zero bytes in size, else return 1.
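For instance, to check whether example.txt exists (the file name is taken from the earlier listings):
$ hadoop fs -test -e example.txt
$ echo $?
echo $? prints 0 if the file exists and a non-zero value otherwise.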
expunge
This command is used to empty the trash in hadoop file system.
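Example:
$ hadoop fs -expunge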
count
Count the number of directories, files and bytes under the paths that match the
specified file pattern.
The output columns are,
DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
Example:
[cloudera@quickstart ~]$ hadoop fs -count newdir
1 2 75 newdir
chmod
Changes permissions of a file.
Example:
[cloudera@quickstart ~]$ hadoop fs -getfacl example.txt
# file: example.txt
# owner: cloudera
# group: cloudera
user::r--
group::--x
other::-w-
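The general syntax is $ hadoop fs -chmod [-R] <MODE | OCTALMODE> PATH. For instance, the
permissions shown above (user r--, group --x, other -w-) correspond to:
$ hadoop fs -chmod 412 example.txt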
chown
Changes owner and group of a file.
Syntax:
$ hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH
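For example, to make cloudera the owner and cloudera the group of newdir, applying the change
recursively (the names here are illustrative):
$ hadoop fs -chown -R cloudera:cloudera newdir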
WordCount
Write a Hadoop MapReduce program to calculate the individual word count of a file.
1. Before you run the sample, you must create input and output locations in HDFS. Use the
following commands to create the input directory /user/cloudera/wordcount/input in HDFS:
$ sudo su hdfs
$ hadoop fs -mkdir /user/cloudera
$ hadoop fs -chown cloudera /user/cloudera
$ exit
$ sudo su cloudera
$ hadoop fs -mkdir /user/cloudera/wordcount /user/cloudera/wordcount/input
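The intermediate steps (placing input files in HDFS and building wordcount.jar) are only sketched here,
assuming the MapReduce source is in a file WordCount.java that defines the class org.myorg.WordCount;
the file names below are illustrative:
$ echo "Hadoop is an open source framework" > file0
$ hadoop fs -put file0 /user/cloudera/wordcount/input
$ mkdir -p build
$ javac -cp $(hadoop classpath) -d build WordCount.java
$ jar -cvf wordcount.jar -C build/ .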
5. Run the WordCount application from the JAR file, passing the paths to the input and
output directories in HDFS.
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input
/user/cloudera/wordcount/output
When you look at the output, all of the words are listed in UTF-8 alphabetical order
(capitalized words first). The number of occurrences from all input files has been reduced to a
single sum for each word.
6. If you want to run the sample again, you first need to remove the output directory. Use
the following command.
$ hadoop fs -rm -r /user/cloudera/wordcount/output