HADOOP LAB
2018-2019
Name:
Roll No:
Section:
Contents
Introduction
HDFS Commands
Word Count
INTRODUCTION
Hadoop is an open-source framework that allows storing and processing big data in a distributed
environment across clusters of computers using simple programming models. It is designed to scale up
from single servers to thousands of machines, each offering local computation and storage.
Due to the advent of new technologies, devices, and communication means like social
networking sites, the amount of data produced by mankind is growing rapidly every year. The amount of
data produced from the beginning of time till 2003 was 5 billion gigabytes; piled up in the form of disks,
it would fill an entire football field. The same amount was created every two days in 2011, and every
ten minutes in 2013, and this rate is still growing enormously. Though all this information is meaningful
and can be useful when processed, most of it is being neglected. The major challenges associated with
such big data include:
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
To meet the above challenges, organizations normally take the help of enterprise servers.
Traditional Approach
In this approach, an enterprise will have a computer to store and process big data. Here data is
stored in an RDBMS like Oracle Database, MS SQL Server or DB2, and sophisticated software can be
written to interact with the database, process the required data and present it to the users for analysis.
Limitation
This approach works well where the volume of data is small enough to be accommodated by standard
database servers, or up to the limit of the processor that is processing the data. But when it comes to
dealing with huge amounts of data, it is a tedious task to process them through a single traditional
database server.
Google’s Solution
Google solved this problem using an algorithm called MapReduce. This algorithm divides the task into
small parts, assigns those parts to many computers connected over the network, and collects the
results to form the final result dataset.
Hadoop
Doug Cutting and his team used the solution provided by Google and started an open-source project
called HADOOP in 2005. Hadoop runs applications using the MapReduce algorithm, where the data is
processed in parallel on different CPU nodes. In short, the Hadoop framework is capable of developing
applications that run on clusters of computers and perform complete statistical analysis of huge
amounts of data.
A Hadoop framework application works in an environment that provides distributed storage and
computation across clusters of computers. Hadoop is designed to scale up from a single server to
thousands of machines, each offering local computation and storage.
Hadoop MapReduce: This is a system for parallel processing of large data sets.
MapReduce
Hadoop MapReduce is a software framework for easily writing applications which process big
amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable,
fault-tolerant manner.
The term MapReduce actually refers to the following two different tasks that Hadoop programs
perform:
The Map Task: This is the first task, which takes input data and converts it into a set of data,
where individual elements are broken down into tuples (key/value pairs).
The Reduce Task: This task takes the output from a map task as input and combines those data
tuples into a smaller set of tuples. The reduce task is always performed after the map task.
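For example, given an input line "the cat sat on the mat", the map task emits the key/value pairs
(the, 1), (cat, 1), (sat, 1), (on, 1), (the, 1), (mat, 1); after the framework groups the pairs by key, the
reduce task receives (the, [1, 1]), (cat, [1]), and so on, and sums each list to produce (the, 2), (cat, 1),
(sat, 1), (on, 1), (mat, 1).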
Typically both the input and the output are stored in a file system. The framework takes care of
scheduling tasks, monitoring them and re-executing the failed tasks.
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per
cluster node. The master is responsible for resource management, tracking resource
consumption/availability, and scheduling the jobs' component tasks on the slaves, monitoring them and
re-executing the failed tasks. The slave TaskTrackers execute the tasks as directed by the master and
provide task-status information to the master periodically.
The JobTracker is a single point of failure for the Hadoop MapReduce service, which means that if the
JobTracker goes down, all running jobs are halted.
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a
distributed file system that is designed to run on large clusters (thousands of computers) of small
computer machines in a reliable, fault-tolerant manner.
HDFS uses a master/slave architecture where master consists of a single NameNode that manages the
file system metadata and one or more slave DataNodes that store the actual data.
A file in an HDFS namespace is split into several blocks and those blocks are stored in a set of
DataNodes. The NameNode determines the mapping of blocks to the DataNodes. The DataNodes take
care of read and write operations on the file system. They also take care of block creation, deletion
and replication based on instructions given by the NameNode.
HDFS provides a shell like any other file system and a list of commands are available to interact with
the file system. These shell commands will be covered in a separate chapter along with appropriate
examples.
Job execution proceeds in stages.
Stage 1
A user/application submits a job to the Hadoop job client, specifying the locations of the input and
output files in HDFS, the jar file containing the map and reduce implementations, and the job
configuration.
Stage 2
The Hadoop job client then submits the job (jar/executable etc.) and configuration to the JobTracker,
which then assumes the responsibility of distributing the software/configuration to the slaves,
scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.
Stage 3
The TaskTrackers on different nodes execute the tasks as per the MapReduce implementation, and the
output of the reduce function is stored in the output files on the file system.
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient,
and it automatically distributes the data and work across the machines and, in turn, utilizes the
underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability (FTHA),
rather Hadoop library itself has been designed to detect and handle failures at the application
layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues to operate
without interruption.
Another big advantage of Hadoop is that apart from being open source, it is compatible with all
platforms since it is Java based.
HADOOP INSTALLATION - UBUNTU
Download Hadoop
$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-3.0.0/hadoop-3.0.0.tar.gz
Unzip it
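For example, assuming the archive was downloaded to the current directory:
$ tar xzf hadoop-3.0.0.tar.gz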
Hadoop Configuration
Make a directory called hadoop under /usr/local and move the contents of the folder 'hadoop-3.0.0' to
this directory
$ sudo mkdir -p /usr/local/hadoop
$ cd hadoop-3.0.0/
$ sudo mv * /usr/local/hadoop
$ sudo chown -R hduser:hadoop /usr/local/hadoop
The following files have to be modified to complete the Hadoop setup:
1. ~/.bashrc
2. hadoop-env.sh
3. core-site.xml
4. hdfs-site.xml
5. yarn-site.xml
~/.bashrc
If you don't know the path where Java is installed, first run the following command to locate it:
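For example, on Ubuntu the following command resolves the symlink and prints the full path of the
java binary, from which the installation directory (e.g. /usr/lib/jvm/java-8-openjdk-amd64) can be read:
$ readlink -f /usr/bin/java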
Now open ~/.bashrc:
$ sudo nano ~/.bashrc
Note: I have used the 'nano' editor; you can use a different one. No issues.
Now once the file is opened, append the following code at the end of the file.
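The exact content depends on your setup; a typical set of lines for this layout (Hadoop under
/usr/local/hadoop, OpenJDK 8) is shown below; adjust the paths to match your system:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"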
Then reload .bashrc so that the changes take effect:
$ source ~/.bashrc
hadoop-env.sh
We need to tell Hadoop the path where Java is installed. That's what we will do in this file: specify the
path in the JAVA_HOME variable.
$ sudo nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
Now, the first variable in the file will be JAVA_HOME; change its value to
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
core-site.xml
The core-site.xml file contains configuration properties that Hadoop uses when starting up. Create a
directory that Hadoop can use for temporary data and open the file:
$ sudo mkdir -p /app/hadoop/tmp
$ sudo chown hduser:hadoop /app/hadoop/tmp
$ sudo nano /usr/local/hadoop/etc/hadoop/core-site.xml
Add the following content to the file:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose scheme and authority determine the
FileSystem implementation. The uri’s scheme determines the config property (fs.SCHEME.impl) naming the
FileSystem implementation class. The uri’s authority is used to determine the host, port, etc. for a
filesystem.</description>
</property>
</configuration>
hdfs-site.xml
The hdfs-site.xml file specifies the directories that will be used on this host as the:
1. Name Node
2. Data Node
Make the directories:
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/namenode
$ sudo mkdir -p /usr/local/hadoop_store/hdfs/datanode
$ sudo nano /usr/local/hadoop/etc/hadoop/hdfs-site.xml
Add the following content to the file:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication. The actual number of replications can be specified when the file is
created. The default is used if replication is not specified at create time.
</description>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop_store/hdfs/datanode</value>
</property>
</configuration>
yarn-site.xml
$ sudo nano /usr/local/hadoop/etc/hadoop/yarn-site.xml
Just like the other two, add the following content between the configuration tags.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
Format the Hadoop file system (this needs to be done only once, before starting Hadoop for the first time):
$ hadoop namenode -format
Start Hadoop daemons
Now that the Hadoop installation is complete and the name node is formatted, we can start Hadoop
from the following directory.
$ cd /usr/local/hadoop/sbin
$ start-all.sh
Just check if all daemons are properly started using the following command:
$ jps
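If everything started correctly, the output lists the running Hadoop daemons, similar to the following
(process IDs will differ):
<pid> NameNode
<pid> DataNode
<pid> SecondaryNameNode
<pid> ResourceManager
<pid> NodeManager
<pid> Jps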
Stop Hadoop daemons
$ stop-all.sh
Appreciate yourself because you've done it. You have completed all the Hadoop installation steps and
Hadoop is now ready to run its first program.
Let's run a MapReduce job on our freshly set-up Hadoop cluster:
$ cd /usr/local/hadoop
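For example, the examples JAR bundled with the distribution can be used to run a small pi-estimation
job (the JAR path below matches the Hadoop 3.0.0 layout; adjust it for other versions):
$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.0.0.jar pi 2 5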
HADOOP INSTALLATION - CLOUDERA QUICKSTART VM
Select the QuickStart VM with CDH 5.7 and download the VirtualBox install.
Sign in with your Cloudera credentials and then download the VM image. The download is around 5 GB.
Unzip the folder to a suitable path and make sure the .ovf file is available.
Open VirtualBox and import the VM. Ignore any pre-installed VMs you may see; your installation
will be blank.
Hit Start to boot your Hadoop virtual machine. It will take a few minutes to load CentOS 6.7.
HDFS COMMANDS
ls
Lists the contents of a directory.
Example:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 5 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28 02:57 cc
drwxr-xr-x - cloudera cloudera 0 2018-08-23 08:49 dh
drwxr-xr-x - cloudera cloudera 0 2018-08-27 03:19 out
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-28 06:45 poems
appendToFile
Appends the contents of one or more local files to the given file in HDFS. The destination file will be
created if it does not exist.
Example:
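A typical invocation (the file names here are only illustrative) appends the local file sample.txt to the
HDFS file samplefile.txt:
$ hadoop fs -appendToFile sample.txt samplefile.txt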
cat
Copies the contents of the specified file to standard output.
Example:
Found 5 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28 02:57 cc
-rw-r--r-- 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 25 2018-08-30 03:51 examplehdfs
-rw-r--r-- 1 cloudera cloudera 0 2018-08-30 04:25 samplefile.txt
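For instance, to print the contents of example.txt from the listing above to the terminal:
$ hadoop fs -cat example.txt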
mkdir
It creates a new directory.
Example:
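A typical invocation (the directory name newdir matches the listing that follows) is:
$ hadoop fs -mkdir newdir
$ hadoop fs -ls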
Found 6 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28 02:57 cc
-rw-r--r-- 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 25 2018-08-30 03:51 examplehdfs
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir
-rw-r--r-- 1 cloudera cloudera 0 2018-08-30 04:25 newfile.txt
cp command
Used for copying files from one directory to another within HDFS.
Example:
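A typical invocation (file names illustrative, consistent with the listing that follows) copies file1 into
newdir and then lists the destination directory:
$ hadoop fs -cp file1 newdir
$ hadoop fs -ls newdir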
Found 1 items
-rw-r--r-- 1 cloudera cloudera 9 2018-08-03 02:48 newdir/file1
du
Displays the sizes of files and directories contained in the given directory, or the size of a file if it is
just a file.
Example:
[cloudera@quickstart ~]$ hadoop fs -du exmpldir
0 0 exmpldir/file1
38 38 exmpldir/file2
put
Copies a single src file, or multiple src files, from the local file system to the Hadoop distributed file system.
Example:
[cloudera@quickstart ~]$ cat>exmple
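The local file exmple created above can then be copied into HDFS; a typical invocation (the HDFS file
name is illustrative) followed by a listing is:
$ hadoop fs -put exmple outputfile
$ hadoop fs -ls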
Found 5 items
-rw-r--r-- 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 30 2018-08-30 05:01 outputfile
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir
-rw-r--r-- 1 cloudera cloudera 0 2018-08-30 04:25 newfile.txt
copyFromLocal
copyFromLocal is similar to the put command, except that the source is restricted to a local file reference.
Example:
[cloudera@quickstart ~]$ hadoop fs -copyFromLocal poems copyfile
[cloudera@quickstart ~]$ hadoop fs -ls
Found 5 items
-rw-r--r-- 1 cloudera cloudera 52338 2018-08-31 02:08 GSOD
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 abc.txt
drwxr-xr-x - cloudera cloudera 0 2018-08-28
-rw-r--r-- 1 cloudera cloudera 1092 2018-08-23 02:45 cm_api.sh
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-31 08:21 copyfile
get command
Copies a single src file, or multiple src files, from the Hadoop file system to the local file system.
Example:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 6 items
-rw-r--r-- 1 cloudera cloudera 52338 2018-08-31 02:08 GSOD
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-31 08:21 copyfile
-rwx--x-wx 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 30 2018-08-30 05:01 file1
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir
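A typical invocation (the destination is illustrative) copies the HDFS file copyfile from the listing above
into the current local directory:
$ hadoop fs -get copyfile .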
Example:
[cloudera@quickstart ~]$ hadoop fs -ls
Found 6 items
-rw-r--r-- 1 cloudera cloudera 52338 2018-08-31 02:08 GSOD
-rw-r--r-- 1 cloudera cloudera 268141 2018-08-31 08:21 copyfile
-rwx--x-wx 1 cloudera cloudera 37 2018-08-30 04:13 example.txt
-rw-r--r-- 1 cloudera cloudera 30 2018-08-30 05:01 inp
drwxr-xr-x - cloudera cloudera 0 2018-08-30 04:45 newdir
getfacl
Displays the Access Control Lists (ACLs) of files and directories. If a directory has a
default ACL, then getfacl also displays the default ACL.
Example:
[cloudera@quickstart ~]$ hadoop fs -getfacl exampl.txt
# file: exampl.txt
# owner: cloudera
# group: cloudera
user::rwx
group::r-x
other::r-x
moveFromLocal
Same as -put, except that the source is deleted after it's copied.
Example:
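A typical invocation (file names illustrative) moves the local file notes.txt into HDFS and deletes the
local copy:
$ hadoop fs -moveFromLocal notes.txt notes.txt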
mv command
Move files that match the specified file pattern to a destination. When moving
multiple files, the destination must be a directory.
Example:
[cloudera@quickstart ~]$ hadoop fs -mv abc.txt newdir
[cloudera@quickstart ~]$ hadoop fs -ls newdir
Found 1 items
-r---wx--x 1 cloudera cloudera 66 2018-08-23 06:54 newdir/abc.txt
rm command
Used for removing a file from HDFS. The command -rm -r can be used for a recursive delete.
The -rmdir command can be used to delete empty directories.
Options:
-skipTrash option bypasses trash, if enabled, and immediately deletes <src>
-f If the file does not exist, do not display a diagnostic message or modify the exit
status to reflect an error.
-[rR] Recursively deletes directories
Example:
[cloudera@quickstart ~]$ hadoop fs -rm sample.txt
test command
This command can be used to test whether an HDFS file exists, has zero length, or is a directory.
Options:
-d return 0 if <path> is a directory.
-e return 0 if <path> exists.
-f return 0 if <path> is a file.
-s return 0 if file <path> is greater than zero bytes in size.
-z return 0 if file <path> is zero bytes in size, else return 1.
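For instance, to check whether example.txt exists (the file name is taken from the earlier listings):
$ hadoop fs -test -e example.txt
$ echo $?
echo $? prints 0 if the file exists and a non-zero value otherwise.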
expunge
This command is used to empty the trash in hadoop file system.
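Example:
$ hadoop fs -expunge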
count
Count the number of directories, files and bytes under the paths that match the
specified file pattern.
The output columns are,
DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
Example:
[cloudera@quickstart ~]$ hadoop fs -count newdir
1 2 75 newdir
chmod
Changes permissions of a file.
Example:
[cloudera@quickstart ~]$ hadoop fs -getfacl example.txt
# file: example.txt
# owner: cloudera
# group: cloudera
user::r--
group::--x
other::-w-
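The general syntax is $ hadoop fs -chmod [-R] <MODE | OCTALMODE> PATH. For instance, the
permissions shown above (user r--, group --x, other -w-) correspond to:
$ hadoop fs -chmod 412 example.txt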
chown
Changes owner and group of a file.
Syntax:
$ hadoop fs -chown [-R] [OWNER][:[GROUP]] PATH
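For example, to make cloudera the owner and cloudera the group of newdir, applying the change
recursively (the names here are illustrative):
$ hadoop fs -chown -R cloudera:cloudera newdir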
WordCount
Write a Hadoop MapReduce program to calculate the individual word count of a file.
1. Before you run the sample, you must create input and output locations in HDFS. Use the
following commands to create the input directory /user/cloudera/wordcount/input in HDFS:
$ sudo su hdfs
$ hadoop fs -mkdir /user/cloudera
$ hadoop fs -chown cloudera /user/cloudera
$ exit
$ sudo su cloudera
$ hadoop fs -mkdir /user/cloudera/wordcount /user/cloudera/wordcount/input
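The intermediate steps (placing input files in HDFS and building wordcount.jar) are only sketched here,
assuming the MapReduce source is in a file WordCount.java that defines the class org.myorg.WordCount;
the file names below are illustrative:
$ echo "Hadoop is an open source framework" > file0
$ hadoop fs -put file0 /user/cloudera/wordcount/input
$ mkdir -p build
$ javac -cp $(hadoop classpath) -d build WordCount.java
$ jar -cvf wordcount.jar -C build/ .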
5. Run the WordCount application from the JAR file, passing the paths to the input and
output directories in HDFS.
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input
/user/cloudera/wordcount/output
When you look at the output, all of the words are listed in UTF-8 alphabetical order
(capitalized words first). The number of occurrences from all input files has been reduced to a
single sum for each word.
6. If you want to run the sample again, you first need to remove the output directory. Use
the following command.
$ hadoop fs -rm -r /user/cloudera/wordcount/output