BIG DATA AND ANALYTICS LAB

LAB NO2: Implementation of file management in Hadoop

As data velocity grows, the data size easily outgrows the storage limit of a single machine.
A solution is to store the data across a network of machines; such filesystems are
called distributed filesystems. Since data is stored across a network, all the complications
of networking come into play.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS
(Hadoop Distributed File System) is a unique design that provides storage for extremely
large files with a streaming data access pattern, and it runs on commodity hardware. Let's
elaborate on these terms:
• Extremely large files: Here we are talking about data in the range of petabytes (1000
TB).
• Streaming data access pattern: HDFS is designed on the principle of write-once,
read-many-times. Once data is written, large portions of the dataset can be processed any
number of times.
• Commodity hardware: Hardware that is inexpensive and easily available in the
market. This is one of the features that especially distinguishes HDFS from other file
systems.
Nodes: An HDFS cluster is typically formed by master and slave nodes.
1. NameNode (master node):
• Manages all the slave nodes and assigns work to them.
• Executes filesystem namespace operations like opening, closing, and renaming
files and directories.
• Should be deployed on reliable, high-specification hardware, not on
commodity hardware.
2. DataNode (slave node):
• Actual worker nodes, which do the actual work like reading, writing, processing,
etc.
• Also perform block creation, deletion, and replication upon instruction from the
master.
• Can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in the background.
• NameNode:
• Runs on the master node.
• Stores metadata (data about data) like the file path, the number of blocks, block
IDs, etc.
• Requires a large amount of RAM.
• Stores metadata in RAM for fast retrieval, i.e. to reduce seek time, though a
persistent copy of it is kept on disk.
• DataNode:
• Runs on the slave nodes.
• Requires large storage capacity, as the data is actually stored here.
Data storage in HDFS: Now let's see how data is stored in a distributed manner.
Assume that a 100 TB file is inserted. The master node (NameNode) will first divide the
file into blocks (the default block size is 128 MB in Hadoop 2.x and above). These
blocks are then stored across different DataNodes (slave nodes).
The DataNodes (slave nodes) replicate the blocks among themselves, and the information about
which blocks they contain is sent to the master. The default replication factor is 3, meaning
3 replicas are created for each block (including the original). In hdfs-site.xml we can increase or
decrease the replication factor, i.e. we can edit this configuration there.
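As a sketch of how this can also be done from the command line (the cluster-wide default lives in the dfs.replication property of hdfs-site.xml; the target path below simply reuses the example file from later in this lab, and the new factor of 2 is only an illustration):
hadoop fs -setrep -w 2 /user/saurzcode/dir1/abc.txt    # change the replication factor of one file and wait for re-replication
hadoop fs -ls /user/saurzcode/dir1/abc.txt             # the second column of the listing shows the replication factor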
Note: The master node has a record of everything. It knows the location and details of each
and every DataNode and the blocks they contain, i.e. nothing is done without the
permission of the master node.
Why divide the file into blocks?
Answer: Assume that we don't divide: it is very difficult to store a 100 TB file
on a single machine. Even if we could store it, each read and write operation on that whole
file would incur a very high seek time. But if we have multiple blocks of size 128 MB,
then it becomes easy to perform various read and write operations on them compared to
doing them on the whole file at once. So we divide the file to get faster data access, i.e.
reduced seek time.
Why replicate the blocks across DataNodes while storing?
Answer: Assume we don't replicate and only one copy of a given block is present on
DataNode D1. Now if DataNode D1 crashes, we lose that block, which
makes the overall data inconsistent and faulty. So we replicate the blocks to
achieve fault tolerance.
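To actually see how a file has been split into blocks and where the replicas live, a quick sketch using the fsck utility (assuming the hdfs client is on the PATH and reusing the example path from this lab):
hdfs fsck /user/saurzcode/dir1/abc.txt -files -blocks -locations    # prints each block, its replication, and the DataNodes holding it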
Terms related to HDFS:
• Heartbeat: the signal that a DataNode continuously sends to the NameNode. If the
NameNode doesn't receive a heartbeat from a DataNode, it considers that node dead.
• Balancing: if a DataNode crashes, the blocks present on it are gone too, and those
blocks become under-replicated compared to the remaining blocks. The master
node (NameNode) then signals the DataNodes containing replicas of the lost
blocks to re-replicate them, so that the overall distribution of blocks stays balanced.
• Replication: it is done by the DataNodes.
Note: No two replicas of the same block are placed on the same DataNode.
Features:
• Distributed data storage.
• Blocks reduce seek time.
• The data is highly available, as the same block is present on multiple DataNodes.
• Even if multiple DataNodes are down we can still do our work, which makes the system highly
reliable.
• High fault tolerance.
Limitations: Though HDFS provides many features, there are some areas where it
doesn't work well.
• Low-latency data access: applications that require low-latency access to data, i.e. in
the range of milliseconds, will not work well with HDFS, because HDFS is designed
with high throughput of data in mind, even at the cost of latency.
• Small-file problem: having lots of small files results in lots of seeks and lots of
movement from one DataNode to another to retrieve each small file; this
whole process is a very inefficient data access pattern.
File Management tasks in Hadoop
1. Create a directory in HDFS at given path(s).
Usage:
hadoop fs -mkdir <paths>
Example:
hadoop fs -mkdir /user/saurzcode/dir1 /user/saurzcode/dir2
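If intermediate parent directories also need to be created, recent Hadoop releases accept a -p flag (a small sketch; the path sub1 is only an illustration):
hadoop fs -mkdir -p /user/saurzcode/dir1/sub1    # creates dir1 and sub1 in one step if they do not already exist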

2. List the contents of a directory.


Usage :
hadoop fs -ls <args>
Example:
hadoop fs -ls /user/saurzcode
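To list a directory tree recursively, an -R flag can be added (a sketch, assuming Hadoop 2.x or later):
hadoop fs -ls -R /user/saurzcode    # recursively lists all files and sub-directories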

3. Upload and download a file in HDFS.


Upload: hadoop fs -put:
Copies a single source file, or multiple source files, from the local file system to the Hadoop distributed file system.
Usage:
hadoop fs -put <localsrc> ... <HDFS_dest_Path>
Example:
hadoop fs -put /home/saurzcode/Samplefile.txt /user/saurzcode/dir3/
Download: hadoop fs -get:
Copies/downloads files from HDFS to the local file system.
Usage:
hadoop fs -get <hdfs_src> <localdst>
Example:
hadoop fs -get /user/saurzcode/dir3/Samplefile.txt /home/

4. See contents of a file


Same as the Unix cat command:
Usage:
hadoop fs -cat <path[filename]>
Example:
hadoop fs -cat /user/saurzcode/dir1/abc.txt
5. Copy a file from source to destination
This command also allows multiple sources, in which case the destination must be a directory (see the extra example below).
Usage:
hadoop fs -cp <source> <dest>
Example:
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2
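Since multiple sources are allowed when the destination is a directory, a sketch with two source files (def.txt is hypothetical, used only for illustration):
hadoop fs -cp /user/saurzcode/dir1/abc.txt /user/saurzcode/dir1/def.txt /user/saurzcode/dir2    # both files are copied into dir2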

6. Copy a file between the local file system and HDFS


copyFromLocal
Usage:
hadoop fs -copyFromLocal <localsrc> URI
Example:
hadoop fs -copyFromLocal /home/saurzcode/abc.txt /user/saurzcode/abc.txt

Similar to put command, except that the source is restricted to a local file reference.
copyToLocal
Usage:
hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a local file reference.
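For symmetry with the copyFromLocal example above, a minimal sketch of copyToLocal (the paths simply reverse the earlier example):
hadoop fs -copyToLocal /user/saurzcode/abc.txt /home/saurzcode/abc.txt    # download the HDFS file back to the local home directory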

7. Move file from source to destination.


Note: Moving files across filesystems is not permitted.
Usage :
hadoop fs -mv <src> <dest>
Example:
hadoop fs -mv /user/saurzcode/dir1/abc.txt /user/saurzcode/dir2

8. Remove a file or directory in HDFS.


Removes the files specified as arguments. Deletes a directory only when it is empty.

Usage :

hadoop fs -rm <arg>
Example:


hadoop fs -rm /user/saurzcode/dir1/abc.txt

9. Recursive version of delete.


Usage :
hadoop fs -rmr <arg>
Example:
hadoop fs -rmr /user/saurzcode/
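Note that in Hadoop 2.x and later -rmr is reported as deprecated; the recursive flag of -rm does the same job (a sketch reusing the example directory):
hadoop fs -rm -r /user/saurzcode/dir1    # recursively removes dir1 and everything under it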

10. Display last few lines of a file.


Similar to tail command in Unix.
Usage :
hadoop fs -tail <path[filename]>
Example:
hadoop fs -tail /user/saurzcode/dir1/abc.txt
11. Display the aggregate length of a file.
Usage :
hadoop fs -du <path>
Example:
hadoop fs -du /user/saurzcode/dir1/abc.txt
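Recent Hadoop releases also accept -s (summary) and -h (human-readable sizes) with -du (a sketch, assuming Hadoop 2.x or later):
hadoop fs -du -s -h /user/saurzcode    # total size of the directory, printed in a human-readable unit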
INSTALLATION OF THE MOVIELENS DATASET

The dataset is first loaded through the File View (each step below corresponds to a screenshot in the original lab sheet):
• Open the File View.
• Go to the page of files.
• Make a new folder.
• Upload the files (the data/item files from ml-100k).
• Select both files and concatenate them.
• Result after the concatenation of the files.
INSTALL THE MOVIELENS DATA INTO HDFS USING THE COMMAND LINE

• To use the command line we have to download the PuTTY software (an SSH client):
https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
• After installing PuTTY, open it and create a session:
Host name: [email protected]
Port: 2222
Saved session: HDP
• Enter the password: maria_dev
• Now you will be logged in to Hadoop.
• Enter the query:
hadoop fs -ls        (lists the file system)
• The file system contents are shown in Hadoop.
• Run other commands on the command line:
pwd
wget http://media.sundog-soft.com/hadoop/m1-100k
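Putting these command-line steps together, a minimal sketch of downloading the dataset and loading it into HDFS (the wget URL is copied verbatim from the lab sheet; the HDFS directory name ml-100k and the local file name u.data below are assumptions used only for illustration):
pwd                                                # confirm the current local directory
wget http://media.sundog-soft.com/hadoop/m1-100k    # download the dataset (URL as given above)
hadoop fs -mkdir ml-100k                           # create a target directory under the user's HDFS home (name assumed)
hadoop fs -put u.data ml-100k/                     # upload the downloaded file (file name assumed)
hadoop fs -ls ml-100k                              # verify that the file is now present in HDFS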
