IMTC634_Data Science_Chapter 14

Chapter 14: Hadoop File System for Storing Data
Chapter Index

1. Learning Objectives
2. Topic 1: Hadoop Distributed File System
3. Topic 2: HDFS Architecture
4. Topic 3: Features of HDFS
5. Topic 4: Data Integrity in HDFS
6. Topic 5: Features of HBase
7. Let’s Sum Up


Learning Objectives

 Discuss Hadoop Distributed File System

 Explain HDFS Architecture

 Describe the features of HDFS

 Describe the data integrity in HDFS

 Explain the features of HBase


1. Hadoop Distributed File System

 HDFS is a useful distributed file system for storing very large data sets.

 It is a data service that offers a unique set of capabilities needed when data volumes and velocity are high, while keeping reliability in check.

 Unlike other file systems that constantly read and write data, HDFS writes data only once and reads it many times thereafter.

 HDFS stores file metadata and application data separately, on NameNode and DataNode servers, respectively.

 The file content is divided into several small data blocks, which are replicated across the DataNodes.
1. Hadoop Distributed File System

 During data transfer, HDFS can provide a data flow with a constant bitrate above a particular threshold, instead of having the data flow in bursts.

 Applications running on HDFS require streaming access to their data sets; HDFS is designed for batch processing.

 Hadoop does not require large, exceptionally dependable hardware to run. It can be installed on generic commodity hardware.

 Since the NameNode holds the file system metadata in memory, the number of files a file system can hold is limited by the amount of memory available on the NameNode server.
2. HDFS Architecture

 HDFS consists of a central NameNode and multiple DataNodes running on a commodity cluster. The following figure shows the architecture of HDFS:

[Figure: HDFS architecture]
2. HDFS Architecture

 NameNode and DataNode are software components specifically designed to run on commodity hardware.

 HDFS allows simultaneous application execution across multiple servers built from economical internal disk drives.

 The NameNode manages the HDFS cluster metadata, whereas DataNodes store the data. Files and directories are presented to the NameNode by clients.

 DataNodes serve read and write requests from the clients.

2. HDFS Architecture

Concept of Blocks in HDFS Architecture

 A disk has a certain block size, which is the minimum amount of data that it can read or write.

 File system blocks are commonly a few kilobytes in size, while disk blocks are typically 512 bytes.

 HDFS blocks are large in comparison to disk blocks because they have to minimize the cost of seek operations. For example, with the common 128 MB default block size (64 MB in older Hadoop releases), a 1 GB file is stored as just eight blocks.

 Map tasks of MapReduce (the processing component of Hadoop that works alongside HDFS) typically operate on one block at a time, so if you have too few tasks (fewer than the nodes in the cluster), your job will run slower than it could.
2. HDFS Architecture

NameNodes and DataNodes

 An HDFS cluster has two node types working in a master-slave design: a NameNode (the master) and multiple DataNodes (the slaves).
 The NameNode manages the file system namespace. It stores the metadata for all the files and directories in the file system.
 By communicating with the NameNode and DataNodes, a client accesses the file system on behalf of the user.
 DataNodes are the workhorses of the file system. When asked by a client or the NameNode, they store and retrieve data blocks, and they periodically report the list of blocks they are storing back to the NameNode.
 The file system cannot be accessed without the NameNode. If the NameNode crashes, all the files on the file system would be lost.
 DataNodes confirm their connectivity with the NameNode by sending heartbeat messages.
2. HDFS Architecture

The Command-Line Interface

 There are numerous interfaces for HDFS; however, the command line is one of the easiest and most popular among developers.

 There are two properties set in the pseudo-distributed setup that call for further clarification.

 The first is fs.default.name, set to hdfs://localhost/, which is used to set the default Hadoop file system.

 File systems are identified by a URI, and here we have used an HDFS URI to configure Hadoop.

 The HDFS daemons will use this property to determine the host and port of the HDFS NameNode.
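
As a minimal sketch (not part of the original slides), the same property can also be set programmatically through the Hadoop Configuration API; note that fs.default.name is deprecated in favor of fs.defaultFS in newer Hadoop releases:

import org.apache.hadoop.conf.Configuration;

public class HdfsConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Point clients at the NameNode; newer releases use "fs.defaultFS".
        conf.set("fs.default.name", "hdfs://localhost/");
        System.out.println("Default file system: " + conf.get("fs.default.name"));
    }
}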
2. HDFS Architecture

Using HDFS Files

 The HDFS file system is accessed by user applications with the help of the HDFS client, a library that exposes the interface of the HDFS file system and hides almost all the complexities of the HDFS implementation.
 An object of the FileSystem class is created for accessing HDFS. The FileSystem class is an abstract base class for a generic file system, and user code referring to HDFS must be written against a FileSystem object. An instance of the FileSystem class can be obtained by passing a new Configuration object to the static FileSystem.get() factory method.
 The code to create an instance of the FileSystem class is shown here:
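
The slides do not include the listing itself, so here is a minimal sketch of this step, assuming a reachable HDFS deployment and the Hadoop client libraries on the classpath:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class GetFileSystem {
    public static void main(String[] args) throws IOException {
        // Loads core-site.xml / hdfs-site.xml found on the classpath.
        Configuration conf = new Configuration();
        // Returns the FileSystem implementation for the configured default URI.
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Connected to: " + fs.getUri());
    }
}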
2. HDFS Architecture

 Path is another important HDFS object; it specifies the name of a file or directory in a file system.
 You can create a Path object from a string specifying the location of the file or directory on HDFS.
 Both the FileSystem and Path objects allow you to perform programmatic operations on HDFS files and directories. The following code shows the manipulation of HDFS objects:
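
Again, the listing is missing from the slides; this sketch shows typical FileSystem/Path manipulations (the directory and file names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPathSketch {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());

        // A Path can be built from a string location on HDFS.
        Path dir = new Path("/user/demo");
        if (!fs.exists(dir)) {
            fs.mkdirs(dir);            // create the directory tree
        }

        // Create a file inside the directory and write a short record.
        Path file = new Path(dir, "sample.txt");
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("hello hdfs");
        out.close();

        // Rename the file, then delete it (false = non-recursive).
        Path renamed = new Path(dir, "renamed.txt");
        fs.rename(file, renamed);
        fs.delete(renamed, false);
    }
}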
2. HDFS Architecture

HDFS Commands

 HDFS and the other file systems supported by Hadoop (e.g., the local FS, HFTP FS, and S3 FS) can be interacted with directly through various shell-like commands.

 Most commands in the FS shell are similar to Unix commands and perform almost identical functions.

 For reading and writing data to networks, databases, and files, the org.apache.hadoop.io package provides generic I/O code. The package provides various interfaces, classes, and exceptions.
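
A few representative FS shell invocations of the kind just described (the paths are illustrative):

hadoop fs -mkdir /user/demo           # create a directory
hadoop fs -put data.txt /user/demo/   # copy a local file into HDFS
hadoop fs -ls /user/demo              # list the directory, Unix-style
hadoop fs -cat /user/demo/data.txt    # print the file contents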
2. HDFS Architecture

HDFS High Availability

 The HDFS cluster is not available if its NameNode is not active.
 The cluster would remain unavailable after an unplanned event, such as a machine crash, until an operator restarts the NameNode.
 Generally, a typical HA cluster has two separate machines configured as NameNodes. At any given instant, one of the NameNodes is in the active state while the other is in the standby state.
 NameNode machines: These are the machines on which the Active and Standby NameNodes run. The NameNode machines must have similar hardware configurations.
 Shared storage: Both NameNode machines must have read/write access to a shared directory.
3. Features of HDFS

 Data replication, data resilience, and data integrity are the three key features of HDFS.

 A client application writes a block to the first DataNode in the pipeline. That DataNode forwards the data block to the next node in the pipeline, which in turn forwards it to the next node, and so on.

 When a file is divided into blocks and the replicated blocks are distributed across the DataNodes of a cluster, the process requires careful execution, as even a minute variation may result in corrupt data.
3. Features of HDFS

 HDFS ensures data integrity throughout the cluster with the help of the following features:

 Maintaining transaction logs: HDFS maintains transaction logs in order to monitor every operation and to carry out effective auditing and recovery of data in case something goes wrong.
 Validating checksums: A checksum is an effective error-detection technique in which a numerical value, computed from the bits of a transmitted message, accompanies that message. HDFS uses checksum validation to verify the content of a file.
 Creating data blocks: HDFS maintains replicated copies of data blocks to avoid corruption of a file due to the failure of a server.
4. Data Integrity in HDFS

 HDFS transparently checksums all data written to it and, by default, verifies checksums when data is read.

 DataNodes are responsible for verifying the data they receive before storing the data and its checksum.

 Every DataNode keeps a persistent log of checksum verifications, so it knows when each of its blocks was last verified.

 Every DataNode runs a DataBlockScanner in a background thread that periodically checks all the blocks stored on the DataNode. This guards against corruption due to “bit rot” in the physical storage media.

 Since HDFS stores replicas of blocks, it can “repair” corrupted blocks by copying one of the good replicas to produce a new, uncorrupted replica.
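
As a usage note, the FS shell also exposes a -checksum option for inspecting a file's checksum from the client side (the path is illustrative):

hadoop fs -checksum /user/demo/data.txt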
5. Features of HBase

 HBase is a column-oriented distributed database built on top of HDFS.

 HBase is used when you need real-time read/write access to huge datasets.

 The main features of HBase are:

 Consistency: Despite not being an ACID implementation, HBase supports consistent read and write operations. Where RDBMS-backed features such as full transaction support or typed columns are not required, this makes HBase suitable for high-speed workloads.
5. Features of HBase

 Sharding: HBase supports operations such as transparent, automatic splitting and redistribution of content, distributing data using the underlying file system.

 High availability: HBase implements region servers to ensure the recovery of LAN and WAN operations in case of a failure. The master server at the core monitors the region servers and manages all the metadata for the cluster.

 Client API: HBase supports programmatic access using Java APIs (a brief sketch follows this list).

 Support for IT operations: HBase provides a set of built-in web pages for viewing detailed operational insights about the system.
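
A minimal sketch of the Java client API (the table name, column family, and values are illustrative; assumes an HBase 1.x-or-later client on the classpath and a reachable cluster):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo_table"))) {
            // Write one cell: row "row1", column family "cf", qualifier "col".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"),
                          Bytes.toBytes("value"));
            table.put(put);

            // Read the cell back.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            System.out.println(Bytes.toString(value));
        }
    }
}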
Let’s Sum Up

 Hadoop Distributed File System (HDFS) is a resilient, flexible, clustered approach to file management in a Big Data setup, and it runs on commodity hardware.

 HDFS is a useful distributed file system for storing very large data sets. It is a data service that offers a unique set of capabilities needed when data volumes and velocity are high, while keeping reliability in check.

 HDFS consists of a central NameNode and multiple DataNodes running on a commodity cluster, and it offers the highest performance levels when the entire cluster is on the same physical rack.
Let’s Sum Up

 HDFS blocks are large in comparison to disk blocks because they have to minimize the cost of seek operations.

 An HDFS cluster has two node types working in a master-slave design: a NameNode (the master) and multiple DataNodes (the slaves).

 The availability of an HDFS cluster depends upon the availability of the NameNode. In other words, the HDFS cluster is not available if its NameNode is not active.

 Data replication, data resilience, and data integrity are the three key features of HDFS.
THANK YOU
