Hadoop Architecture - Hadoop Distributed File System (HDFS)-2

Uploaded by

cakvlr


Preface

Content of this Lecture:

In this lecture, we will discuss the design goals of HDFS, the
read/write process in HDFS, and the main configuration
tuning parameters that control HDFS performance and
robustness.

Big Data Computing Vu Pham Hadoop Distributed File System (HDFS)

Introduction
Hadoop provides a distributed file system and a framework for
the analysis and transformation of very large data sets using
the MapReduce paradigm.

An important characteristic of Hadoop is the partitioning of
data and computation across many (thousands of) hosts, and
the execution of application computations in parallel, close to
their data.

A Hadoop cluster scales computation capacity, storage capacity,
and I/O bandwidth by simply adding commodity servers.

Hadoop clusters at Yahoo! span 25,000 servers and store 25
petabytes of application data, with the largest cluster being
3,500 servers. One hundred other organizations worldwide
report using Hadoop.
Introduction
Hadoop is an Apache project; all components are available via
the Apache open source license.

Yahoo! has developed and contributed to 80% of the core of
Hadoop (HDFS and MapReduce).

HBase was originally developed at Powerset, now a department
at Microsoft.
Hive was originated and developed at Facebook.
Pig, ZooKeeper, and Chukwa were originated and developed at
Yahoo!
Avro was originated at Yahoo! and is being co-developed with
Cloudera.


Hadoop Project Components

Component    Description
HDFS         Distributed file system
MapReduce    Distributed computation framework
HBase        Column-oriented table service
Pig          Dataflow language and parallel execution framework
Hive         Data warehouse infrastructure
ZooKeeper    Distributed coordination service
Chukwa       System for collecting management data
Avro         Data serialization system


HDFS Design Concepts
Scalable distributed filesystem: Performance scales as you add
disks. Adding nodes adds many disks at once, which scales out
both capacity and aggregate performance.

Distributed data on local disks on several nodes.

Low-cost commodity hardware: You still get a lot of performance
out of it because you are aggregating the performance of many
machines.

[Figure: data blocks B1, B2, …, Bn stored on Node 1, Node 2, …, Node n]


HDFS Design Goals
Hundreds/thousands of nodes and disks:
With this many components there is a higher probability of hardware
failure, so the design needs to handle node and disk failures.

Portability across heterogeneous hardware/software:
The implementation must run across many different kinds of hardware
and software.

Handle large data sets:
Need to handle terabytes to petabytes.

Enable processing with high throughput.


Techniques to meet HDFS design goals
Simplified coherency model:
The idea is to write once and then read many times. This reduces
the number of operations required to commit a write.

Data replication:
Helps to handle hardware failures.
Copies of the same piece of data are spread across different nodes.

Move computation close to the data:
Data is not moved around the network, which improves performance and
throughput.

Relax POSIX requirements to increase throughput.
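
The data replication idea can be sketched in a few lines of Python. This is a minimal illustration only: the function name place_replicas and the node names are hypothetical, and the real HDFS placement policy also takes rack topology into account, which this sketch ignores.

```python
import random

def place_replicas(nodes, replication=3, rng=None):
    """Pick `replication` distinct nodes to hold copies of one block."""
    rng = rng or random.Random()
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    # sample() never returns the same node twice, so each copy of the
    # block lands on a different machine.
    return rng.sample(nodes, replication)

nodes = ["node1", "node2", "node3", "node4", "node5"]
targets = place_replicas(nodes, replication=3, rng=random.Random(0))
print(targets)
```

Because the copies always land on different nodes, losing any single node leaves at least two copies of the block available.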


Basic architecture of HDFS


HDFS Architecture: Key Components

Single NameNode: A master server that manages the file system
namespace and regulates access to files by clients. It also keeps
track of where the data is on the DataNodes and how the blocks
are distributed.

Multiple DataNodes: Typically one per node in a cluster, so each
DataNode uses storage that is local to its node.

Basic functions of a DataNode:
Manage the storage on the DataNode.
Serve read and write requests from clients.
Perform block creation, deletion, and replication based on instructions
from the NameNode.


Original HDFS Design
Single NameNode
Multiple DataNodes
Manage storage: blocks of data
Serve read/write requests from clients
Block creation, deletion, replication

Big Data Computing Vu Pham Big Data Enabling Technologies

HDFS in Hadoop 2
HDFS Federation: The idea is to have multiple NameNodes as well as
multiple DataNodes, so that the namespace can scale. Recall that in
the original design a single NameNode handles all of the namespace
responsibilities. With thousands of nodes and billions of files, that
single NameNode does not scale. Federation was introduced to address
this, and it also brings performance improvements.

Benefits:
Increase namespace scalability
Performance
Isolation


HDFS in Hadoop 2
How it's done:
Multiple NameNode servers
Multiple namespaces
Data is now stored in block pools

There is a block pool associated with each NameNode (namespace),
and these pools are spread out over all the DataNodes.
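
The relationship between NameNodes, block pools, and DataNodes can be sketched as a small data model. This is a rough Python illustration under stated assumptions, not HDFS code: the class names, the pool IDs ("pool-A", "pool-B"), the namespaces, and the block IDs are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class DataNode:
    name: str
    # block pool ID -> block IDs this node stores for that pool
    pools: dict = field(default_factory=dict)

    def store(self, pool_id, block_id):
        self.pools.setdefault(pool_id, set()).add(block_id)

@dataclass
class NameNode:
    namespace: str
    pool_id: str
    # block ID -> name of the DataNode holding it (illustrative bookkeeping)
    block_map: dict = field(default_factory=dict)

    def write_block(self, block_id, datanode):
        datanode.store(self.pool_id, block_id)
        self.block_map[block_id] = datanode.name

dn1, dn2 = DataNode("dn1"), DataNode("dn2")
nn_a = NameNode("/users", "pool-A")
nn_b = NameNode("/logs", "pool-B")

nn_a.write_block("blk_1", dn1)
nn_a.write_block("blk_2", dn2)
nn_b.write_block("blk_1", dn1)  # same block ID, different pool: kept apart

print(dn1.pools)
```

Each NameNode only tracks the blocks in its own pool, while a single DataNode ends up holding blocks from several pools side by side.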


HDFS in Hadoop 2
High Availability:
Redundant NameNodes

Heterogeneous Storage and Archival Storage:
ARCHIVE, DISK, SSD, RAM_DISK


Federation: Block Pools

Recall that in the original design you have one namespace and a bunch of
DataNodes. With federation, the structure looks similar.

You have a bunch of NameNodes instead of one NameNode. Each of those
NameNodes writes into its own block pool, but the pools are spread out over the
DataNodes just like before; this is where the data lives. The details of the
individual DataNodes can be glossed over: the block pool is the main thing that
is different.
HDFS Performance Measures

Determine the number of blocks for a given file size.

Key HDFS and system components that are affected
by the block size.

Impact of using a lot of small files on HDFS and the
system.


Recall: HDFS Architecture

Distributed data on local disks on several nodes

[Figure: data blocks B1, B2, …, Bn stored on Node 1, Node 2, …, Node n]


HDFS Block Size

Default block size is 64 megabytes.

Good for large files!

So a 10 GB file will be broken into 10 x 1024 / 64 = 160 blocks.
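
The block-count arithmetic can be written as a small Python helper. This is a sketch: the function name num_blocks is hypothetical, and it assumes the last block of a file may be only partly filled, so the division rounds up.

```python
MB = 1024 * 1024
GB = 1024 * MB

def num_blocks(file_size_bytes, block_size_bytes=64 * MB):
    """Number of blocks needed to hold a file of the given size."""
    if file_size_bytes <= 0:
        return 0
    # Integer ceiling division: a partly filled last block still counts.
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

print(num_blocks(10 * GB))            # 160
print(num_blocks(10 * GB, 128 * MB))  # 80
print(num_blocks(100 * MB))           # 2
```

Doubling the block size halves the number of blocks for the same file, which is why block size feeds directly into NameNode memory use and the number of map tasks.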

[Figure: data blocks B1, B2, …, Bn stored on Node 1, Node 2, …, Node n]


Importance of No. of Blocks in a file
NameNode memory usage: Every file can consist of a lot of blocks, as
we saw in the previous case (160 blocks). If you have millions of
files, that is millions of objects. Each object uses a bit of memory
on the NameNode, so memory usage is a direct effect of the number
of blocks. With replication, the NameNode also has to track 3 times
as many block replicas.

Number of map tasks: The number of maps typically depends on the
number of blocks being processed.


Large No. of small files: Impact on Name node

Memory usage: Typically, the usage is around 150 bytes per
object. A billion small files means roughly two billion objects
(one file object plus one block object per file), which is about
300 GB of NameNode memory.

Network load: The number of checks with DataNodes is proportional
to the number of blocks.
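
The memory estimate can be reproduced with a short Python calculation. This is a back-of-the-envelope sketch: the function name is hypothetical, and it assumes each small file contributes one file object plus one object per block, at roughly 150 bytes per object.

```python
BYTES_PER_OBJECT = 150  # rough per-object figure quoted in this lecture

def namenode_memory_bytes(num_files, blocks_per_file=1):
    """Estimate NameNode memory for files that each span a few blocks."""
    # Assumption: one object for the file itself plus one per block.
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

estimate = namenode_memory_bytes(1_000_000_000)
print(estimate / 10**9)  # 300.0 (GB, decimal)
```

The same data stored as fewer, larger files needs far fewer file objects, which is the main reason small files are expensive for the NameNode.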


Large No. of small files: Performance Impact

Number of map tasks: Suppose we have 10 GB of data to
process, stored entirely as 32 KB files. We end up with
10 x 1024 x 1024 / 32 = 327,680 map tasks.

This creates a huge list of queued tasks.

Each map task also incurs latency when it spins up and spins down,
because a Java process is started and stopped each time.

Inefficient disk I/O with small sizes.
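
The map task count can be checked with a few lines of Python. This sketch assumes the simple case where every file yields at least one map task and larger files yield roughly one task per block; the function name is hypothetical.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

def ceil_div(a, b):
    return (a + b - 1) // b

def map_tasks(total_bytes, file_bytes, block_bytes=64 * MB):
    """Map tasks for `total_bytes` of data stored as equal-sized files."""
    num_files = ceil_div(total_bytes, file_bytes)
    # Assumption: at least one task per file, about one task per block.
    tasks_per_file = max(1, ceil_div(file_bytes, block_bytes))
    return num_files * tasks_per_file

print(map_tasks(10 * GB, 32 * KB))  # 327680
print(map_tasks(10 * GB, 64 * MB))  # 160
```

The same 10 GB stored as block-sized files needs 160 tasks instead of 327,680, a difference of more than three orders of magnitude in task startup overhead.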


HDFS optimized for large files

Lots of small files is bad!

Solution:

Merge/Concatenate files
Sequence files
HBase, Hive configuration
CombineFileInputFormat
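
The idea behind combining small files can be sketched as a greedy grouping in Python. This is an illustration of the general technique only, not the algorithm used by CombineFileInputFormat (which also considers node and rack locality); the function name and the file names are hypothetical.

```python
KB = 1024

def combine_into_splits(files, max_split_bytes):
    """Greedily pack (name, size) pairs into splits of bounded size."""
    splits, current, current_size = [], [], 0
    for name, size in files:
        # Start a new split when adding this file would exceed the limit.
        if current and current_size + size > max_split_bytes:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

small_files = [(f"part-{i}", 32 * KB) for i in range(10)]
splits = combine_into_splits(small_files, max_split_bytes=128 * KB)
print(len(splits))  # 3
```

Ten 32 KB files would otherwise need ten map tasks; packed into splits of at most 128 KB they need three.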


Read/Write Processes in HDFS


Read Process in HDFS


Write Process in HDFS


HDFS Tuning Parameters


Overview

Tuning parameters

Specifically DFS Block size

NameNode, DataNode system/dfs parameters.


HDFS XML configuration files

Tuning is typically done in the HDFS XML configuration files,
for example, in hdfs-site.xml.

This is mostly a job for system administrators of Hadoop clusters,
but it's good to know which changes affect performance. If you're
trying things out on your own, there are some important parameters
to keep in mind.

Commercial vendors provide GUI-based management consoles.


HDFS Block Size

Recall: the block size affects how much NameNode memory is used
and the number of map tasks that are created, and therefore also
affects performance.

Default 64 megabytes: Typically bumped up to 128 megabytes (the
default in Hadoop 2 and later) and can be changed based on workloads.

The parameter that controls this is dfs.blocksize (dfs.block.size in
older releases).
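
As a sketch of what such a setting looks like in hdfs-site.xml, the Python snippet below renders a configuration fragment. It assumes the usual Hadoop layout of a configuration element containing property entries with name and value children; the helper name hdfs_site_xml is hypothetical, and ET.indent needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

def hdfs_site_xml(properties):
    """Render a dict of settings as an hdfs-site.xml style fragment."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    ET.indent(root)  # pretty-print with nested indentation
    return ET.tostring(root, encoding="unicode")

# 128 MB expressed in bytes
fragment = hdfs_site_xml({"dfs.blocksize": 128 * 1024 * 1024})
print(fragment)
```

The block size is written out in bytes here, so 128 megabytes appears as 134217728.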


HDFS Replication

Default replication is 3.
Parameter: dfs.replication

Tradeoffs:

Lower it to reduce replication cost, but the data is less robust.
Higher replication can make data local to more workers.
Lower replication ➔ more free space.
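
The space side of this tradeoff is simple arithmetic, sketched below in Python. The function names are hypothetical, and the calculation assumes every block is stored exactly `replication` times.

```python
TB = 1024 ** 4

def raw_storage_bytes(logical_bytes, replication=3):
    """Disk space consumed across the cluster for the given data size."""
    return logical_bytes * replication

def copies_that_can_be_lost(replication):
    """How many copies of a block can fail before the data is gone."""
    return max(0, replication - 1)

print(raw_storage_bytes(100 * TB, 3) // TB)  # 300
print(raw_storage_bytes(100 * TB, 2) // TB)  # 200
print(copies_that_can_be_lost(3))            # 2
print(copies_that_can_be_lost(1))            # 0
```

Dropping from 3 replicas to 2 frees a third of the raw space but leaves only one spare copy of each block.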


Lots of other parameters

Various tunables for the DataNode and NameNode.

Examples:
dfs.datanode.handler.count (10): Sets the number of server
threads on each DataNode.
dfs.namenode.fs-limits.max-blocks-per-file: Maximum number
of blocks per file.

Full List:

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml


HDFS Performance and Robustness


Common Failures

DataNode Failures: A server can fail, a disk can crash, or data can
become corrupted. Corruption is sometimes caused by network or disk
issues. All of these lead to a failure in the DataNode aspect of HDFS.

Network Failures: The network can go down between part of the
cluster and the NameNode, which can affect a lot of DataNodes
at the same time.

NameNode Failures: There can be a disk failure on the NameNode
itself, or the NameNode process itself could become corrupted.


HDFS Robustness

NameNode receives heartbeats and block reports from DataNodes.


Mitigation of common failures

Periodic heartbeat: from DataNode to NameNode.

DataNodes without a recent heartbeat:

The NameNode marks the DataNode as failed, and no new I/O is sent to
that DataNode. Also remember that the NameNode has the replication
information for all the files on the file system. So, if a DataNode fails,
it knows which blocks fall below their replication factor.

The replication factor is set for the entire system, and you can also
set it for a particular file when you're writing the file. Either way, the
NameNode knows which blocks fall below the replication factor, and it
starts the process to re-replicate them.
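
The heartbeat check and the search for under-replicated blocks can be sketched in Python. This is a simplified model rather than NameNode code: the function names, the 30-second timeout, the timestamps, and the DataNode and block names are all assumptions made for illustration.

```python
def dead_datanodes(last_heartbeat, now, timeout_s):
    """DataNodes whose last heartbeat is older than the timeout."""
    return {dn for dn, seen in last_heartbeat.items() if now - seen > timeout_s}

def under_replicated(block_locations, dead, replication=3):
    """Map each under-replicated block to its number of missing replicas."""
    missing = {}
    for block, holders in block_locations.items():
        live = [dn for dn in holders if dn not in dead]
        if len(live) < replication:
            missing[block] = replication - len(live)
    return missing

# Seconds since some arbitrary start; dn3 has been silent for a while.
last_heartbeat = {"dn1": 100.0, "dn2": 97.0, "dn3": 40.0, "dn4": 99.0}
block_locations = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn1", "dn2", "dn4"],
}

dead = dead_datanodes(last_heartbeat, now=101.0, timeout_s=30.0)
print(dead)                                     # {'dn3'}
print(under_replicated(block_locations, dead))  # {'blk_1': 1}
```

Only blk_1 had a copy on the silent node, so it is the only block that needs one new replica on a live DataNode.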


Mitigation of common failures

Checksum computed on file creation.

Checksums stored in HDFS namespace.

Used to check retrieved data.

Re-read from alternate replica
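
The checksum flow can be sketched in Python as follows. This is a minimal illustration: it uses zlib.crc32 as a stand-in checksum and computes one checksum over the whole payload, and the function name read_verified is hypothetical.

```python
import zlib

def read_verified(replicas, expected_checksum):
    """Return the first replica whose checksum matches the stored one."""
    for data in replicas:
        if zlib.crc32(data) == expected_checksum:
            return data
        # Checksum mismatch: fall through and try the next replica.
    raise OSError("all replicas failed checksum verification")

original = b"hello hdfs block"
stored_checksum = zlib.crc32(original)  # computed when the file is created

corrupted = b"hellO hdfs block"         # one flipped character
data = read_verified([corrupted, original], stored_checksum)
print(data == original)  # True
```

With no alternate replica to fall back on, the same read would raise an error instead of returning data.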


Mitigation of common failures

Multiple copies of central meta data structures.

Failover to a standby NameNode: manual by default.


Performance

Changing the block size and replication factor can improve
performance.

Example: Distributed copy

Hadoop distcp allows parallel transfer of files.


Replication trade off with respect to robustness

One performance tradeoff: when you run MapReduce jobs, having
replicas gives additional locality possibilities. The big tradeoff,
however, is robustness. Suppose we configure no replicas. If we
lose a node or a local disk, we can't recover, because there is
no replication.

Similarly, with data corruption, if you get a bad checksum,
you can't recover because you don't have a replica.

Changes to other parameters can have similar effects.

Conclusion

In this lecture, we have discussed the design goals of HDFS,
the read/write process in HDFS, and the main configuration
tuning parameters that control HDFS performance and
robustness.
