Data Management
Scale-up 
•To understand the popularity of distributed systems (scale-out) vis-à-vis huge monolithic servers (scale-up), consider the price performance of current I/O technology. 
•A high-end machine with four I/O channels, each with a throughput of 100 MB/sec, will need about three hours to read a 4 TB data set (a quick calculation follows this list). 
•With Hadoop, this same data set is divided into smaller (typically 64 MB) blocks that are spread among many machines in the cluster via the Hadoop Distributed File System (HDFS). 
•With a modest degree of replication, the cluster machines can read the data set in parallel and provide a much higher throughput. 
•And such a cluster of commodity machines turns out to be cheaper than one high-end server!
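A quick back-of-the-envelope check of the numbers above, as a minimal sketch: the 4 TB data set and 100 MB/sec channel speed come from the slide, while the 100-node cluster size is an illustrative assumption. 
public class ReadTimeEstimate { 
public static void main(String[] args) { 
double dataSetMB = 4.0 * 1024 * 1024; // 4 TB expressed in MB 
double serverMBps = 4 * 100; // four I/O channels at 100 MB/sec each 
System.out.printf("Single server: %.1f hours%n", dataSetMB / serverMBps / 3600); // ~2.9 hours 
int nodes = 100; // hypothetical commodity cluster size 
double nodeMBps = 100; // one 100 MB/sec channel per node 
System.out.printf("%d-node cluster: %.1f minutes%n", nodes, dataSetMB / (nodes * nodeMBps) / 60); // ~7 minutes 
} 
}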
Hadoop focuses on moving code to data 
•The clients send only the MapReduce programs to be executed, and these programs are usually small (often in kilobytes). 
•More importantly, the move-code-to-data philosophy applies within the Hadoop cluster itself. 
•Data is broken up and distributed across the cluster, and as much as possible, computation on a piece of data takes place on the same machine where that piece of data resides. 
•The programs to run (“code”) are orders of magnitude smaller than the data and are easier to move around. 
•Also, it takes more time to move data across a network than to apply the computation to it.
HDFS 
•HDFS is the file system component of Hadoop. 
•Interface to HDFS is patterned after the UNIX file system 
•Faithfulness to standards was sacrificed in favor of improved performance for the applications at hand 
•HDFS stores file system metadata and application data separately 
•“HDFS is a file-system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware”1 
1 “The Hadoop Distributed File System” by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (Proceedings of MSST 2010, May 2010, http://storageconference.org/2010/Papers/MSST/Shvachko.pdf)
Key properties of HDFS 
•Very Large 
–“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. 
–There are Hadoop clusters running today that store petabytes of data. 
•Streaming data 
–write-once, read-many-times pattern 
–the time to read the whole dataset is more important than the latency in reading the first record 
•Commodity hardware 
–HDFS is designed to carry on working, without noticeable interruption to the user, in the face of the hardware failures that are common with commodity machines
Not a good fit for 
•Low-latency data access 
–HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. 
–HBase is currently a better choice for low-latency access. 
•Lots of small files 
–Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. 
–As a rule of thumb, each file, directory, and block takes about 150 bytes (a rough sizing sketch follows this list). 
–While storing millions of files is feasible, billions is beyond the capability of current hardware. 
•Multiple writers, arbitrary file modifications 
–Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. 
–There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
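A rough sizing sketch of the 150-bytes-per-object rule of thumb mentioned above; the file count and the assumption of one block per file are illustrative, not measured values. 
public class NamenodeMemoryEstimate { 
public static void main(String[] args) { 
long files = 100_000_000L; // hypothetical: 100 million files 
long blocks = files; // assume each file occupies exactly one block 
long bytesPerObject = 150; // rule-of-thumb namenode cost per file/directory/block 
double gb = (files + blocks) * bytesPerObject / (1024.0 * 1024 * 1024); 
System.out.printf("~%.0f GB of namenode heap for %,d files%n", gb, files); // ~28 GB 
} 
}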
Namenode and Datanode 
Master/slave architecture 
An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients. 
There are a number of DataNodes, usually one per node in the cluster. 
The DataNodes manage the storage attached to the nodes that they run on. 
HDFS exposes a file system namespace and allows user data to be stored in files. 
A file is split into one or more blocks, and the set of blocks is stored in DataNodes. 
DataNodes serve read and write requests and perform block creation, deletion, and replication upon instruction from the Namenode.
Web Interface 
•NameNode and DataNode each run an internal web server in order to display basic information about the current status of the cluster. 
•With the default configuration, the NameNode front page is at http://namenode-name:50070/. 
•It lists the DataNodes in the cluster and basic statistics of the cluster. 
•The web interface can also be used to browse the file system (using the "Browse the file system" link on the NameNode front page); a small fetch sketch follows.
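As a minimal sketch, the status page can also be fetched programmatically with plain Java HTTP; the host name namenode-name is a placeholder and 50070 is the default port mentioned above, so substitute your cluster's address. 
import java.io.BufferedReader; 
import java.io.InputStreamReader; 
import java.net.HttpURLConnection; 
import java.net.URL; 
public class NameNodeStatusPage { 
public static void main(String[] args) throws Exception { 
// Default NameNode web address; replace with your cluster's host and port. 
URL url = new URL("http://namenode-name:50070/"); 
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); 
try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) { 
String line; 
while ((line = in.readLine()) != null) { 
System.out.println(line); // raw HTML of the cluster status page 
} 
} 
} 
}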
HDFS architecture 
[Architecture diagram: a Client issues metadata ops (read/write requests) to the Namenode, which holds the metadata (name, replicas, ...; e.g. /home/foo/data, 6, ...); Clients read and write blocks directly to Datanodes, which handle block ops and replication across Rack 1 and Rack 2.]
Namenode 
Keeps an image of the entire file system namespace and the file-to-block map (Blockmap) in memory. 
4 GB of local RAM is sufficient to support these data structures, even for a huge number of files and directories. 
When the Namenode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the FsImage, and then stores a copy of the FsImage back on the filesystem as a checkpoint. 
Periodic checkpointing is done so that the system can recover to the last checkpointed state in case of a crash.
Datanode 
A Datanode stores data in files in its local file system. 
The Datanode has no knowledge of the HDFS filesystem as a whole. 
It stores each block of HDFS data in a separate file. 
The Datanode does not create all files in the same directory. 
It uses heuristics to determine the optimal number of files per directory and creates directories appropriately. 
When the filesystem starts up, it generates a list of all HDFS blocks and sends this report to the Namenode: the Blockreport.
HDFS 
[Diagram: an Application on the master node accesses HDFS through the HDFS Client, which talks to the Name Node and HDFS Server; the local file system uses small blocks (~2K) while HDFS uses large (128M), replicated blocks.]
HDFS: Module view
HDFS: Modules 
•Protocol: The protocol package is used in communication between the client and the namenode and datanode. It describes the messages used between these servers. 
•Security: security is used in authenticating access to the files. The security is based on token-based authentication, where the namenode server controls the distribution of access tokens. 
•server.protocol: server.protocol defines the communication between namenode and datanode, and between namenode and balancer. 
•server.common: server.common contains utilities that are used by the namenode, datanode and balancer. Examples are classes containing server-wide constants, utilities, and other logic that is shared among the servers. 
•Client: The client contains the logic to access the file system from a user’s computer. It interfaces with the datanode and namenode servers using the protocol module. In the diagram this module spans two layers. This is because the client module also contains some logic that is shared system wide. 
•Datanode: The datanode is responsible for storing the actual blocks of filesystem data. It receives instructions on which blocks to store from the namenode. It also services the client directly to stream file block contents. 
•Namenode: The namenode is responsible for authorizing the user, storing a mapping from filenames to data blocks, and it knows which blocks of data are stored where. 
•Balancer: The balancer is a separate server that tells the namenode to move data blocks between datanodes when the load is not evenly balanced among datanodes. 
•Tools: The tools package can be used to administer the filesystem, and also contains debugging code.
File system 
•Hierarchical file system with directories and files 
•Create, remove, move, rename etc. 
•Namenode maintains the file system 
•Any metadata changes to the file system are recorded by the Namenode. 
•An application can specify the number of replicas of a file that are needed: the replication factor of the file (a sketch of setting it through the Java API follows this list). 
•This information is stored in the Namenode.
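A minimal sketch of adjusting the replication factor from client code; the path and factor are placeholders, and FileSystem.setReplication applies to files that already exist. 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
public class SetReplication { 
public static void main(String[] args) throws Exception { 
Configuration conf = new Configuration(); 
FileSystem fs = FileSystem.get(conf); 
Path file = new Path("/home/foo/data"); // placeholder path 
// Ask the Namenode to keep 6 replicas of each block of this file. 
boolean accepted = fs.setReplication(file, (short) 6); 
System.out.println("Replication change accepted: " + accepted); 
} 
}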
Metadata 
•The HDFS namespace is stored by the Namenode. 
•The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata. 
–For example, creating a new file. 
–Changing the replication factor of a file. 
–The EditLog is stored in the Namenode’s local filesystem. 
•The entire filesystem namespace, including the mapping of blocks to files and file system properties, is stored in a file called FsImage. 
•It is also stored in the Namenode’s local filesystem.
Application code <-> Client 
•HDFS provides a Java API for applications to use. 
•Fundamentally, the application uses the standard java.io interface. 
•A C language wrapper for this Java API is also available. 
•The client and the application code are bound into the same address space.
Client
Java Interface 
•One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. 
•The general idiom is: 
InputStream in = null; 
try { 
in = new URL("hdfs://host/path").openStream(); 
// process in 
} finally { 
IOUtils.closeStream(in); 
} 
•There’s a little bit more work required to make Java recognize Hadoop’s hdfs URL scheme. 
•This is achieved by calling the setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory.
Example: Displaying files from a Hadoop filesystem on standard output 
import java.io.InputStream; 
import java.net.URL; 
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory; 
import org.apache.hadoop.io.IOUtils; 
public class URLCat { 
static { 
// Register Hadoop's handler so java.net.URL understands hdfs:// URLs (can only be set once per JVM). 
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); 
} 
public static void main(String[] args) throws Exception { 
InputStream in = null; 
try { 
in = new URL(args[0]).openStream(); 
// Copy the stream to stdout in 4 KB buffers; do not close System.out. 
IOUtils.copyBytes(in, System.out, 4096, false); 
} finally { 
IOUtils.closeStream(in); 
} 
} 
}
Reading Data Using the FileSystem API 
•A file in a Hadoop filesystem is represented by a Hadoop Path object (and not a java.io.File object). 
•There are several static factory methods for getting a FileSystem instance: 
–public static FileSystem get(Configuration conf) throws IOException 
–public static FileSystem get(URI uri, Configuration conf) throws IOException 
–public static FileSystem get(URI uri, Configuration conf, String user) throws IOException 
•A Configuration object encapsulates a client or server’s configuration, which is set using configuration files read from the classpath, such as conf/core-site.xml. 
•With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file: 
–public FSDataInputStream open(Path f) throws IOException 
–public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
Example: Displaying files with the FileSystem API 
import java.io.InputStream; 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.IOUtils; 
public class FileSystemCat { 
public static void main(String[] args) throws Exception { 
String uri = args[0]; 
Configuration conf = new Configuration(); 
// Obtain the FileSystem for the URI's scheme and authority (e.g. hdfs://host). 
FileSystem fs = FileSystem.get(URI.create(uri), conf); 
InputStream in = null; 
try { 
in = fs.open(new Path(uri)); 
IOUtils.copyBytes(in, System.out, 4096, false); 
} finally { 
IOUtils.closeStream(in); 
} 
} 
}
FSDataInputStream 
•The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. 
•This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream (a seek sketch follows the interface listing below). 
package org.apache.hadoop.fs; 
public class FSDataInputStream extends DataInputStream 
implements Seekable, PositionedReadable { 
// implementation 
} 
public interface Seekable { 
void seek(long pos) throws IOException; 
long getPos() throws IOException; 
} 
public interface PositionedReadable { 
public int read(long position, byte[] buffer, int offset, int length) throws IOException; 
public void readFully(long position, byte[] buffer, int offset, int length) throws IOException; 
public void readFully(long position, byte[] buffer) throws IOException; 
}
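A minimal sketch of that random access: the file named on the command line is written to standard output twice, using seek() to jump back to the start after the first pass (same argument convention as the earlier examples; the class name is illustrative). 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FSDataInputStream; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.IOUtils; 
public class FileSystemDoubleCat { 
public static void main(String[] args) throws Exception { 
String uri = args[0]; 
Configuration conf = new Configuration(); 
FileSystem fs = FileSystem.get(URI.create(uri), conf); 
FSDataInputStream in = null; 
try { 
in = fs.open(new Path(uri)); 
IOUtils.copyBytes(in, System.out, 4096, false); 
in.seek(0); // go back to the start of the file 
IOUtils.copyBytes(in, System.out, 4096, false); 
} finally { 
IOUtils.closeStream(in); 
} 
} 
}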
FSDataOutputStream 
public FSDataOutputStream create(Path f) throws IOException 
package org.apache.hadoop.util; 
public interface Progressable { 
public void progress(); 
} 
public FSDataOutputStream append(Path f) throws IOException
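A minimal sketch of the append() call above, adding a line to a file that already exists; the path is a placeholder, and append support on the cluster is an assumption here. 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FSDataOutputStream; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
public class AppendToFile { 
public static void main(String[] args) throws Exception { 
Configuration conf = new Configuration(); 
FileSystem fs = FileSystem.get(conf); 
Path file = new Path("/logs/app.log"); // placeholder: must already exist in HDFS 
FSDataOutputStream out = fs.append(file); 
out.writeBytes("one more line\n"); // writes always go at the end of the file 
out.close(); 
} 
}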
Example: Copying a local file to a Hadoop filesystem 
import java.io.BufferedInputStream; 
import java.io.FileInputStream; 
import java.io.InputStream; 
import java.io.OutputStream; 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.IOUtils; 
import org.apache.hadoop.util.Progressable; 
public class FileCopyWithProgress { 
public static void main(String[] args) throws Exception { 
String localSrc = args[0]; 
String dst = args[1]; 
InputStream in = new BufferedInputStream(new FileInputStream(localSrc)); 
Configuration conf = new Configuration(); 
FileSystem fs = FileSystem.get(URI.create(dst), conf); 
// The Progressable callback prints a dot each time Hadoop reports write progress. 
OutputStream out = fs.create(new Path(dst), new Progressable() { 
public void progress() { 
System.out.print("."); 
} 
}); 
// Copy and close both streams when done (last argument = true). 
IOUtils.copyBytes(in, out, 4096, true); 
} 
}
File-Based Data Structures 
•For some applications, you need a specialized data structure to hold your data. 
•For doing MapReduce-based processing, putting each blob of binary data into its own file doesn’t scale, so Hadoop developed a number of higher-level containers for these situations. 
•Imagine a logfile, where each log record is a new line of text. 
•If you want to log binary types, plain text isn’t a suitable format. 
•Hadoop’s SequenceFile class fits the bill in this situation, providing a persistent data structure for binary key-value pairs.
SequenceFile 
•SequenceFile is a flat file consisting of binary key/value pairs. 
•It is extensively used in MapReduce as an input/output format. 
•Internally, the temporary outputs of maps are stored using SequenceFile. 
•SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively. 
•There are 3 different SequenceFile formats: 
–Uncompressed key/value records. 
–Record-compressed key/value records - only 'values' are compressed here. 
–Block-compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable. 
•The SequenceFile.Reader acts as a bridge and can read any of the above SequenceFile formats.
Using SequenceFile 
•To use it as a logfile format, you would choose a key, such as a timestamp represented by a LongWritable, and the value is a Writable that represents the quantity being logged. 
•To create a SequenceFile, use one of its createWriter() static methods, which returns a SequenceFile.Writer instance. 
•Once you have a SequenceFile.Writer, you then write key-value pairs, using the append() method. 
•Then when you’ve finished, you call the close() method. 
•Reading sequence files from beginning to end is a matter of creating an instance of SequenceFile.Reader and iterating over records by repeatedly invoking one of the next() methods (a minimal write/read sketch follows this list).
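A minimal sketch of that write/read cycle, logging timestamp/message pairs; the output path and record contents are placeholders, and the older createWriter/Reader signatures that take a FileSystem are used here. 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.SequenceFile; 
import org.apache.hadoop.io.Text; 
public class SequenceFileLog { 
public static void main(String[] args) throws Exception { 
String uri = "/tmp/events.seq"; // placeholder output path 
Configuration conf = new Configuration(); 
FileSystem fs = FileSystem.get(URI.create(uri), conf); 
Path path = new Path(uri); 
// Write: key = timestamp, value = the quantity being logged. 
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, path, LongWritable.class, Text.class); 
try { 
writer.append(new LongWritable(System.currentTimeMillis()), new Text("event one")); 
writer.append(new LongWritable(System.currentTimeMillis()), new Text("event two")); 
} finally { 
writer.close(); 
} 
// Read: iterate with next() until it returns false. 
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf); 
try { 
LongWritable key = new LongWritable(); 
Text value = new Text(); 
while (reader.next(key, value)) { 
System.out.println(key + "\t" + value); 
} 
} finally { 
reader.close(); 
} 
} 
}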
Internals of a sequence file 
•A sequence file consists of a header followed by one or more records. 
•The header contains fields including the names of the key and value classes, compression details, user-defined metadata, and the sync marker. 
•A MapFile is a sorted SequenceFile with an index to permit lookups by key.
Compression 
•Hadoop allows users to compress output data, intermediate data, or both. 
•Hadoop checks whether input data is in a compressed format and decompresses the data as needed. 
•Compression codec: 
–two lossless codecs. 
–The default codec is gzip, a combination of the Lempel-Ziv 1977 (LZ77) algorithm and Huffman encoding. 
–The other codec implements the Lempel-Ziv-Oberhumer (LZO) algorithm, a variant of LZ77 optimized for decompression speed. 
•Compression unit: 
–Hadoop allows both per-record and per-block compression. 
–Thus, the record or block size affects the compressibility of the data (a job-configuration sketch follows this list).
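A minimal sketch of turning on block-compressed output for a MapReduce job; the choice of GzipCodec and SequenceFile output here is illustrative, not a default. 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.io.SequenceFile; 
import org.apache.hadoop.io.compress.GzipCodec; 
import org.apache.hadoop.mapreduce.Job; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat; 
public class CompressedOutputJob { 
public static void main(String[] args) throws Exception { 
Configuration conf = new Configuration(); 
Job job = Job.getInstance(conf, "compressed-output"); 
// Compress the job's final output with the gzip codec. 
FileOutputFormat.setCompressOutput(job, true); 
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); 
// For SequenceFile output, compress per block rather than per record. 
SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK); 
// ... set input/output paths, mapper, and reducer as usual, then submit with job.waitForCompletion(true) 
} 
}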
When to use compression? 
•Compression adds a read-time penalty, so why would one enable it at all? 
•There are a few reasons why the advantages of compression can outweigh the disadvantages: 
–Compression reduces the number of bytes written to/read from HDFS 
–Compression effectively improves the efficiency of network bandwidth and disk space 
–Compression reduces the size of data needed to be read when issuing a read 
•To keep that friction as low as possible, a real-time compression library is preferred. 
•To achieve maximal performance and benefit, you must enable LZO. 
•What about parallelism?
compression and Hadoop 
•Storing compressed data in HDFS allows your hardware allocation to go further since compressed data is often 25% of the size of the original data. 
•Furthermore, since MapReduce jobs are nearly always IO-bound, storing compressed data means there is less overall IO to do, meaning jobs run faster. 
•There are two caveats to this, however: 
–some compression formats cannot be split for parallel processing, and 
–others are slow enough at decompression that jobs become CPU-bound, eliminating your gains on IO.
gzip compression on Hadoop 
•The gzip compression format illustrates the first caveat; to understand why, we need to go back to how Hadoop’s input splits work. 
•Imagine you have a 1.1 GB gzip file, and your cluster has a 128 MB block size. 
•This file will be split into 9 chunks of approximately 128 MB each. 
•In order to process these in parallel in a MapReduce job, a different mapper will be responsible for each chunk. 
•But this means that the second mapper will start on an arbitrary byte about 128 MB into the file. 
•The dictionary that gzip builds from the preceding context to decompress its input will be empty at this point, which means the gzip decompressor will not be able to correctly interpret the bytes. 
•The upshot is that large gzip files in Hadoop need to be processed by a single mapper, which defeats the purpose of parallelism.
Bzip2 compression on Hadoop 
•For an example of the second caveat in which jobs become CPU-bound, we can look to the bzip2 compression format. 
•Bzip2 files compress well and are even splittable, but the decompression algorithm is slow and cannot keep up with the streaming disk reads that are common in Hadoop jobs. 
•While bzip2 compression has some upside because it conserves storage space, running jobs now spend their time waiting on the CPU to finish decompressing data, which slows them down and offsets the other gains.
LZO and ElephantBird 
•How can we split large compressed data and process it in parallel on Hadoop? 
•One of the biggest drawbacks of compression algorithms like gzip is that you can’t split their output across multiple mappers. 
•This is where LZO comes in. 
•Using LZO compression in Hadoop allows for 
–reduced data size and 
–shorter disk read times 
•LZO’s block-based structure allows it to be split into chunks for parallel processing in Hadoop. 
•Taken together, these characteristics make LZO an excellent compression format to use in your cluster. 
•Elephant Bird is Twitter's open source library of LZO-, Thrift-, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, Hive SerDes, HBase miscellanea, etc. 
•More: 
•https://github.com/twitter/hadoop-lzo 
•https://github.com/kevinweil/elephant-bird 
•http://code.google.com/p/protobuf/ (IDL)
End of session 
Day 1: Data Management