Data Management
Scale-up
•To understand the popularity of distributed systems (scale-out) vis-à-vis huge monolithic servers (scale-up), consider the price/performance of current I/O technology.
•A high-end machine with four I/O channels, each with a throughput of 100 MB/sec, will require roughly three hours to read a 4 TB data set (4 TB ÷ 400 MB/sec ≈ 10,000 seconds ≈ 2.8 hours).
•With Hadoop, this same data set is divided into smaller blocks (typically 64 MB) that are spread among many machines in the cluster via the Hadoop Distributed File System (HDFS).
•With a modest degree of replication, the cluster machines can read the data set in parallel and provide much higher throughput.
•And such a cluster of commodity machines turns out to be cheaper than one high-end server!
Hadoop focuses on moving code to data
•The clients send only the MapReduce programs to be executed, and these programs are usually small (often in kilobytes). 
•More importantly, the move-code-to-data philosophy applies within the Hadoop cluster itself. 
•Data is broken up and distributed across the cluster, and as much as possible, computation on a piece of data takes place on the same machine where that piece of data resides. 
•The programs to run (“code”) are orders of magnitude smaller than the data and are easier to move around. 
•Also, it takes more time to move data across a network than to apply the computation to it.
HDFS 
•HDFS is the file system component of Hadoop. 
•Interface to HDFS is patterned after the UNIX file system 
•Faithfulness to standards was sacrificed in favor of improved performance for the applications at hand 
•HDFS stores file system metadata and application data separately 
•“HDFS is a file-system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware”1 
1 “The Hadoop Distributed File System” by Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler (Proceedings of MSST 2010, May 2010, http://storageconference.org/2010/Papers/MSST/Shvachko.pdf)
Key properties of HDFS 
•Very Large 
–“Very large” in this context means files that are hundreds of megabytes, gigabytes, or terabytes in size. 
–There are Hadoop clusters running today that store petabytes of data. 
•Streaming data 
–write-once, read-many-times pattern 
–the time to read the whole dataset is more important than the latency in reading the first record 
•Commodity hardware 
–HDFS is designed to run on clusters of commodity hardware, where node failure is common; it carries on working without noticeable interruption to the user in the face of such failures
Not a good fit for 
•Low-latency data access 
–HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. 
–HBase is currently a better choice for low-latency access. 
•Lots of small files 
–Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. 
–As a rule of thumb, each file, directory, and block takes about 150 bytes; for example, one million files each occupying one block amount to roughly two million objects, or about 300 MB of namenode memory. 
–While storing millions of files is feasible, billions is beyond the capability of current hardware. 
•Multiple writers, arbitrary file modifications 
–Files in HDFS may be written to by a single writer. Writes are always made at the end of the file. 
–There is no support for multiple writers, or for modifications at arbitrary offsets in the file.
Namenode and Datanode 
Master/slave architecture 
An HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients. 
There are a number of DataNodes, usually one per node in the cluster. 
The DataNodes manage storage attached to the nodes that they run on. 
HDFS exposes a file system namespace and allows user data to be stored in files. 
A file is split into one or more blocks, and these blocks are stored in DataNodes. 
DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the Namenode.
Web Interface 
•NameNode and DataNode each run an internal web server in order to display basic information about the current status of the cluster. 
•With the default configuration, the NameNode front page is at http://namenode-name:50070/. 
•It lists the DataNodes in the cluster and basic statistics of the cluster. 
•The web interface can also be used to browse the file system (using the "Browse the file system" link on the NameNode front page).
HDFS architecture 
[Diagram: HDFS architecture. A single Namenode holds the metadata (file names, block lists, and replica counts, e.g. /home/foo/data and its replication factor); clients issue metadata operations to the Namenode and read/write blocks directly from/to Datanodes, which are spread across racks (Rack 1, Rack 2) and carry out block operations and replication.]
Namenode 
Keeps an image of the entire file system namespace and the file-to-block map (BlockMap) in memory. 
4 GB of local RAM is sufficient to support these data structures, which represent a huge number of files and directories. 
When the Namenode starts up, it reads the FsImage and EditLog from its local file system, applies the EditLog transactions to the in-memory FsImage, and then writes the updated FsImage back to the file system as a checkpoint. 
Periodic checkpointing is done so that the system can recover to the last checkpointed state in case of a crash.
Datanode 
A Datanode stores data in files in its local file system. 
The Datanode has no knowledge of the HDFS filesystem as a whole; 
it stores each block of HDFS data in a separate file. 
The Datanode does not create all files in the same directory. 
It uses heuristics to determine the optimal number of files per directory and creates directories appropriately. 
When the Datanode starts up, it scans its local file system, generates a list of all the HDFS blocks it holds, and sends this report to the Namenode: the Blockreport.
HDFS 
[Diagram: an application accesses HDFS through the HDFS client, which talks to the HDFS server processes (the Name Node on the master node plus the Datanodes) rather than through the local file system. The local file system uses small blocks (e.g. 2 KB), whereas HDFS uses large blocks (e.g. 128 MB) that are replicated across nodes.]
HDFS: Module view 
[Diagram: layered view of the HDFS code modules (protocol, security, server.protocol, server.common, client, datanode, namenode, balancer, tools); the modules are described on the next slide.]
HDFS: Modules 
•Protocol: The protocol package is used in communication between the client and the namenode and datanodes. It describes the messages used between these servers. 
•Security: The security package is used in authenticating access to the files. Security is based on token-based authentication, where the namenode server controls the distribution of access tokens. 
•server.protocol: server.protocol defines the communication between namenode and datanode, and between namenode and balancer. 
•server.common: server.common contains utilities that are used by the namenode, datanode, and balancer. Examples are classes containing server-wide constants, utilities, and other logic that is shared among the servers. 
•Client: The client contains the logic to access the file system from a user’s computer. It interfaces with the datanode and namenode servers using the protocol module. In the diagram this module spans two layers, because the client module also contains some logic that is shared system-wide. 
•Datanode: The datanode is responsible for storing the actual blocks of filesystem data. It receives instructions on which blocks to store from the namenode. It also serves clients directly to stream file block contents. 
•Namenode: The namenode is responsible for authorizing the user and storing a mapping from filenames to data blocks, and it knows which blocks of data are stored where. 
•Balancer: The balancer is a separate server that tells the namenode to move data blocks between datanodes when the load is not evenly balanced among datanodes. 
•Tools: The tools package can be used to administer the filesystem, and also contains debugging code.
File system 
•Hierarchical file system with directories and files 
•Supports create, remove, move, rename, etc. (see the sketch below) 
•The Namenode maintains the file system namespace. 
•Any metadata change to the file system is recorded by the Namenode. 
•An application can specify the number of replicas of a file it needs: the replication factor of the file. 
•This information is stored in the Namenode.
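•As an illustration of these namespace operations from the client side, here is a minimal sketch using the Hadoop FileSystem API; the class name, paths, and replication factor are hypothetical choices for the example, not part of the slides. 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 

public class NamespaceOpsDemo { 
  public static void main(String[] args) throws Exception { 
    Configuration conf = new Configuration(); 
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode-host/"), conf); 

    Path dir = new Path("/user/demo");                    // hypothetical paths for illustration 
    Path file = new Path("/user/demo/sample.txt"); 

    fs.mkdirs(dir);                                       // create a directory 
    fs.rename(file, new Path("/user/demo/renamed.txt"));  // move/rename a file (assumes it exists) 
    fs.setReplication(new Path("/user/demo/renamed.txt"), (short) 2); // change the replication factor 
    fs.delete(dir, true);                                 // remove a directory tree (recursive) 
  } 
}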
Metadata 
•The HDFS namespace is stored by the Namenode. 
•The Namenode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata. 
–For example, creating a new file. 
–Changing the replication factor of a file. 
–The EditLog is stored in the Namenode’s local filesystem. 
•The entire filesystem namespace, including the mapping of blocks to files and file system properties, is stored in a file, the FsImage. 
•It too is stored in the Namenode’s local filesystem.
Application code <-> Client 
•HDFS provides a Java API for applications to use. 
•Fundamentally, the application uses the standard java.io interface. 
•A C language wrapper for this Java API is also available. 
•The client and the application code are bound into the same address space.
Client
Java Interface 
•One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL object to open a stream to read the data from. 
•The general idiom is: 
InputStream in = null; 
try { 
  in = new URL("hdfs://host/path").openStream(); 
  // process in 
} finally { 
  IOUtils.closeStream(in); 
} 
•There’s a little bit more work required to make Java recognize Hadoop’s hdfs URL scheme. 
•This is achieved by calling the setURLStreamHandlerFactory method on URL with an instance of FsUrlStreamHandlerFactory.
Example: Displaying files from a Hadoop filesystem on standard output 
import java.io.InputStream; 
import java.net.URL; 
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory; 
import org.apache.hadoop.io.IOUtils; 

public class URLCat { 
  static { 
    // Register the hdfs:// URL scheme handler; this can only be set once per JVM. 
    URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()); 
  } 
  public static void main(String[] args) throws Exception { 
    InputStream in = null; 
    try { 
      in = new URL(args[0]).openStream(); 
      IOUtils.copyBytes(in, System.out, 4096, false); 
    } finally { 
      IOUtils.closeStream(in); 
    } 
  } 
}
Reading Data Using the FileSystem API 
•A file in a Hadoop filesystem is represented by a Hadoop Path object (and not a java.io.File object). 
•There are several static factory methods for getting a FileSystem instance: 
–public static FileSystem get(Configuration conf) throws IOException 
–public static FileSystem get(URI uri, Configuration conf) throws IOException 
–public static FileSystem get(URI uri, Configuration conf, String user) throws IOException 
•A Configuration object encapsulates a client or server’s configuration, which is set using configuration files read from the classpath, such as conf/core-site.xml. 
•With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file: 
–public FSDataInputStream open(Path f) throws IOException 
–public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
Example: Displaying files with the FileSystem API 
import java.io.InputStream; 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.IOUtils; 

public class FileSystemCat { 
  public static void main(String[] args) throws Exception { 
    String uri = args[0]; 
    Configuration conf = new Configuration(); 
    FileSystem fs = FileSystem.get(URI.create(uri), conf); 
    InputStream in = null; 
    try { 
      in = fs.open(new Path(uri)); 
      IOUtils.copyBytes(in, System.out, 4096, false); 
    } finally { 
      IOUtils.closeStream(in); 
    } 
  } 
}
FSDataInputStream 
•The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. 
•This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream (see the sketch after the declarations below). 
package org.apache.hadoop.fs; 
public class FSDataInputStream extends DataInputStream 
    implements Seekable, PositionedReadable { 
  // implementation 
} 
public interface Seekable { 
  void seek(long pos) throws IOException; 
  long getPos() throws IOException; 
} 
public interface PositionedReadable { 
  public int read(long position, byte[] buffer, int offset, int length) throws IOException; 
  public void readFully(long position, byte[] buffer, int offset, int length) throws IOException; 
  public void readFully(long position, byte[] buffer) throws IOException; 
}
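•To make the random-access support concrete, here is a minimal sketch (modeled on the common "double cat" idiom) that prints a file twice by seeking back to the start; the class name is illustrative, not part of the slides. 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FSDataInputStream; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.IOUtils; 

public class FileSystemDoubleCat { 
  public static void main(String[] args) throws Exception { 
    String uri = args[0]; 
    Configuration conf = new Configuration(); 
    FileSystem fs = FileSystem.get(URI.create(uri), conf); 
    FSDataInputStream in = null; 
    try { 
      in = fs.open(new Path(uri)); 
      IOUtils.copyBytes(in, System.out, 4096, false); 
      in.seek(0); // go back to the start of the file 
      IOUtils.copyBytes(in, System.out, 4096, false); 
    } finally { 
      IOUtils.closeStream(in); 
    } 
  } 
}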
FSDataOutputStream 
•The create() method on FileSystem returns an FSDataOutputStream for writing a new file; an overload accepts a Progressable callback so the application is notified of write progress, and append() reopens an existing file for appending. 
public FSDataOutputStream create(Path f) throws IOException 

package org.apache.hadoop.util; 
public interface Progressable { 
  public void progress(); 
} 

public FSDataOutputStream append(Path f) throws IOException
Example: Copying a local file to a Hadoop filesystem 
import java.io.BufferedInputStream; 
import java.io.FileInputStream; 
import java.io.InputStream; 
import java.io.OutputStream; 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.IOUtils; 
import org.apache.hadoop.util.Progressable; 

public class FileCopyWithProgress { 
  public static void main(String[] args) throws Exception { 
    String localSrc = args[0]; 
    String dst = args[1]; 
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc)); 
    Configuration conf = new Configuration(); 
    FileSystem fs = FileSystem.get(URI.create(dst), conf); 
    OutputStream out = fs.create(new Path(dst), new Progressable() { 
      public void progress() { 
        System.out.print("."); // print a dot each time Hadoop reports progress 
      } 
    }); 
    IOUtils.copyBytes(in, out, 4096, true); 
  } 
}
File-Based Data Structures 
•For some applications, you need a specialized data structure to hold your data. 
•For doing MapReduce-based processing, putting each blob of binary data into its own file doesn’t scale, so Hadoop developed a number of higher-level containers for these situations. 
•Imagine a logfile, where each log record is a new line of text. 
•If you want to log binary types, plain text isn’t a suitable format. 
•Hadoop’s SequenceFile class fits the bill in this situation, providing a persistent data structure for binary key-value pairs.
SequenceFile 
•SequenceFile is a flat file consisting of binary key/value pairs. 
•It is extensively used in MapReduce as an input/output format. 
•Internally, the temporary outputs of maps are stored using SequenceFile. 
•SequenceFile provides Writer, Reader and Sorter classes for writing, reading and sorting respectively. 
•There are 3 different SequenceFile formats: 
–Uncompressed key/value records. 
–Record-compressed key/value records - only 'values' are compressed here. 
–Block-compressed key/value records - both keys and values are collected in 'blocks' separately and compressed. The size of the 'block' is configurable. 
•The SequenceFile.Reader acts as a bridge and can read any of the above SequenceFile formats.
Using SequenceFile 
•To use it as a logfile format, you would choose a key, such as a timestamp represented by a LongWritable, and the value is a Writable that represents the quantity being logged. 
•To create a SequenceFile, use one of its createWriter() static methods, which returns a SequenceFile.Writer instance. 
•Once you have a SequenceFile.Writer, you write key-value pairs using the append() method. 
•Then when you’ve finished, you call the close() method. 
•Reading sequence files from beginning to end is a matter of creating an instance of SequenceFile.Reader and iterating over records by repeatedly invoking one of the next() methods (see the sketch below).
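•A minimal write-then-read sketch along those lines is shown below. It uses the older createWriter(FileSystem, Configuration, Path, keyClass, valueClass) and Reader(FileSystem, Path, Configuration) overloads (deprecated in newer Hadoop releases in favor of option-based variants), and the class name, path, and record contents are illustrative only. 
import java.net.URI; 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.IOUtils; 
import org.apache.hadoop.io.LongWritable; 
import org.apache.hadoop.io.SequenceFile; 
import org.apache.hadoop.io.Text; 

public class SequenceFileDemo { 
  public static void main(String[] args) throws Exception { 
    String uri = args[0]; // e.g. hdfs://namenode-host/user/demo/log.seq (illustrative) 
    Configuration conf = new Configuration(); 
    FileSystem fs = FileSystem.get(URI.create(uri), conf); 
    Path path = new Path(uri); 

    LongWritable key = new LongWritable(); 
    Text value = new Text(); 

    // Write: createWriter() -> append() -> close() 
    SequenceFile.Writer writer = null; 
    try { 
      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass()); 
      for (int i = 0; i < 100; i++) { 
        key.set(System.currentTimeMillis()); 
        value.set("record-" + i); 
        writer.append(key, value); 
      } 
    } finally { 
      IOUtils.closeStream(writer); 
    } 

    // Read: Reader -> next() until it returns false 
    SequenceFile.Reader reader = null; 
    try { 
      reader = new SequenceFile.Reader(fs, path, conf); 
      while (reader.next(key, value)) { 
        System.out.println(key + "\t" + value); 
      } 
    } finally { 
      IOUtils.closeStream(reader); 
    } 
  } 
}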
Internals of a sequence file 
•A sequence file consists of a header followed by one or more records. 
•The header contains fields including the names of the key and value classes, compression details, user-defined metadata, and the sync marker. 
•A MapFile is a sorted SequenceFile with an index to permit lookups by key (see the sketch below).
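•For completeness, a minimal MapFile sketch follows. It uses the older MapFile.Writer/Reader constructors (deprecated in recent Hadoop releases), the directory name and key range are arbitrary examples, and keys must be appended in sorted order or the writer throws an IOException. 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.FileSystem; 
import org.apache.hadoop.io.IntWritable; 
import org.apache.hadoop.io.MapFile; 
import org.apache.hadoop.io.Text; 

public class MapFileDemo { 
  public static void main(String[] args) throws Exception { 
    Configuration conf = new Configuration(); 
    FileSystem fs = FileSystem.get(conf); 
    String dir = "demo.map"; // a MapFile is a directory containing "data" and "index" files 

    // Write keys in strictly increasing order. 
    MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, IntWritable.class, Text.class); 
    try { 
      for (int i = 0; i < 1000; i++) { 
        writer.append(new IntWritable(i), new Text("value-" + i)); 
      } 
    } finally { 
      writer.close(); 
    } 

    // Look up a single key via the index. 
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf); 
    try { 
      Text value = new Text(); 
      reader.get(new IntWritable(42), value); // fills in the value for key 42 (returns null if absent) 
      System.out.println(value); 
    } finally { 
      reader.close(); 
    } 
  } 
}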
Compression 
•Hadoop allows users to compress output data, intermediate data, or both (see the configuration sketch below). 
•Hadoop checks whether input data is in a compressed format and decompresses the data as needed. 
•Compression codec: 
–two lossless codecs. 
–The default codec is gzip, a combination of the Lempel-Ziv 1977 (LZ77) algorithm and Huffman encoding. 
–The other codec implements the Lempel-Ziv-Oberhumer (LZO) algorithm, a variant of LZ77 optimized for decompression speed. 
•Compression unit: 
–Hadoop allows both per-record and per-block compression. 
–Thus, the record or block size affects the compressibility of the data.
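•As an illustration, a MapReduce driver might enable compression of intermediate map output and of the final job output along these lines. This is a sketch assuming the Hadoop 2.x property names (mapreduce.map.output.compress and mapreduce.map.output.compress.codec; older releases used mapred.* equivalents) and gzip as the output codec. 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.io.compress.CompressionCodec; 
import org.apache.hadoop.io.compress.GzipCodec; 
import org.apache.hadoop.mapreduce.Job; 
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; 

public class CompressionConfigSketch { 
  public static Job configure() throws Exception { 
    Configuration conf = new Configuration(); 

    // Compress intermediate map output (reduces shuffle I/O). 
    conf.setBoolean("mapreduce.map.output.compress", true); 
    conf.setClass("mapreduce.map.output.compress.codec", 
                  GzipCodec.class, CompressionCodec.class); 

    Job job = Job.getInstance(conf, "compressed-output-demo"); 

    // Compress the final job output. 
    FileOutputFormat.setCompressOutput(job, true); 
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class); 
    return job; 
  } 
}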
When to use compression? 
•Compression adds a read-time penalty, so why would one enable it at all? 
•There are a few reasons why the advantages of compression can outweigh the disadvantages: 
–Compression reduces the number of bytes written to/read from HDFS 
–Compression effectively improves the efficiency of network bandwidth and disk space 
–Compression reduces the size of data that needs to be read when issuing a read 
•To keep that read-time penalty low, a real-time (fast) compression library is preferred. 
•To achieve maximal performance and benefit, you must enable LZO. 
•What about parallelism?
Compression and Hadoop 
•Storing compressed data in HDFS allows your hardware allocation to go further, since compressed data is often 25% of the size of the original data. 
•Furthermore, since MapReduce jobs are nearly always I/O-bound, storing compressed data means there is less overall I/O to do, meaning jobs run faster. 
•There are two caveats to this, however: 
–some compression formats cannot be split for parallel processing, and 
–others are slow enough at decompression that jobs become CPU-bound, eliminating your gains on I/O.
gzip compression on Hadoop 
•The gzip compression format illustrates the first caveat, and to understand why we need to go back to how Hadoop’s input splits work. 
•Imagine you have a 1.1 GB gzip file, and your cluster has a 128 MB block size. 
•This file will be split into 9 chunks of approximately 128 MB each. 
•In order to process these in parallel in a MapReduce job, a different mapper will be responsible for each chunk. 
•But this means that the second mapper will start on an arbitrary byte about 128 MB into the file. 
•The context (dictionary) that gzip builds from the preceding input and uses to decompress later data will be missing at this point, which means the gzip decompressor will not be able to correctly interpret the bytes. 
•The upshot is that large gzip files in Hadoop need to be processed by a single mapper, which defeats the purpose of parallelism (a sketch of how splittability can be checked follows below).
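•To see how this plays out in the framework, here is a minimal sketch of the kind of check an input format can perform: it looks up the codec for a file from its extension and treats the file as splittable only if the codec implements SplittableCompressionCodec (which the bzip2 codec does and the gzip codec does not). The class and helper method names are illustrative. 
import org.apache.hadoop.conf.Configuration; 
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.compress.CompressionCodec; 
import org.apache.hadoop.io.compress.CompressionCodecFactory; 
import org.apache.hadoop.io.compress.SplittableCompressionCodec; 

public class SplittabilityCheck { 
  // Returns true if the file at 'path' can be split across mappers. 
  static boolean isSplittable(Configuration conf, Path path) { 
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path); 
    if (codec == null) { 
      return true; // not compressed: plain files are splittable 
    } 
    // Only codecs that can resynchronize at arbitrary offsets are splittable. 
    return codec instanceof SplittableCompressionCodec; 
  } 

  public static void main(String[] args) { 
    Configuration conf = new Configuration(); 
    System.out.println(isSplittable(conf, new Path("logs/part-0000.gz")));  // false 
    System.out.println(isSplittable(conf, new Path("logs/part-0000.bz2"))); // true 
  } 
}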
Bzip2 compression on Hadoop 
•For an example of the second caveat, in which jobs become CPU-bound, we can look to the bzip2 compression format. 
•Bzip2 files compress well and are even splittable, but the decompression algorithm is slow and cannot keep up with the streaming disk reads that are common in Hadoop jobs. 
•While bzip2 compression has some upside because it conserves storage space, running jobs now spend their time waiting on the CPU to finish decompressing data, which slows them down and offsets the other gains.
LZO and Elephant Bird 
•How can we split large compressed data and run it in parallel on Hadoop? 
•One of the biggest drawbacks of compression algorithms like gzip is that you can’t split their output across multiple mappers. 
•This is where LZO comes in. 
•Using LZO compression in Hadoop allows for 
–reduced data size and 
–shorter disk read times 
•LZO’s block-based structure allows it to be split into chunks for parallel processing in Hadoop. 
•Taken together, these characteristics make LZO an excellent compression format to use in your cluster. 
•Elephant Bird is Twitter's open source library of LZO, Thrift, and/or Protocol Buffer-related Hadoop InputFormats, OutputFormats, Writables, Pig LoadFuncs, Hive SerDes, HBase miscellanea, etc. 
•More: 
•https://github.com/twitter/hadoop-lzo 
•https://github.com/kevinweil/elephant-bird 
•https://code.google.com/p/protobuf/ (IDL)
End of session 
Day 1: Data Management
