Unit-2: Hadoop, HDFS, Hadoop Ecosystem
What is Hadoop
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes across the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and cluster resource management.
3. MapReduce: This is a framework that helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (a minimal word-count sketch follows this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop modules.
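To make the key-value flow concrete, here is a minimal word-count sketch in Java (the class names WordCountMapper and WordCountReducer are illustrative, not part of any Hadoop distribution): the mapper emits (word, 1) pairs and the reducer sums the counts for each word. A driver that submits these classes as a job appears later, in the section on how a job runs on MapReduce.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: converts each input line into (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) and emits (word, total count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}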
Hadoop Architecture
MapReduce
MapReduce is the processing/computation layer of Hadoop: a parallel programming model for processing large data sets on clusters of commodity hardware.
HDFS
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has
many similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be
deployed on low-cost hardware. It provides high throughput access to application data and is
suitable for applications having large datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing. As an alternative, you can tie together many commodity computers, each with a single CPU, into a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, it is cheaper than one high-end server. This is the first motivational factor behind Hadoop: it runs across clusters of low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core
tasks that Hadoop performs −
Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB); a client-side sketch follows this list.
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
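As a rough illustration of how block size and replication surface in client code, here is a hedged sketch using the HDFS Java API; the file path and the 128 MB / 3-replica values are example settings, not requirements.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS (or the local FS if not configured)

        Path file = new Path("/user/demo/sample.txt");   // example path
        long blockSize = 128L * 1024 * 1024;             // 128 MB blocks
        short replication = 3;                           // 3 replicas per block
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}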
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, and in turn utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures
at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues
to operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.
Components of Hadoop
Core components of Hadoop: There are two major components of the Hadoop framework, and each of them performs an important task for it.
1. Hadoop MapReduce is the method to split a larger data problem into smaller chunks and distribute them to many different commodity servers. Each server has its own set of resources, and it processes its chunks locally. Once the commodity servers have processed the data, they send it back collectively to the main server. This is effectively a process whereby we handle large data effectively and efficiently.
2. Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between HDFS and other file systems: when we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3) for fault tolerance and high availability.
Namenode: The namenode is the heart of the Hadoop system. The namenode manages the file system namespace. It stores the metadata information of the data blocks. This metadata is stored permanently on the local disk in the form of the namespace image and the edit log file. The namenode also knows the location of the data blocks on the data nodes. However, the namenode does not store this information persistently; it recreates the block-to-datanode mapping when it is restarted. If the namenode crashes, the entire Hadoop system goes down.
Job Tracker: The Job Tracker's responsibility is to schedule the clients' jobs. The Job Tracker creates map and reduce tasks and schedules them to run on the DataNodes (task trackers). The Job Tracker also checks for any failed tasks and reschedules them on another DataNode. The Job Tracker can run on the NameNode or on a separate node.
Task Tracker: Task Trackers run on the DataNodes. The Task Tracker's responsibility is to run the map or reduce tasks assigned by the Job Tracker and to report the status of the tasks back to the Job Tracker.
3. Common: Common utilities for the other Hadoop modules
4. Hadoop Yarn: A framework for job scheduling and cluster resource management
Features of HDFS
The built-in servers of the namenode and datanode help users easily check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software; the namenode itself is software that can run on commodity hardware. The system hosting the namenode acts as the master server, and it does the following tasks −
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files
and directories.
Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB (128 MB in newer Hadoop versions), but it can be increased as needed by changing the HDFS configuration.
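The block size, replication factor, and block locations described above can be inspected from a client program. Below is a hedged sketch using the Hadoop FileSystem API; the file path is only an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // example file

        System.out.println("Block size: " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // The namenode answers this query from its block-to-datanode mapping.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Offset " + loc.getOffset()
                    + " length " + loc.getLength()
                    + " hosts " + String.join(",", loc.getHosts()));
        }
    }
}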
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes
place near the data. Especially where huge datasets are involved, it reduces the network
traffic and increases the throughput.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.
o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache
Nutch. It is an open source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing and processing that data was very costly, and this became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies the
data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also includes Map reduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort one terabyte of data, doing so on a 900-node cluster in 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.
Below are some of the most common formats of the Hadoop ecosystem:
Text/CSV
A plain text file or CSV is the most common format both outside and within the Hadoop ecosystem. The great disadvantage of this format is that it does not support block compression, so compressing a CSV file in Hadoop can impose a high cost on reading.
SequenceFile
The SequenceFile format stores the data in binary form. This format supports compression; however, it does not store metadata, and the only option for evolving its schema is to add new fields at the end. It is usually used to store intermediate data in the input and output of MapReduce processes.
Avro
Avro is a row-based storage format. This format includes, in each file, the definition of the schema of its data in JSON format, improving interoperability and allowing schema evolution. Avro also allows block compression in addition to being splittable, making it a good choice for most cases when using Hadoop.
Parquet
Parquet is a column-based (columnar) binary storage format that can store nested data structures. This format is very efficient in terms of disk input/output operations when only the necessary columns are specified. This format is also highly optimized for use with Cloudera Impala.
Hadoop Ecosystem
Note: Apart from the above-mentioned components, there are many other components too
that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one thing, i.e. data. That's the beauty of Hadoop: it revolves around data, which makes its synthesis easier.
HDFS:
HDFS is the primary storage component of the Hadoop ecosystem; its architecture has already been described above.
YARN:
Yet Another Resource Negotiator: as the name implies, YARN helps to manage the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
MapReduce is the data-processing component of the ecosystem; its architecture is described in detail later in this unit.
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop Ecosystem (a small embedding sketch follows this list).
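As a rough sketch of how Pig Latin can be embedded in Java through the PigServer class (running Pig scripts from the Grunt shell or a .pig file is more common; the file names below are made up for illustration):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigDemo {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for a quick test; use ExecType.MAPREDUCE against a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registerQuery call adds one Pig Latin statement to the logical plan.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // store() triggers execution; results land in the 'wordcount_out' directory.
        pig.store("counts", "wordcount_out");
        pig.shutdown();
    }
}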
HIVE:
With the help of SQL methodology and an SQL-like interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time processing and batch processing.
Also, all the SQL data types are supported by Hive, making query processing easier.
Similar to other query-processing frameworks, HIVE comes with two components: JDBC drivers and the HIVE command line.
The JDBC and ODBC drivers establish the data-storage permissions and the connection, whereas the HIVE command line helps in the processing of queries (a JDBC sketch follows this list).
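Because Hive ships JDBC drivers, a query can be issued from a Java program. The sketch below is only an outline: the connection URL, credentials, and the employees table are placeholders for your own environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Only needed for older driver versions; newer drivers auto-register.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; host, port, database, user, and password are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; 'employees' is an example table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}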
Mahout:
Mahout provides a library of scalable machine-learning algorithms (such as clustering, classification, and collaborative filtering) that can run on top of Hadoop.
Apache Spark:
It's a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, visualization, etc.
It consumes in-memory resources, and is hence faster than the prior options in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used interchangeably in most companies.
Apache HBase:
It's a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable and is thus able to work on big data sets effectively.
When we need to search for or retrieve a few small occurrences in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing and looking up such limited data (a client sketch follows this list).
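A minimal sketch of the HBase Java client API for a point write and a point read follows; the table name, column family, and row key are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {   // example table

            // Write one cell: row "u42", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row: the short-latency lookup described above.
            Result result = table.get(new Get(Bytes.toBytes("u42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}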
Hadoop Streaming
Hadoop Streaming is a utility that ships with the Hadoop distribution. It lets any executable or script act as the mapper or reducer: the utility creates map and reduce jobs and submits them to the cluster. These jobs can also be monitored with this utility.
How does it Work?
The scripts specified for the mapper and reducer work as described below.
After the mapper script is fully initialized, an instance of the script is launched with its own process id. While running, the mapper task takes the input lines and passes them to the script's standard input; at the same time, the output from the process's standard output is collected by the mapper, which converts each line into a key-value pair. The set of key-value pairs is then collected as the output of the mapper. The key-value split is based on the first tab character: the part of the line up to the initial tab is taken as the key, while the rest of the line is taken as the value. If there is no tab in a line, the entire line is taken as the key and the value is empty. This behaviour can be adjusted according to business needs.
Purpose
It is used for real-time data ingestion, which can be used in different real-time apps. Examples include watching stock portfolios, share-market analysis, narrating weather reports, and traffic alerts, all of which can be built using Hadoop Streaming.
Working of Hadoop Streaming
Below is a simple example of how it works:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /bin/wc
The -input option specifies the input directory, while the -output option specifies the output directory. The -mapper option specifies the mapper executable or class, while the -reducer option specifies the reducer executable or class.
Advantages
Below are the advantages explained:
1. Availability
This doesn't require any extra, separate software to be installed and managed. Other tools, like Pig and Hive, need to be installed and managed separately.
2. Learning
It does not require learning new technologies; it can be leveraged with minimal Unix skills for data analysis.
3. Reduce Development Time
It only requires writing mapper and reducer code (for example, as Unix scripts) when developing streaming applications, whereas doing the same work as a Java MapReduce application is more complex: it needs to be compiled, tested, packaged, exported as a JAR file, and then run.
4. Faster Conversion
It takes very little time to convert data from one format to another using Hadoop Streaming. We can use it for converting data from a text file to a sequence file, from a sequence file back to a text file, and so on. This can be achieved using the input-format and output-format options in Hadoop Streaming.
5. Testing
Input and output data can be quickly tested by using it with Unix or Shell Script.
6. Requirement for Business
For simple business requirements like simple filtering and aggregation operations, we can use this together with Unix tools.
7. Performance
Using this, we can get better performance while working with streaming data. There are also several disadvantages of Hadoop Streaming, which are addressed by using other tools in the Hadoop ecosystem such as Kafka, Flume, and Spark.
MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing over large data sets in a distributed manner. The data is first split and then combined to produce the final result. Libraries for MapReduce have been written in many programming languages, with a variety of different optimizations. The purpose of MapReduce in Hadoop is to map each job into smaller equivalent tasks and then reduce them, which lowers the overhead on the cluster network and the processing power required. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the
MapReduce for processing. There can be multiple clients available that
continuously send jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wanted to do
which is comprised of so many smaller tasks that the client wants to process
or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent
job-parts.
4. Job-Parts: The task or sub-jobs that are obtained after dividing the main
job. The result of all the job-parts combined to produce the final output.
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result is obtained after the processing.
In MapReduce, we have a client. The client will submit the job of a particular size to
the Hadoop MapReduce Master. Now, the MapReduce master will divide this job into
further equivalent job-parts. These job-parts are then made available for the Map and
Reduce Task. This Map and Reduce task will contain the program as per the
requirement of the use-case that the particular company is solving. The developer writes the logic to fulfil these requirements. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. The output of the Map, i.e. these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as per the requirement. The Map and Reduce logic should be written in a well-optimized way so that the time and space complexity are minimal.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce
phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the Map may itself be a key-value pair, where the key can be the id of some kind of address and the value is the actual content. The Map() function is executed, in its memory repository, on each of these input key-value pairs and generates intermediate key-value pairs, which work as the input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on the key, as per the reducer algorithm written by the developer.
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of Job tracker is to manage all the resources and all
the jobs across the cluster and also to schedule each map on the Task
Tracker running on the same data node since there can be hundreds of data
nodes available in the cluster.
2. Task Tracker: The Task Trackers can be considered the actual slaves that work on the instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes available in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
There is also one important component of MapReduce Architecture known as Job
History Server. The Job History Server is a daemon process that saves and stores
historical information about the task or application, like the logs which are generated
during or after the job execution are stored on Job History Server.
How Job runs on MapReduce
A MapReduce job can be run with a single method call: submit() on a Job object (you can also call waitForCompletion(), which submits the job if it hasn't been submitted already and then waits for it to finish).
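A hedged driver sketch showing that call is given below. The mapper and reducer classes refer to the word-count sketch earlier in this unit, and the input/output paths are passed as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);      // from the earlier sketch
        job.setReducerClass(WordCountReducer.class);    // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist

        // waitForCompletion() submits the job (if not yet submitted) and waits for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}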
Let’s understand the components –
1. Client: submits the MapReduce job.
2. YARN node manager: monitors and launches the compute containers on the machines in the cluster.
3. YARN resource manager: handles the coordination and allocation of compute resources on the cluster.
4. MapReduce application master: coordinates the tasks running the MapReduce job.
5. Distributed file system: shares job files with the other entities.
Hadoop File Formats
Basic file formats are: Text format, Key-Value format, and Sequence format.
Other well-known formats are: Avro, Parquet, RC (Row Columnar) format, and ORC (Optimized Row Columnar) format.
The need
A file format is just a way to define how information is stored in the HDFS file system. This is usually driven by the use case or by the processing algorithms for a specific domain. A file format should be well-defined and expressive; it should be able to handle a variety of data structures, specifically structs, records, maps, and arrays, along with strings, numbers, etc. A file format should be simple, binary, and compressed. When dealing with Hadoop's file system, not only do you have all of the traditional storage formats available to you (for example, you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats to use for structured and unstructured data.
A huge bottleneck for HDFS-enabled applications like MapReduce and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are exacerbated by the difficulties of managing large datasets, such as evolving schemas or storage constraints. The various Hadoop file formats have evolved as a way to ease these issues across a number of use cases. Choosing an appropriate file format can have some significant benefits:
1. Faster read times
2. Faster write times
3. Splittable files (so you don't need to read the whole file, just a part of it)
4. Schema evolution support (allowing you to change the fields in a dataset)
5. Advanced compression support (compress the columnar files with a compression codec without sacrificing these features)
Some file formats are designed for general use (like MapReduce or Spark), others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice.
The generic characteristics by which formats are classified are: expressive, simple, binary, compressed, and integrity-preserving, to name a few. Formats are typically text-based, serial (row-based), or columnar.
Since Protocol Buffers and Thrift are serializable but not splittable, they are not very popular for HDFS use cases, and thus Avro becomes the first choice.
Simple text-based files are common in the non-Hadoop world, and they're super common in the Hadoop world too. Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX fashion. Text files are inherently splittable (just split on \n characters!), but if you want to compress them you'll have to use a file-level compression codec that supports splitting, such as BZIP2. Because these files are just text files, you can encode anything you like in a line of the file. One common example is to make each line a JSON document to add some structure. While this can waste space with needless column headers, it is a simple way to start using structured data in HDFS.
TextInputFormat is an input format for plain text files. Files are broken into lines; either a linefeed or a carriage return is used to signal the end of a line. Keys are the positions in the file, and values are the lines of text. Advantages: lightweight. Disadvantages: slow to read and write; compressed files cannot be split (which leads to huge maps).
Sequence files were originally designed for MapReduce, so the integration is smooth. They
encode a key and a value for each record and nothing more. Records are stored in a binary format
that is smaller than a text-based format would be. Like text files, the format does not encode the
structure of the keys and values, so if you make schema migrations they must be additive.
Typically if you need to store complex data in a sequence file you do so in the value part while
encoding the id in the key. The problem with this is that if you add or change fields in your
Writable class it will not be backwards compatible with the data stored in the sequence file. One
benefit of sequence files is that they support block-level compression, so you can compress the
contents of the file while also maintaining the ability to split the file into segments for multiple
map tasks.
In addition to text files, Hadoop also provides support for binary files. Among these binary file formats, Hadoop Sequence Files are a Hadoop-specific file format that stores serialized key/value pairs. Advantages: compact compared to text files; optional compression support; parallel processing; a container for huge numbers of small files. Disadvantages: not good for Hive; append-only like the other data formats; multi-language support not yet provided. One key benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. Sequence files are well supported across Hadoop and many other HDFS-enabled projects, and I think they represent the easiest next step away from text files (a write sketch follows).
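Below is a hedged sketch of writing a block-compressed sequence file with the Hadoop API; the output path and records are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("pairs.seq");   // example output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                // block-level compression keeps the file splittable for multiple map tasks
                SequenceFile.Writer.compression(CompressionType.BLOCK))) {
            writer.append(new Text("alpha"), new IntWritable(1));
            writer.append(new Text("beta"), new IntWritable(2));
        }
    }
}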
RCFILE stands for Record Columnar File, another type of binary file format, which offers a high compression rate on top of the rows and is used when we want to perform operations on multiple rows at a time. RCFILEs are flat files consisting of binary key/value pairs, which share much similarity with sequence files. RCFILE stores the columns of a table as records, in a columnar manner. It first partitions rows horizontally into row splits and then vertically partitions each row split in a columnar way. RCFILE stores the metadata of a row split as the key part of a record, and all the data of a row split as the value part. This means that RCFILE encourages column-oriented storage rather than row-oriented storage. Column-oriented storage is very useful when performing analytics; it is easier to perform analytics when we have a column-oriented storage type. We cannot load data into an RCFILE directly: first we need to load the data into another table, and then we need to overwrite it into our newly created RCFILE table.
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC reduces the size of the original data by up to 75%. As a result, the speed of data processing also increases, and ORC shows better performance than the Text, Sequence, and RC file formats. An ORC file contains row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is reading, writing, and processing data. We cannot load data into an ORC file directly: first we need to load the data into another table, and then we need to overwrite it into our newly created ORC table.
RC and ORC show better performance than the Text and Sequence file formats. Comparing RC and ORC, ORC is always better, as it takes less time to access the data and less space to store it. However, the ORC file increases CPU overhead by increasing the time it takes to decompress the relational data. The ORC file format was introduced with Hive 0.11 and cannot be used with previous versions (a write sketch follows).
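Below is a hedged sketch of writing an ORC file with the orc-core Java API (the struct schema and the values are made up; from Hive, one would normally just declare a table stored as ORC).

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TypeDescription schema = TypeDescription.fromString("struct<id:int,name:string>");

        Writer writer = OrcFile.createWriter(new Path("people.orc"),
                OrcFile.writerOptions(conf).setSchema(schema));

        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        BytesColumnVector name = (BytesColumnVector) batch.cols[1];

        int row = batch.size++;                      // add one row to the batch
        id.vector[row] = 1;
        name.setVal(row, "alice".getBytes(StandardCharsets.UTF_8));

        writer.addRowBatch(batch);                   // rows are written out in stripes
        writer.close();
    }
}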
AVRO Format
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting,
the father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes
quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a
preferred tool for serializing data in Hadoop. Avro is an opinionated format which understands that data stored in HDFS is usually not a simple key/value combination like int/string. The format encodes the schema of its contents directly in the file, which allows you to store complex objects natively. Honestly, Avro is not just a file format; it is a file format plus a serialization and de-serialization framework. With plain old sequence files you can store complex objects, but you have to manage the process yourself. Avro handles this complexity while providing other tools to help manage data over time. It is a well-thought-out format which defines file data schemas in JSON (for interoperability), allows for schema evolution (remove a column, add a column), and supports multiple serialization/deserialization use cases. It also supports block-level compression. For most Hadoop-based use cases, Avro is a really good choice.
Avro depends heavily on its schema. It allows data to be written and later read with no prior knowledge of the schema on the reader's side, because the schema is stored along with the Avro data in the file for any further processing. It serializes fast, and the resulting serialized data is smaller in size. In RPC, the client and the server exchange schemas during the connection; this exchange helps in the communication between same-named fields, missing fields, extra fields, etc. Avro schemas are defined in JSON, which simplifies implementation in languages with JSON libraries. Like Avro, there are other serialization mechanisms in Hadoop, such as Sequence Files, Protocol Buffers, and Thrift.
Thrift and Protocol Buffers are the libraries that compete most closely with Avro. Avro differs from these frameworks in the following ways −
Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types; these IDLs are used to generate code for serialization and deserialization.
Avro is built into the Hadoop ecosystem, while Thrift and Protocol Buffers are not. Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in any proprietary IDL (a short write sketch follows).
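Below is a hedged sketch of defining an Avro schema in JSON and writing a record with the Avro Java library; the User record is purely an example.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteDemo {
    public static void main(String[] args) throws Exception {
        // The schema is plain JSON; it travels with the data inside the file.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"int\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1);
        user.put("name", "Alice");

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));   // schema is embedded in the file header
            writer.append(user);                             // records are stored in compressible blocks
        }
    }
}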
Parquet Format
The latest hotness in file formats for Hadoop is columnar file storage, and the Parquet file format is a columnar format. Basically this means that instead of just storing rows of data adjacent to one another, you also store column values adjacent to each other, so datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing framework just needs access to a subset of the data stored on disk, as it can access all values of a single column very quickly without reading whole records. Just like the ORC file, Parquet is great for compression and offers great query performance, and it is especially efficient when querying data from specific columns. The Parquet format is computationally intensive on the write side, but it reduces a lot of I/O cost to give great read performance. It enjoys more freedom than the ORC file in schema evolution, in that it can add new columns at the end of the structure. If you are chopping and cutting up datasets regularly then
columns to the end of the structure. If you’re chopping and cutting up datasets regularly then
these formats can be very beneficial to the speed of your application, but frankly if you have an
application that usually needs entire rows of data then the columnar formats may actually be a
detriment to performance due to the increased network activity required. One huge benefit of
columnar oriented file formats is that data in the same column tends to be compressed together
which can yield some massive storage optimizations (as data in the same column tends to be
similar). It supports both File-Level Compression and Block-Level Compression. File-level
compression means you compress entire files regardless of the file format, the same way you
would compress a file in Linux. Some of these formats are splittable (e.g. bzip2, or LZO if
indexed). Block-level compression is internal to the file format, so individual blocks of data
within the file
are compressed. This means that the file remains splittable even if you use a non-splittable
compression codec like Snappy. However, this is only an option if the specific file format
supports it.
Summary
Overall these formats can drastically optimize workloads, especially for Hive and Spark, which tend to read just segments of records rather than the whole thing (which is more common in MapReduce). Since Avro and Parquet have so much in common, when choosing
a file format to use with HDFS, we need to consider read performance and write performance.
Because the nature of HDFS is to store data that is written once and read multiple times, we want to emphasize read performance. The fundamental difference in how to use either format is this: Avro is a row-based format; if you want to retrieve the data as a whole, you can use Avro. Parquet is a column-based format; if your data consists of a lot of columns but you are interested in only a subset of them, you can use Parquet. Hopefully by now you've learned a little
about what file formats actually are and why you would think of choosing a specific one. We’ve
discussed the main characteristics of common file formats and talked a little about compression.
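To close, here is a hedged sketch of writing a Parquet file through the parquet-avro bridge; it assumes the parquet-avro dependency is on the classpath, reuses the example User schema from the Avro sketch, and uses the classic builder(Path) form of the API.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteDemo {
    public static void main(String[] args) throws Exception {
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"int\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1);
        user.put("name", "Alice");

        // Column chunks are compressed independently, which is where the storage savings come from.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("users.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            writer.write(user);
        }
    }
}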