Unit-2: Hadoop, HDFS, Hadoop Ecosystem
What is Hadoop
Hadoop is an open-source framework from Apache used to store, process, and analyze data that is very large in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn, and many more. Moreover, it can be scaled up just by adding nodes to the cluster.
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its GFS paper, and HDFS was developed on the basis of it. It states that files will be broken into blocks and stored on nodes across the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and cluster resource management.
3. MapReduce: This is a framework that helps Java programs do parallel computation on data using key-value pairs. The Map task takes input data and converts it into a data set that can be computed as key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer gives the desired result (a minimal word-count sketch follows this list).
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other Hadoop modules.
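To make the key-value flow concrete, here is a minimal word-count sketch in Java (the class names WordCountMapper and WordCountReducer are illustrative, not part of any Hadoop distribution): the mapper emits (word, 1) pairs and the reducer sums the counts for each word. A driver that submits these classes as a job appears later, in the section on how a job runs on MapReduce.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: converts each input line into (word, 1) key-value pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

// Reducer: receives (word, [1, 1, ...]) and emits (word, total count).
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));
    }
}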
Hadoop Architecture
MapReduce
MapReduce is the processing/computation layer of Hadoop: a parallel programming model for processing large data sets on clusters of commodity hardware.
HDFS
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and
provides a distributed file system that is designed to run on commodity hardware. It has
many similarities with existing distributed file systems. However, the differences from other
distributed file systems are significant. It is highly fault-tolerant and is designed to be
deployed on low-cost hardware. It provides high throughput access to application data and is
suitable for applications having large datasets.
Apart from the above-mentioned two core components, Hadoop framework also includes the
following two modules −
Hadoop Common − These are Java libraries and utilities required by other Hadoop
modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
How Does Hadoop Work?
It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing. As an alternative, you can tie together many commodity computers, each with a single CPU, into a single functional distributed system; practically, the clustered machines can read the dataset in parallel and provide much higher throughput. Moreover, it is cheaper than one high-end server. This is the first motivational factor behind Hadoop: it runs across clusters of low-cost machines.
Hadoop runs code across a cluster of computers. This process includes the following core
tasks that Hadoop performs −
Data is initially divided into directories and files. Files are divided into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB); a client-side sketch follows this list.
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
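As a rough illustration of how block size and replication surface in client code, here is a hedged sketch using the HDFS Java API; the file path and the 128 MB / 3-replica values are example settings, not requirements.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // handle to HDFS (or the local FS if not configured)

        Path file = new Path("/user/demo/sample.txt");   // example path
        long blockSize = 128L * 1024 * 1024;             // 128 MB blocks
        short replication = 3;                           // 3 replicas per block
        int bufferSize = 4096;

        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        }
    }
}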
Advantages of Hadoop
The Hadoop framework allows the user to quickly write and test distributed systems. It is efficient, and it automatically distributes the data and work across the machines, and in turn utilizes the underlying parallelism of the CPU cores.
Hadoop does not rely on hardware to provide fault-tolerance and high availability
(FTHA), rather Hadoop library itself has been designed to detect and handle failures
at the application layer.
Servers can be added or removed from the cluster dynamically and Hadoop continues
to operate without interruption.
Another big advantage of Hadoop is that, apart from being open source, it is compatible with all platforms since it is Java based.
Components of Hadoop
Core components of Hadoop: There are two major components of the Hadoop framework, and each of them performs an important task for it.
1. Hadoop MapReduce is the method to split a larger data problem into smaller chunks and distribute them to many different commodity servers. Each server has its own set of resources, and it processes its chunks locally. Once the commodity servers have processed the data, they send it back collectively to the main server. This is effectively a process whereby we handle large data effectively and efficiently.
2. Hadoop Distributed File System (HDFS) is a virtual file system. There is a big difference between HDFS and other file systems: when we move a file onto HDFS, it is automatically split into many small pieces. These small chunks of the file are replicated and stored on other servers (usually 3) for fault tolerance and high availability.
Namenode: The namenode is the heart of the Hadoop system. The namenode manages the file system namespace. It stores the metadata information of the data blocks. This metadata is stored permanently on the local disk in the form of the namespace image and the edit log file. The namenode also knows the location of the data blocks on the data nodes. However, the namenode does not store this information persistently; it recreates the block-to-datanode mapping when it is restarted. If the namenode crashes, the entire Hadoop system goes down.
Job Tracker: The Job Tracker's responsibility is to schedule the clients' jobs. The Job Tracker creates map and reduce tasks and schedules them to run on the DataNodes (task trackers). The Job Tracker also checks for any failed tasks and reschedules them on another DataNode. The Job Tracker can run on the NameNode or on a separate node.
Task Tracker: Task Trackers run on the DataNodes. The Task Tracker's responsibility is to run the map or reduce tasks assigned by the Job Tracker and to report the status of the tasks back to the Job Tracker.
3. Common: Common utilities for the other Hadoop modules
4. Hadoop Yarn: A framework for job scheduling and cluster resource management
Features of HDFS
The built-in servers of the namenode and datanode help users easily check the status of the cluster.
Streaming access to file system data.
HDFS provides file permissions and authentication.
HDFS Architecture
HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software; the namenode itself is software that can run on commodity hardware. The system hosting the namenode acts as the master server, and it does the following tasks −
Manages the file system namespace.
Regulates client’s access to files.
It also executes file system operations such as renaming, closing, and opening files
and directories.
Datanode
The datanode is commodity hardware running the GNU/Linux operating system and the datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
Generally, user data is stored in the files of HDFS. A file in the file system is divided into one or more segments, which are stored on individual data nodes. These file segments are called blocks. In other words, the minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB (128 MB in newer Hadoop versions), but it can be increased as needed by changing the HDFS configuration.
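The block size, replication factor, and block locations described above can be inspected from a client program. Below is a hedged sketch using the Hadoop FileSystem API; the file path is only an example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfoDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/sample.txt")); // example file

        System.out.println("Block size: " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());

        // The namenode answers this query from its block-to-datanode mapping.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Offset " + loc.getOffset()
                    + " length " + loc.getLength()
                    + " hosts " + String.join(",", loc.getHosts()));
        }
    }
}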
Goals of HDFS
Fault detection and recovery − Since HDFS includes a large number of commodity hardware components, failure of components is frequent. Therefore, HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the
applications having huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes
place near the data. Especially where huge datasets are involved, it reduces the network
traffic and increases the throughput.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.
o In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache
Nutch. It is an open source web crawler software project.
o While working on Apache Nutch, they were dealing with big data. Storing and processing that data was very costly, and this became one of the important reasons for the emergence of Hadoop.
o In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
o In 2004, Google released a white paper on Map Reduce. This technique simplifies the
data processing on large clusters.
o In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as
NDFS (Nutch Distributed File System). This file system also includes Map reduce.
o In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
o Doug Cutting named his project Hadoop after his son's toy elephant.
o In 2007, Yahoo ran two clusters of 1000 machines.
o In 2008, Hadoop became the fastest system to sort one terabyte of data, doing so on a 900-node cluster in 209 seconds.
o In 2013, Hadoop 2.2 was released.
o In 2017, Hadoop 3.0 was released.
Below are some of the most common formats of the Hadoop ecosystem:
Text/CSV
A plain text file or CSV is the most common format both outside and within the Hadoop ecosystem. The great disadvantage of this format is that it does not support block compression, so compressing a CSV file in Hadoop can impose a high cost on reading.
SequenceFile
The SequenceFile format stores the data in binary form. This format supports compression; however, it does not store metadata, and the only option for evolving its schema is to add new fields at the end. It is usually used to store intermediate data in the input and output of MapReduce processes.
Avro
Avro is a row-based storage format. This format includes, in each file, the definition of the schema of its data in JSON format, improving interoperability and allowing schema evolution. Avro also allows block compression in addition to being splittable, making it a good choice for most cases when using Hadoop.
Parquet
Parquet is a column-based (columnar) binary storage format that can store nested data structures. This format is very efficient in terms of disk input/output operations when only the necessary columns are specified. This format is also highly optimized for use with Cloudera Impala.
Hadoop Ecosystem
Note: Apart from the above-mentioned components, there are many other components too
that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one thing, i.e. data. That's the beauty of Hadoop: it revolves around data, which makes its synthesis easier.
HDFS:
HDFS is the primary storage component of the Hadoop ecosystem; its architecture has already been described above.
YARN:
Yet Another Resource Negotiator: as the name implies, YARN helps to manage the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
MapReduce is the data-processing component of the ecosystem; its architecture is described in detail later in this unit.
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
It is a platform for structuring the data flow and for processing and analyzing huge data sets.
Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on Pig Runtime, just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop Ecosystem (a small embedding sketch follows this list).
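As a rough sketch of how Pig Latin can be embedded in Java through the PigServer class (running Pig scripts from the Grunt shell or a .pig file is more common; the file names below are made up for illustration):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigDemo {
    public static void main(String[] args) throws Exception {
        // LOCAL mode for a quick test; use ExecType.MAPREDUCE against a cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Each registerQuery call adds one Pig Latin statement to the logical plan.
        pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // store() triggers execution; results land in the 'wordcount_out' directory.
        pig.store("counts", "wordcount_out");
        pig.shutdown();
    }
}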
HIVE:
With the help of SQL methodology and an SQL-like interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time processing and batch processing.
Also, all the SQL data types are supported by Hive, making query processing easier.
Similar to other query-processing frameworks, HIVE comes with two components: JDBC drivers and the HIVE command line.
The JDBC and ODBC drivers establish the data-storage permissions and the connection, whereas the HIVE command line helps in the processing of queries (a JDBC sketch follows this list).
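Because Hive ships JDBC drivers, a query can be issued from a Java program. The sketch below is only an outline: the connection URL, credentials, and the employees table are placeholders for your own environment.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // Only needed for older driver versions; newer drivers auto-register.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; host, port, database, user, and password are placeholders.
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // HQL looks like SQL; 'employees' is an example table.
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}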
Mahout:
Mahout provides a library of scalable machine-learning algorithms (such as clustering, classification, and collaborative filtering) that can run on top of Hadoop.
Apache Spark:
It's a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, visualization, etc.
It consumes in-memory resources, and is hence faster than the prior options in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used interchangeably in most companies.
Apache HBase:
It's a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable and is thus able to work on big data sets effectively.
When we need to search for or retrieve a few small occurrences in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing and looking up such limited data (a client sketch follows this list).
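A minimal sketch of the HBase Java client API for a point write and a point read follows; the table name, column family, and row key are made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {   // example table

            // Write one cell: row "u42", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("u42"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same row: the short-latency lookup described above.
            Result result = table.get(new Get(Bytes.toBytes("u42")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}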
Hadoop Streaming
Hadoop Streaming is a utility that ships with the Hadoop distribution. It lets any executable or script act as the mapper or reducer: the utility creates map and reduce jobs and submits them to the cluster. These jobs can also be monitored with this utility.
How does it Work?
The scripts specified for the mapper and reducer work as described below.
After the mapper script is fully initialized, an instance of the script is launched with its own process id. While running, the mapper task takes the input lines and passes them to the script's standard input; at the same time, the output from the process's standard output is collected by the mapper, which converts each line into a key-value pair. The set of key-value pairs is then collected as the output of the mapper. The key-value split is based on the first tab character: the part of the line up to the initial tab is taken as the key, while the rest of the line is taken as the value. If there is no tab in a line, the entire line is taken as the key and the value is empty. This behaviour can be adjusted according to business needs.
Purpose
It is used for real-time data ingestion, which can be used in different real-time apps. Examples include watching stock portfolios, share-market analysis, narrating weather reports, and traffic alerts, all of which can be built using Hadoop Streaming.
Working of Hadoop Streaming
Below is a simple example of how it works:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer /bin/wc
The -input option specifies the input directory, while the -output option specifies the output directory. The -mapper option specifies the mapper executable or class, while the -reducer option specifies the reducer executable or class.
Advantages
Below are the advantages explained:
1. Availability
This doesn't require any extra, separate software to be installed and managed. Other tools, like Pig and Hive, need to be installed and managed separately.
2. Learning
It does not require learning new technologies; it can be leveraged with minimal Unix skills for data analysis.
3. Reduce Development Time
It only requires writing mapper and reducer code (for example, as Unix scripts) when developing streaming applications, whereas doing the same work as a Java MapReduce application is more complex: it needs to be compiled, tested, packaged, exported as a JAR file, and then run.
4. Faster Conversion
It takes very little time to convert data from one format to another using Hadoop Streaming. We can use it for converting data from a text file to a sequence file, from a sequence file back to a text file, and so on. This can be achieved using the input-format and output-format options in Hadoop Streaming.
5. Testing
Input and output data can be quickly tested by using it with Unix or Shell Script.
6. Requirement for Business
For simple business requirements like simple filtering and aggregation operations, we can use this together with Unix tools.
7. Performance
Using this, we can get better performance while working with streaming data. There are also several disadvantages of Hadoop Streaming, which are addressed by using other tools in the Hadoop ecosystem such as Kafka, Flume, and Spark.
MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing over large data sets in a distributed manner. The data is first split and then combined to produce the final result. Libraries for MapReduce have been written in many programming languages, with a variety of different optimizations. The purpose of MapReduce in Hadoop is to map each job into smaller equivalent tasks and then reduce them, which lowers the overhead on the cluster network and the processing power required. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the
MapReduce for processing. There can be multiple clients available that
continuously send jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wanted to do
which is comprised of so many smaller tasks that the client wants to process
or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent
job-parts.
4. Job-Parts: The task or sub-jobs that are obtained after dividing the main
job. The result of all the job-parts combined to produce the final output.
5. Input Data: The data set that is fed to the MapReduce for processing.
6. Output Data: The final result is obtained after the processing.
In MapReduce, we have a client. The client will submit the job of a particular size to
the Hadoop MapReduce Master. Now, the MapReduce master will divide this job into
further equivalent job-parts. These job-parts are then made available for the Map and
Reduce Task. This Map and Reduce task will contain the program as per the
requirement of the use-case that the particular company is solving. The developer writes the logic to fulfil these requirements. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. The output of the Map, i.e. these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as per the requirement. The Map and Reduce logic should be written in a well-optimized way so that the time and space complexity are minimal.
Let’s discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce
phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the Map may itself be a key-value pair, where the key can be the id of some kind of address and the value is the actual content. The Map() function is executed, in its memory repository, on each of these input key-value pairs and generates intermediate key-value pairs, which work as the input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on the key, as per the reducer algorithm written by the developer.
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of Job tracker is to manage all the resources and all
the jobs across the cluster and also to schedule each map on the Task
Tracker running on the same data node since there can be hundreds of data
nodes available in the cluster.
2. Task Tracker: The Task Trackers can be considered the actual slaves that work on the instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes available in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
There is also one important component of MapReduce Architecture known as Job
History Server. The Job History Server is a daemon process that saves and stores
historical information about the task or application, like the logs which are generated
during or after the job execution are stored on Job History Server.
How Job runs on MapReduce
A MapReduce job can be run with a single method call: submit() on a Job object (you can also call waitForCompletion(), which submits the job if it hasn't been submitted already and then waits for it to finish).
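A hedged driver sketch showing that call is given below. The mapper and reducer classes refer to the word-count sketch earlier in this unit, and the input/output paths are passed as command-line arguments.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);      // from the earlier sketch
        job.setReducerClass(WordCountReducer.class);    // from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an input directory on HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not already exist

        // waitForCompletion() submits the job (if not yet submitted) and waits for it to finish.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}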
Let’s understand the components –
1. Client: submits the MapReduce job.
2. YARN node manager: monitors and launches the compute containers on the machines in the cluster.
3. YARN resource manager: handles the coordination and allocation of compute resources on the cluster.
4. MapReduce application master: coordinates the tasks running the MapReduce job.
5. Distributed file system: shares job files with the other entities.
Hadoop File Formats
Basic file formats are: Text format, Key-Value format, and Sequence format.
Other well-known formats are: Avro, Parquet, RC (Row Columnar) format, and ORC (Optimized Row Columnar) format.
The need
A file format is just a way to define how information is stored in the HDFS file system. This is usually driven by the use case or by the processing algorithms for a specific domain. A file format should be well-defined and expressive; it should be able to handle a variety of data structures, specifically structs, records, maps, and arrays, along with strings, numbers, etc. A file format should be simple, binary, and compressed. When dealing with Hadoop's file system, not only do you have all of the traditional storage formats available to you (for example, you can store PNG and JPG images on HDFS if you like), but you also have some Hadoop-focused file formats to use for structured and unstructured data.
A huge bottleneck for HDFS-enabled applications like MapReduce and Spark is the time it takes to find relevant data in a particular location and the time it takes to write the data back to another location. These issues are exacerbated by the difficulties of managing large datasets, such as evolving schemas or storage constraints. The various Hadoop file formats have evolved as a way to ease these issues across a number of use cases. Choosing an appropriate file format can have some significant benefits:
1. Faster read times
2. Faster write times
3. Splittable files (so you don't need to read the whole file, just a part of it)
4. Schema evolution support (allowing you to change the fields in a dataset)
5. Advanced compression support (compress the columnar files with a compression codec without sacrificing these features)
Some file formats are designed for general use (like MapReduce or Spark), others are designed for more specific use cases (like powering a database), and some are designed with specific data characteristics in mind. So there really is quite a lot of choice.
The generic characteristics by which formats are classified are: expressive, simple, binary, compressed, and integrity-preserving, to name a few. Formats are typically text-based, serial (row-based), or columnar.
Since Protocol Buffers and Thrift are serializable but not splittable, they are not very popular for HDFS use cases, and thus Avro becomes the first choice.
Simple text-based files are common in the non-Hadoop world, and they're super common in the Hadoop world too. Data is laid out in lines, with each line being a record. Lines are terminated by a newline character \n in the typical UNIX fashion. Text files are inherently splittable (just split on \n characters!), but if you want to compress them you'll have to use a file-level compression codec that supports splitting, such as BZIP2. Because these files are just text files, you can encode anything you like in a line of the file. One common example is to make each line a JSON document to add some structure. While this can waste space with needless column headers, it is a simple way to start using structured data in HDFS.
TextInputFormat is an input format for plain text files. Files are broken into lines; either a linefeed or a carriage return is used to signal the end of a line. Keys are the positions in the file, and values are the lines of text. Advantages: lightweight. Disadvantages: slow to read and write; compressed files cannot be split (which leads to huge maps).
Sequence files were originally designed for MapReduce, so the integration is smooth. They
encode a key and a value for each record and nothing more. Records are stored in a binary format
that is smaller than a text-based format would be. Like text files, the format does not encode the
structure of the keys and values, so if you make schema migrations they must be additive.
Typically if you need to store complex data in a sequence file you do so in the value part while
encoding the id in the key. The problem with this is that if you add or change fields in your
Writable class it will not be backwards compatible with the data stored in the sequence file. One
benefit of sequence files is that they support block-level compression, so you can compress the
contents of the file while also maintaining the ability to split the file into segments for multiple
map tasks.
In addition to text files, Hadoop also provides support for binary files. Among these binary file formats, Hadoop Sequence Files are a Hadoop-specific file format that stores serialized key/value pairs. Advantages: compact compared to text files; optional compression support; parallel processing; a container for huge numbers of small files. Disadvantages: not good for Hive; append-only like the other data formats; multi-language support not yet provided. One key benefit of sequence files is that they support block-level compression, so you can compress the contents of the file while also maintaining the ability to split the file into segments for multiple map tasks. Sequence files are well supported across Hadoop and many other HDFS-enabled projects, and I think they represent the easiest next step away from text files (a write sketch follows).
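Below is a hedged sketch of writing a block-compressed sequence file with the Hadoop API; the output path and records are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

public class SequenceFileDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("pairs.seq");   // example output file

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class),
                // block-level compression keeps the file splittable for multiple map tasks
                SequenceFile.Writer.compression(CompressionType.BLOCK))) {
            writer.append(new Text("alpha"), new IntWritable(1));
            writer.append(new Text("beta"), new IntWritable(2));
        }
    }
}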
RCFILE stands for Record Columnar File, another type of binary file format, which offers a high compression rate on top of the rows and is used when we want to perform operations on multiple rows at a time. RCFILEs are flat files consisting of binary key/value pairs, which share much similarity with sequence files. RCFILE stores the columns of a table as records, in a columnar manner. It first partitions rows horizontally into row splits and then vertically partitions each row split in a columnar way. RCFILE stores the metadata of a row split as the key part of a record, and all the data of a row split as the value part. This means that RCFILE encourages column-oriented storage rather than row-oriented storage. Column-oriented storage is very useful when performing analytics; it is easier to perform analytics when we have a column-oriented storage type. We cannot load data into an RCFILE directly: first we need to load the data into another table, and then we need to overwrite it into our newly created RCFILE table.
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats. ORC reduces the size of the original data by up to 75%. As a result, the speed of data processing also increases, and ORC shows better performance than the Text, Sequence, and RC file formats. An ORC file contains row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is reading, writing, and processing data. We cannot load data into an ORC file directly: first we need to load the data into another table, and then we need to overwrite it into our newly created ORC table.
RC and ORC show better performance than the Text and Sequence file formats. Comparing RC and ORC, ORC is always better, as it takes less time to access the data and less space to store it. However, the ORC file increases CPU overhead by increasing the time it takes to decompress the relational data. The ORC file format was introduced with Hive 0.11 and cannot be used with previous versions (a write sketch follows).
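Below is a hedged sketch of writing an ORC file with the orc-core Java API (the struct schema and the values are made up; from Hive, one would normally just declare a table stored as ORC).

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.LongColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class OrcWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        TypeDescription schema = TypeDescription.fromString("struct<id:int,name:string>");

        Writer writer = OrcFile.createWriter(new Path("people.orc"),
                OrcFile.writerOptions(conf).setSchema(schema));

        VectorizedRowBatch batch = schema.createRowBatch();
        LongColumnVector id = (LongColumnVector) batch.cols[0];
        BytesColumnVector name = (BytesColumnVector) batch.cols[1];

        int row = batch.size++;                      // add one row to the batch
        id.vector[row] = 1;
        name.setVal(row, "alice".getBytes(StandardCharsets.UTF_8));

        writer.addRowBatch(batch);                   // rows are written out in stripes
        writer.close();
    }
}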
AVRO Format
Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting,
the father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes
quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a
preferred tool for serializing data in Hadoop. Avro is an opinionated format which understands that data stored in HDFS is usually not a simple key/value combination like int/string. The format encodes the schema of its contents directly in the file, which allows you to store complex objects natively. Honestly, Avro is not just a file format; it is a file format plus a serialization and de-serialization framework. With plain old sequence files you can store complex objects, but you have to manage the process yourself. Avro handles this complexity while providing other tools to help manage data over time. It is a well-thought-out format which defines file data schemas in JSON (for interoperability), allows for schema evolution (remove a column, add a column), and supports multiple serialization/deserialization use cases. It also supports block-level compression. For most Hadoop-based use cases, Avro is a really good choice.
Avro depends heavily on its schema. It allows data to be written and later read with no prior knowledge of the schema on the reader's side, because the schema is stored along with the Avro data in the file for any further processing. It serializes fast, and the resulting serialized data is smaller in size. In RPC, the client and the server exchange schemas during the connection; this exchange helps in the communication between same-named fields, missing fields, extra fields, etc. Avro schemas are defined in JSON, which simplifies implementation in languages with JSON libraries. Like Avro, there are other serialization mechanisms in Hadoop, such as Sequence Files, Protocol Buffers, and Thrift.
Thrift and Protocol Buffers are the libraries that compete most closely with Avro. Avro differs from these frameworks in the following ways −
Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types; these IDLs are used to generate code for serialization and deserialization.
Avro is built into the Hadoop ecosystem, while Thrift and Protocol Buffers are not. Unlike Thrift and Protocol Buffers, Avro's schema definition is in JSON and not in any proprietary IDL (a short write sketch follows).
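Below is a hedged sketch of defining an Avro schema in JSON and writing a record with the Avro Java library; the User record is purely an example.

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteDemo {
    public static void main(String[] args) throws Exception {
        // The schema is plain JSON; it travels with the data inside the file.
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"int\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1);
        user.put("name", "Alice");

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("users.avro"));   // schema is embedded in the file header
            writer.append(user);                             // records are stored in compressible blocks
        }
    }
}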
Parquet Format
The latest hotness in file formats for Hadoop is columnar file storage, and the Parquet file format is a columnar format. Basically this means that instead of just storing rows of data adjacent to one another, you also store column values adjacent to each other, so datasets are partitioned both horizontally and vertically. This is particularly useful if your data processing framework just needs access to a subset of the data stored on disk, as it can access all values of a single column very quickly without reading whole records. Just like the ORC file, Parquet is great for compression and offers great query performance, and it is especially efficient when querying data from specific columns. The Parquet format is computationally intensive on the write side, but it reduces a lot of I/O cost to give great read performance. It enjoys more freedom than the ORC file in schema evolution, in that it can add new columns at the end of the structure. If you are chopping and cutting up datasets regularly then
columns to the end of the structure. If you’re chopping and cutting up datasets regularly then
these formats can be very beneficial to the speed of your application, but frankly if you have an
application that usually needs entire rows of data then the columnar formats may actually be a
detriment to performance due to the increased network activity required. One huge benefit of
columnar oriented file formats is that data in the same column tends to be compressed together
which can yield some massive storage optimizations (as data in the same column tends to be
similar). It supports both File-Level Compression and Block-Level Compression. File-level
compression means you compress entire files regardless of the file format, the same way you
would compress a file in Linux. Some of these formats are splittable (e.g. bzip2, or LZO if
indexed). Block-level compression is internal to the file format, so individual blocks of data
within the file
are compressed. This means that the file remains splittable even if you use a non-splittable
compression codec like Snappy. However, this is only an option if the specific file format
supports it.
Summary
Overall these formats can drastically optimize workloads, especially for Hive and Spark, which tend to read just segments of records rather than the whole thing (which is more common in MapReduce). Since Avro and Parquet have so much in common, when choosing
a file format to use with HDFS, we need to consider read performance and write performance.
Because the nature of HDFS is to store data that is written once and read multiple times, we want to emphasize read performance. The fundamental difference in how to use either format is this: Avro is a row-based format; if you want to retrieve the data as a whole, you can use Avro. Parquet is a column-based format; if your data consists of a lot of columns but you are interested in only a subset of them, you can use Parquet. Hopefully by now you've learned a little
about what file formats actually are and why you would think of choosing a specific one. We’ve
discussed the main characteristics of common file formats and talked a little about compression.
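To close, here is a hedged sketch of writing a Parquet file through the parquet-avro bridge; it assumes the parquet-avro dependency is on the classpath, reuses the example User schema from the Avro sketch, and uses the classic builder(Path) form of the API.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteDemo {
    public static void main(String[] args) throws Exception {
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"int\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord user = new GenericData.Record(schema);
        user.put("id", 1);
        user.put("name", "Alice");

        // Column chunks are compressed independently, which is where the storage savings come from.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("users.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            writer.write(user);
        }
    }
}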