
UNIT-3

INTRODUCTION:
Hadoop is an open-source software framework that is used for storing and
processing large amounts of data in a distributed computing environment. It is
designed to handle big data and is based on the MapReduce programming model,
which allows for the parallel processing of large datasets.

Hadoop has two main components:


 HDFS (Hadoop Distributed File System): This is the storage component of
Hadoop, which allows for the storage of large amounts of data across
multiple machines. It is designed to work with commodity hardware,
which makes it cost-effective.
 YARN (Yet Another Resource Negotiator): This is the resource
management component of Hadoop, which manages the allocation of
resources (such as CPU and memory) for processing the data stored in
HDFS.
 Hadoop also includes several other modules that provide additional functionality, such as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-relational, distributed database).
 Hadoop is commonly used in big data scenarios such as data warehousing,
business intelligence, and machine learning. It’s also used for data
processing, data analysis, and data mining.
What is Hadoop?
Hadoop is an open-source software programming framework for storing large amounts of data and performing computation on it. The framework is written primarily in Java, with some native code in C and shell scripts.

Hadoop has several key features that make it well-suited for big data
processing:

 Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of
machines, making it easy to add more capacity as needed.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning
it can continue to operate even in the presence of hardware failures.
 Data locality: Hadoop provides a data locality feature, where processing is performed on the same node where the data is stored. This helps to reduce network traffic and improve performance.
 High Availability: Hadoop provides a high availability feature, which helps to ensure that the data is always available and is not lost.
 Flexible Data Processing: Hadoop’s MapReduce programming model
allows for the processing of data in a distributed fashion, making it easy to
implement a wide variety of data processing tasks.
 Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that the stored data is consistent and correct.
 Data Replication: Hadoop provides a data replication feature, which replicates data across the cluster for fault tolerance.
 Data Compression: Hadoop provides built-in data compression, which helps to reduce storage space and improve performance.
 YARN: A resource management platform that allows multiple data processing engines, such as real-time streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.

Overview: Apache Hadoop is an open-source framework intended to make interaction with big data easier. For those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets that cannot be processed efficiently with traditional approaches such as an RDBMS. Hadoop has made its place in industries and companies that need to work on large, sensitive data sets that need efficient handling. Hadoop is a framework that enables processing of large data sets that reside across clusters of machines. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform, or a suite of tools, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming-based data processing
 Spark: In-memory data processing
 PIG, HIVE: Query-based processing of data services
 HBase: NoSQL database
 Mahout, Spark MLlib: Machine learning algorithm libraries
 Solr, Lucene: Searching and indexing
 ZooKeeper: Cluster management
 Oozie: Job scheduling

Note: Apart from the above-mentioned components, there are many other components that are part of the Hadoop ecosystem.
All these toolkits and components revolve around one thing: data. That is the beauty of Hadoop: everything is built around data, which makes processing and analyzing it easier.
HDFS:

 HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, while maintaining the metadata in the form of log files.
 HDFS consists of two core components:
1. Name Node
2. Data Node
 The Name Node is the prime node that holds the metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which store the actual data (a small sketch after this list shows how a client can query the Name Node for a file's block locations). The Data Nodes run on commodity hardware in the distributed environment, which is what makes Hadoop cost-effective.
 HDFS coordinates the cluster's nodes and hardware, thus working at the heart of the system.
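
The following is a minimal Java sketch (not part of the original text) showing how a client can ask the Name Node for a file's metadata and block locations through the HDFS FileSystem API. The file path is hypothetical, and the program assumes a reachable cluster configured via core-site.xml and hdfs-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);                 // e.g. /data/sample.txt (hypothetical path)
    FileStatus status = fs.getFileStatus(file);    // metadata request, answered by the Name Node
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // getHosts() lists the Data Nodes holding a replica of this block
      System.out.println("offset=" + block.getOffset()
          + " hosts=" + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}

Only the metadata lookup touches the Name Node; the actual file bytes would be read directly from the Data Nodes.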
YARN:
Yet Another Resource Negotiator, as the name implies, is the component that helps to manage resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
 It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Master
 The Resource Manager has the privilege of allocating resources to the applications in the system, whereas Node Managers manage the resources, such as CPU, memory, and bandwidth, on each machine and report back to the Resource Manager. The Application Master works as an interface between the Resource Manager and the Node Managers and negotiates resources as required by the two.
MapReduce:

 By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic to the data and helps developers write applications that transform big data sets into manageable ones.
 MapReduce makes use of two functions, Map() and Reduce(), whose tasks are as follows (a minimal Java sketch of both appears after this list):
1. Map() performs sorting and filtering of the data and thereby organizes it into groups. Map() generates key-value pairs as its result, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
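
To make the Map()/Reduce() division concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API. The class and variable names are illustrative, and the driver code (job configuration, input and output paths) is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map(): emits a (word, 1) key-value pair for every word in the input line
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce(): aggregates all counts emitted for the same word into a single total
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}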
PIG:
Pig was developed by Yahoo. It uses Pig Latin, a query-based language similar to SQL.
 It is a platform for structuring data flows and for processing and analyzing huge data sets.
 Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.
HIVE:

 With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
 It is highly scalable, as it supports both interactive and batch query processing. Also, the common SQL data types are supported by Hive, making query processing easier.
 Like other query-processing frameworks, Hive comes with two components: JDBC drivers and the Hive command line.
 JDBC and ODBC drivers establish the connection and data-storage permissions, whereas the Hive command line helps in the processing of queries. A hedged JDBC sketch appears below.
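
As an illustration of the JDBC route mentioned above, the sketch below connects to a HiveServer2 instance and runs an HQL query. The host, port, credentials, and the employees table are assumptions; adjust them for your cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");      // HiveServer2 JDBC driver
    String url = "jdbc:hive2://localhost:10000/default";   // assumed host, port, and database
    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement();
         // "employees" is a hypothetical table used only for illustration
         ResultSet rs = stmt.executeQuery(
             "SELECT department, COUNT(*) FROM employees GROUP BY department")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}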

Moving data into and out of Hadoop


· Understanding key design considerations for data ingress and egress tools

· Low-level methods for moving data into and out of Hadoop


· Techniques for moving log files and relational and NoSQL data, as
well as data in Kafka, in and out of HDFS

Data movement is one of those things that you aren’t likely to think
too much about until you’re fully committed to using Hadoop on a
project, at which point it becomes this big scary unknown that has to
be tackled. How do you get your log data sitting across thousands of
hosts into Hadoop? What’s the most efficient way to get your data out
of your relational and No/NewSQL systems and into Hadoop? How
do you get Lucene indexes generated in Hadoop out to your servers?
And how can these processes be automated?

Welcome to chapter 5, where the goal is to answer these questions and set you on your path to worry-free data movement. In this chapter you’ll first see how data across a broad spectrum of locations and formats can be moved into Hadoop, and then you’ll see how data can be moved out of Hadoop.

This chapter starts by highlighting key data-movement properties, so that as you go through the rest of this chapter you can evaluate the fit of the various tools. It goes on to look at low-level and high-level tools that can be used to move your data. We’ll start with some simple techniques, such as using the command line and Java for ingress,[1] but we’ll quickly move on to more advanced techniques like using NFS and DistCp.
1
Ingress and egress refer to data movement into and out of a system,
respectively.

Once the low-level tooling is out of the way, we’ll survey higher-level
tools that have simplified the process of ferrying data into Hadoop.
We’ll look at how you can automate the movement of log files with
Flume, and how Sqoop can be used to move relational data. So as not
to ignore some of the emerging data systems, you’ll also be
introduced to methods that can be employed to move data from
HBase and Kafka into Hadoop.

We’ll cover a lot of ground in this chapter, and it’s likely that you’ll
have specific types of data you need to work with. If this is the case,
feel free to jump directly to the section that provides the details you
need.

Let’s start things off with a look at key ingress and egress system
considerations.

5.1. Key elements of data movement

Moving large quantities of data in and out of Hadoop offers logistical challenges that include consistency guarantees and resource impacts on data sources and destinations. Before we dive into the techniques, however, we need to discuss the design elements you should be aware of when working with data movement.

Idempotence

An idempotent operation produces the same result no matter how many times it’s executed. In a relational database, inserts typically aren’t idempotent, because executing them multiple times doesn’t produce the same resulting database state. In contrast, updates often are idempotent, because they’ll produce the same end result.

Any time data is being written, idempotence should be a consideration, and data ingress and egress in Hadoop are no different. How well do distributed log collection frameworks deal with data retransmissions? How do you ensure idempotent behavior in a MapReduce job where multiple tasks are inserting into a database in parallel? We’ll examine and answer these questions in this chapter.
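
As one possible illustration (not from the original text), the JDBC fragment below turns a plain insert into an idempotent "upsert" keyed on a primary key, so that a retried or duplicated task leaves the database in the same state. The connection URL, table, columns, and the ON DUPLICATE KEY UPDATE syntax (MySQL-specific) are all assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IdempotentWrite {
  public static void main(String[] args) throws Exception {
    // Hypothetical database and schema; the point is the write pattern, not the target
    String jdbcUrl = "jdbc:mysql://dbhost:3306/analytics";
    try (Connection con = DriverManager.getConnection(jdbcUrl, "user", "secret");
         PreparedStatement ps = con.prepareStatement(
             "INSERT INTO events (event_id, payload) VALUES (?, ?) "
             + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)")) {  // MySQL upsert syntax
      ps.setLong(1, 42L);                  // deterministic key derived from the source record
      ps.setString(2, "example payload");
      ps.executeUpdate();                  // running this twice yields the same final row
    }
  }
}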

Aggregation

The data aggregation process combines multiple data elements. In the context of data ingress, this can be useful because moving large quantities of small files into HDFS potentially translates into NameNode memory woes, as well as slow MapReduce execution times. Having the ability to aggregate files or data together mitigates this problem and is a feature to consider.

Data format transformation


The data format transformation process converts one data format into
another. Often your source data isn’t in a format that’s ideal for
processing in tools such as MapReduce. If your source data is in
multiline XML or JSON form, for example, you may want to consider
a preprocessing step. This would convert the data into a form that can
be split, such as one JSON or XML element per line, or convert it into
a format such as Avro. Chapter 3 contains more details on these data
formats.
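
As a sketch of such a preprocessing step (assuming the Jackson library and an input file holding a multiline JSON array of records; neither is prescribed by the text), the program below rewrites each record as one compact JSON object per line so the output can be split cleanly.

import java.io.BufferedWriter;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToLines {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode records = mapper.readTree(new File(args[0]));   // multiline JSON array (assumed)
    try (BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
      for (JsonNode record : records) {
        out.write(mapper.writeValueAsString(record));        // one compact object per line
        out.newLine();
      }
    }
  }
}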

Compression

Compression not only helps by reducing the footprint of data at rest, but also has I/O advantages when reading and writing data.

Availability and recoverability

Recoverability allows an ingress or egress tool to retry in the event of a failed operation. Because it’s unlikely that any data source, sink, or Hadoop itself can be 100% available, it’s important that an ingress or egress action be retried in the event of failure.

Reliable data transfer and data validation

In the context of data transportation, checking for correctness is how you verify that no data corruption occurred while the data was in transit. When you work with heterogeneous systems such as Hadoop data ingress and egress, the fact that data is being transported across different hosts, networks, and protocols only increases the potential for problems during data transfer. A common method for checking the correctness of raw data, such as that on storage devices, is the Cyclic Redundancy Check (CRC), which is what HDFS uses internally to maintain block-level integrity.

In addition, it’s possible that there are problems in the source data
itself due to bugs in the software generating the data. Performing
these checks at ingress time allows you to do a one-time check,
instead of dealing with all the downstream consumers of the data that
would have to be updated to handle errors in the data.
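
The fragment below is only an illustration of the checksum idea using java.util.zip.CRC32 from the Java standard library; it is not HDFS’s internal mechanism. Computing a checksum before and after a transfer and comparing the two values is one way to detect corruption in transit.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

public class FileCrc32 {
  public static void main(String[] args) throws Exception {
    CRC32 crc = new CRC32();
    try (CheckedInputStream in = new CheckedInputStream(
             Files.newInputStream(Paths.get(args[0])), crc)) {
      byte[] buffer = new byte[8192];
      while (in.read(buffer) != -1) {
        // streaming through the file updates the running CRC32 value
      }
    }
    System.out.println("CRC32(" + args[0] + ") = " + crc.getValue());
  }
}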
Resource consumption and performance

Resource consumption and performance are measures of system resource utilization and system efficiency, respectively. Ingress and egress tools don’t typically impose significant load (resource consumption) on a system, unless you have appreciable data volumes. For performance, the questions to ask include whether the tool performs ingress and egress activities in parallel, and if so, what mechanisms it provides to tune the amount of parallelism. For example, if your data source is a production database and you’re using MapReduce to ingest that data, don’t use a large number of concurrent map tasks to import data.

Monitoring

Monitoring ensures that functions are performing as expected in automated systems. For data ingress and egress, monitoring breaks down into two elements: ensuring that the processes involved in ingress and egress are alive, and validating that source and destination data are being produced as expected. Monitoring should also include verifying that the data volumes being moved are at expected levels; unexpected drops or spikes in your data will alert you to potential system issues or bugs in your software.

Speculative execution

MapReduce has a feature called speculative execution that launches duplicate tasks near the end of a job for tasks that are still executing. This helps prevent slow hardware from impacting job execution times. But if you’re using a map task to perform inserts into a relational database, for example, you should be aware that you could have two parallel processes inserting the same data.[2]
2
Map- and reduce-side speculative execution can be disabled via the
mapreduce.map.speculative and mapreduce.reduce.speculative
configurables in Hadoop 2.
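
For example, a job that writes to a relational database could disable speculation during job setup using exactly those properties. The sketch below is illustrative and omits the rest of the driver (input and output formats, paths, mapper and reducer classes).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NonSpeculativeJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Prevent duplicate map/reduce attempts from inserting the same rows twice
    conf.setBoolean("mapreduce.map.speculative", false);
    conf.setBoolean("mapreduce.reduce.speculative", false);
    Job job = Job.getInstance(conf, "export-to-db");   // job name is illustrative
    System.out.println("map speculation enabled: "
        + job.getConfiguration().getBoolean("mapreduce.map.speculative", true));
  }
}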

On to the techniques. Let’s start with how you can leverage Hadoop’s
built-in ingress mechanisms.
5.2. Moving data into Hadoop

The first step in working with data in Hadoop is to make it available to Hadoop. There are two primary methods that can be used to move data into Hadoop: writing external data at the HDFS level (a data push), or reading external data at the MapReduce level (more like a pull). Reading data in MapReduce has advantages in the ease with which the operation can be parallelized and made fault tolerant. Not all data is accessible from MapReduce, however, such as in the case of log files, which is where other systems need to be relied on for transportation, including HDFS for the final data hop.

In this section we’ll look at methods for moving source data into
Hadoop. I’ll use the design considerations in the previous section as
the criteria for examining and understanding the different tools.

We’ll get things started with a look at some low-level methods you
can use to move data into Hadoop.

5.2.1. Roll your own ingest

Hadoop comes bundled with a number of methods to get your data into HDFS. This section will examine various ways that these built-in tools can be used for your data movement needs. The first and potentially easiest tool you can use is the HDFS command line.

Picking the right ingest tool for the job

The low-level tools in this section work well for one-off file
movement activities, or when working with legacy data sources and
destinations that are file-based. But moving data in this way is quickly being made obsolete by the availability of tools such as Flume and Kafka (covered later in this chapter), which offer automated data-movement pipelines.

Kafka is a much better platform for getting data from A to B (and B can be a Hadoop cluster) than the old-school “let’s copy files around!” approach. With Kafka, you only need to pump your data into Kafka, and you have the ability to consume the data in real time (such as via Storm) or in offline/batch jobs (such as via Camus).

File-based ingestion flows are, to me at least, a relic of the past (because everybody knows how scp works :-P), and they primarily exist for legacy reasons: the upstream data sources may have existing tools to create file snapshots (such as dump tools for the database), and there’s no infrastructure to migrate or move the data into a real-time messaging system such as Kafka.
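
As a small, hedged sketch of the “pump your data into Kafka” idea (the broker address, topic name, and message contents are all assumptions, not from the text), a producer can look like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogLineProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // each log line becomes one message on the (illustrative) "weblogs" topic
      producer.send(new ProducerRecord<>("weblogs", "host-01", "GET /index.html 200"));
    }
  }
}

The same topic can then be consumed in real time or drained into HDFS by a batch consumer, which is the decoupling the paragraph above describes.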

Technique 33 Using the CLI to load files

If you have a manual activity that you need to perform, such as moving the examples bundled with this book into HDFS, then the HDFS command-line interface (CLI) is the tool for you. It’ll allow you to perform most of the operations that you’re used to performing on a regular Linux filesystem. In this section we’ll focus on copying data from a local filesystem into HDFS.

Problem

You want to copy files into HDFS using the shell.

Solution

The HDFS command-line interface can be used for one-off moves, or it can be incorporated into scripts for a series of moves.

Discussion

Copying a file from local disk to HDFS is done with the hadoop
command:

$ hadoop fs -put local-file.txt hdfs-file.txt

The behavior of the Hadoop -put command differs from the Linux cp
command—in Linux if the destination already exists, it is overwritten;
in Hadoop the copy fails with an error:
put: `hdfs-file.txt': File exists

The -f option must be added to force the file to be overwritten:

$ hadoop fs -put -f local-file.txt hdfs-file.txt

Much like with the Linux cp command, multiple files can be copied
using the same command. In this case, the final argument must be the
directory in HDFS into which the local files are copied:

$ hadoop fs -put local-file1.txt local-file2.txt /hdfs/dest/
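
Because this chapter also mentions using Java for ingress, here is a minimal FileSystem-API equivalent of the -put command shown above. It is a sketch: it assumes the Hadoop client configuration is on the classpath, and the file names simply mirror the CLI example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads fs.defaultFS from core-site.xml
    FileSystem fs = FileSystem.get(conf);
    // Equivalent of "hadoop fs -put -f local-file.txt hdfs-file.txt":
    // delSrc=false keeps the local copy, overwrite=true behaves like the -f option
    fs.copyFromLocalFile(false, true, new Path("local-file.txt"), new Path("hdfs-file.txt"));
    fs.close();
  }
}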
