Unit 3 BDA
INTRODUCTION:
Hadoop is an open-source software framework that is used for storing and
processing large amounts of data in a distributed computing environment. It is
designed to handle big data and is based on the MapReduce programming model,
which allows for the parallel processing of large datasets.
The Hadoop ecosystem consists of several key components that make it well-suited for big data
processing:
Note: Apart from the components described below, many other components are also part of the
Hadoop ecosystem. All of these toolkits revolve around one thing, i.e. data; that is the beauty of
Hadoop: everything is built around the data, which makes processing and analyzing it easier.
YARN:
Yet Another Resource Negotiator (YARN), as the name implies, helps to manage the resources
across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager allocates resources to the applications running in the system, whereas the
Node Managers manage the resources (such as CPU, memory, and bandwidth) on each machine and
report back to the Resource Manager. The Application Manager acts as an interface between the
Resource Manager and the Node Managers, negotiating resources as the two require.
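As a rough illustration of this division of labour, the minimal Java sketch below uses Hadoop's YARN client API to ask the Resource Manager for a report of the running Node Managers and the capacity (memory, vcores) each one offers; it assumes a configured YARN cluster and the Hadoop client libraries on the classpath.

import java.util.List;

import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
  public static void main(String[] args) throws Exception {
    // Connect to the Resource Manager configured in yarn-site.xml.
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new YarnConfiguration());
    yarn.start();

    // The Resource Manager tracks every Node Manager and the resources
    // (CPU, memory) that each machine contributes to the cluster.
    List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId() + "  capacity=" + node.getCapability());
    }

    yarn.stop();
  }
}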
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic
over to the data and helps developers write applications that transform huge data sets into
manageable ones.
MapReduce uses two functions, Map() and Reduce(), whose tasks are:
1. Map() performs filtering and sorting of the data, organizing it into groups. Map
generates key-value pairs as its result, which are later processed by the
Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating the
mapped data. In short, Reduce() takes the output generated by Map() as input
and combines those tuples into a smaller set of tuples.
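The classic word-count job makes this concrete. The sketch below is written against Hadoop's Java MapReduce API: the Map() step emits a (word, 1) pair for every word it sees, and the Reduce() step sums the pairs for each word; the input and output paths are placeholders supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map(): the filtering/sorting step -- emits a (word, 1) pair for every word in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce(): the summarization step -- aggregates the mapped pairs into one count per word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}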
PIG:
Pig was originally developed by Yahoo. It works with the Pig Latin language, a query-based
language similar to SQL.
It is a platform for structuring the data flow, and for processing and analyzing huge data sets.
Pig executes the commands, and in the background all the MapReduce activity is taken care of.
After processing, Pig stores the result in HDFS.
The Pig Latin language is specially designed for this framework and runs on the Pig Runtime,
just the way Java runs on the JVM.
Pig offers ease of programming and optimization, and is hence a major segment of the Hadoop
ecosystem.
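As a rough sketch of how Pig Latin statements are executed, Pig can also be driven from Java through its PigServer class, which compiles the statements into MapReduce jobs behind the scenes; the file names access_log.txt and user_counts below are hypothetical.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEmbeddedExample {
  public static void main(String[] args) throws Exception {
    // Local mode for the sketch; on a cluster you would use ExecType.MAPREDUCE.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Pig Latin statements: load, group, and count. Pig translates these into
    // MapReduce jobs behind the scenes when running on a cluster.
    pig.registerQuery("logs = LOAD 'access_log.txt' AS (user:chararray, url:chararray);");
    pig.registerQuery("by_user = GROUP logs BY user;");
    pig.registerQuery("counts = FOREACH by_user GENERATE group, COUNT(logs);");

    // Store the result; on a cluster the output lands in HDFS.
    pig.store("counts", "user_counts");
  }
}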
HIVE:
With the help of an SQL-like methodology and interface, Hive performs reading and writing of
large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time and batch processing. All the SQL data types
are supported by Hive, which makes query processing easier.
Like other query-processing frameworks, Hive comes with two components: the JDBC drivers
and the Hive command line.
The JDBC and ODBC drivers establish the connection and the data-storage permissions,
whereas the Hive command line is used for processing queries.
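To illustrate the JDBC route, the minimal Java sketch below connects to a HiveServer2 instance and runs an HQL query; the host, port, credentials, and the web_logs table are assumptions made for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
  public static void main(String[] args) throws Exception {
    // The HiveServer2 JDBC driver ships with Hive (hive-jdbc).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Host, port, database, and credentials below are placeholders for your cluster.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hiveuser", "");
         Statement stmt = conn.createStatement()) {

      // HQL reads much like SQL; web_logs is a hypothetical table.
      ResultSet rs = stmt.executeQuery(
          "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page");
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}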
Data Movement:
Data movement is one of those things that you aren’t likely to think
too much about until you’re fully committed to using Hadoop on a
project, at which point it becomes this big scary unknown that has to
be tackled. How do you get your log data sitting across thousands of
hosts into Hadoop? What’s the most efficient way to get your data out
of your relational and No/NewSQL systems and into Hadoop? How
do you get Lucene indexes generated in Hadoop out to your servers?
And how can these processes be automated?
Once the low-level tooling is out of the way, we’ll survey higher-level
tools that have simplified the process of ferrying data into Hadoop.
We’ll look at how you can automate the movement of log files with
Flume, and how Sqoop can be used to move relational data. So as not
to ignore some of the emerging data systems, you’ll also be
introduced to methods that can be employed to move data from
HBase and Kafka into Hadoop.
We’ll cover a lot of ground in this chapter, and it’s likely that you’ll
have specific types of data you need to work with. If this is the case,
feel free to jump directly to the section that provides the details you
need.
Let’s start things off with a look at key ingress and egress system
considerations.
Idempotence
Aggregation
Compression
In addition, it’s possible that there are problems in the source data
itself due to bugs in the software generating the data. Performing
data-quality checks at ingress time allows you to do a one-time check,
instead of requiring every downstream consumer of the data to be
updated to handle errors in the data.
Resource consumption and performance
Monitoring
Speculative execution
On to the techniques. Let’s start with how you can leverage Hadoop’s
built-in ingress mechanisms.
5.2. Moving data into Hadoop
In this section we’ll look at methods for moving source data into
Hadoop. I’ll use the design considerations in the previous section as
the criteria for examining and understanding the different tools.
We’ll get things started with a look at some low-level methods you
can use to move data into Hadoop.
The low-level tools in this section work well for one-off file
movement activities, or when working with legacy data sources and
destinations that are file-based. But moving data in this way is quickly
being made obsolete by the availability of tools such as Flume and
Kafka (covered later in this chapter), which offer automated data
movement pipelines.
Problem
Solution
Discussion
Copying a file from local disk to HDFS is done with the hadoop fs -put
command. The behavior of -put differs from that of the Linux cp
command: in Linux, if the destination already exists it is overwritten,
whereas in Hadoop the copy fails with an error:
put: `hdfs-file.txt': File exists
Much like with the Linux cp command, multiple files can be copied in a
single invocation. In this case, the final argument must be the HDFS
directory into which the local files are copied.
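For completeness, roughly the same copies can be performed programmatically through Hadoop's Java FileSystem API, as in the sketch below; the file and directory names are placeholders, and note that, unlike -put, copyFromLocalFile overwrites an existing destination by default.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutExample {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from the core-site.xml found on the classpath.
    FileSystem fs = FileSystem.get(new Configuration());

    // Copy a single local file into HDFS (similar to hadoop fs -put).
    fs.copyFromLocalFile(new Path("local-file.txt"), new Path("hdfs-file.txt"));

    // Copy multiple local files; the destination must be an HDFS directory.
    fs.copyFromLocalFile(false, true,
        new Path[] { new Path("a.txt"), new Path("b.txt") },
        new Path("/data/incoming"));

    fs.close();
  }
}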