
UNIT-3

INTRODUCTION:
Hadoop is an open-source software framework that is used for storing and
processing large amounts of data in a distributed computing environment. It is
designed to handle big data and is based on the MapReduce programming model,
which allows for the parallel processing of large datasets.

Hadoop has two main components:


 HDFS (Hadoop Distributed File System): This is the storage component of
Hadoop, which allows for the storage of large amounts of data across
multiple machines. It is designed to work with commodity hardware,
which makes it cost-effective.
 YARN (Yet Another Resource Negotiator): This is the resource
management component of Hadoop, which manages the allocation of
resources (such as CPU and memory) for processing the data stored in
HDFS.
 Hadoop also includes several other modules that provide additional functionality, such as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-relational, distributed database).
 Hadoop is commonly used in big data scenarios such as data warehousing,
business intelligence, and machine learning. It’s also used for data
processing, data analysis, and data mining.
What is Hadoop?
Hadoop is an open-source software programming framework for storing large amounts of data and performing computation on it. The framework is written primarily in Java, with some native code in C and shell scripts.

Hadoop has several key features that make it well-suited for big data
processing:

 Distributed Storage: Hadoop stores large data sets across multiple machines, allowing for the storage and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of
machines, making it easy to add more capacity as needed.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning
it can continue to operate even in the presence of hardware failures.
 Data locality: Hadoop provides a data locality feature, where processing is performed on the same node where the data is stored. This helps to reduce network traffic and improve performance.
 High Availability: Hadoop provides a high availability feature, which helps to ensure that the data is always available and is not lost.
 Flexible Data Processing: Hadoop’s MapReduce programming model
allows for the processing of data in a distributed fashion, making it easy to
implement a wide variety of data processing tasks.
 Data Integrity: Hadoop provides a built-in checksum feature, which helps to ensure that the stored data is consistent and correct.
 Data Replication: Hadoop provides a data replication feature, which replicates data across the cluster for fault tolerance.
 Data Compression: Hadoop provides built-in data compression, which helps to reduce storage space and improve performance.
 YARN: A resource management platform that allows multiple data processing engines, such as real-time streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.

Overview: Apache Hadoop is an open-source framework intended to make interaction with big data easier. For those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets that cannot be processed efficiently with traditional approaches such as an RDBMS. Hadoop has made its place in industries and companies that need to work on large, sensitive data sets that need efficient handling. Hadoop is a framework that enables processing of large data sets that reside across clusters of machines. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: The Hadoop ecosystem is a platform, or a suite of tools, that provides various services to solve big data problems. It includes Apache projects as well as various commercial tools and solutions. There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common. Most of the other tools or solutions are used to supplement or support these major elements. All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming-based data processing
 Spark: In-memory data processing
 PIG, HIVE: Query-based processing of data services
 HBase: NoSQL database
 Mahout, Spark MLlib: Machine learning algorithm libraries
 Solr, Lucene: Searching and indexing
 ZooKeeper: Cluster management
 Oozie: Job scheduling

Note: Apart from the above-mentioned components, there are many other components that are part of the Hadoop ecosystem.
All these toolkits and components revolve around one thing: data. That is the beauty of Hadoop: everything is built around data, which makes processing and analyzing it easier.
HDFS:

 HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, while maintaining the metadata in the form of log files.
 HDFS consists of two core components:
1. Name Node
2. Data Node
 The Name Node is the prime node that holds the metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which store the actual data (a small sketch after this list shows how a client can query the Name Node for a file's block locations). The Data Nodes run on commodity hardware in the distributed environment, which is what makes Hadoop cost-effective.
 HDFS coordinates the cluster's nodes and hardware, thus working at the heart of the system.
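
The following is a minimal Java sketch (not part of the original text) showing how a client can ask the Name Node for a file's metadata and block locations through the HDFS FileSystem API. The file path is hypothetical, and the program assumes a reachable cluster configured via core-site.xml and hdfs-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args[0]);                 // e.g. /data/sample.txt (hypothetical path)
    FileStatus status = fs.getFileStatus(file);    // metadata request, answered by the Name Node
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // getHosts() lists the Data Nodes holding a replica of this block
      System.out.println("offset=" + block.getOffset()
          + " hosts=" + String.join(",", block.getHosts()));
    }
    fs.close();
  }
}

Only the metadata lookup touches the Name Node; the actual file bytes would be read directly from the Data Nodes.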
YARN:
Yet Another Resource Negotiator, as the name implies, is the component that helps to manage resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
 It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Master
 The Resource Manager has the privilege of allocating resources to the applications in the system, whereas Node Managers manage the resources, such as CPU, memory, and bandwidth, on each machine and report back to the Resource Manager. The Application Master works as an interface between the Resource Manager and the Node Managers and negotiates resources as required by the two.
MapReduce:

 By making use of distributed and parallel algorithms, MapReduce makes it possible to carry the processing logic to the data and helps developers write applications that transform big data sets into manageable ones.
 MapReduce makes use of two functions, Map() and Reduce(), whose tasks are as follows (a minimal Java sketch of both appears after this list):
1. Map() performs sorting and filtering of the data and thereby organizes it into groups. Map() generates key-value pairs as its result, which are later processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In short, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
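
To make the Map()/Reduce() division concrete, here is a minimal word-count sketch using Hadoop's Java MapReduce API. The class and variable names are illustrative, and the driver code (job configuration, input and output paths) is omitted.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map(): emits a (word, 1) key-value pair for every word in the input line
  public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce(): aggregates all counts emitted for the same word into a single total
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}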
PIG:
Pig was developed by Yahoo. It uses Pig Latin, a query-based language similar to SQL.
 It is a platform for structuring data flows and for processing and analyzing huge data sets.
 Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on the Pig Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a major segment of the Hadoop ecosystem.
HIVE:

 With the help of an SQL-like methodology and interface, Hive performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
 It is highly scalable, as it supports both interactive and batch query processing. Also, the common SQL data types are supported by Hive, making query processing easier.
 Like other query-processing frameworks, Hive comes with two components: JDBC drivers and the Hive command line.
 JDBC and ODBC drivers establish the connection and data-storage permissions, whereas the Hive command line helps in the processing of queries. A hedged JDBC sketch appears below.
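
As an illustration of the JDBC route mentioned above, the sketch below connects to a HiveServer2 instance and runs an HQL query. The host, port, credentials, and the employees table are assumptions; adjust them for your cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");      // HiveServer2 JDBC driver
    String url = "jdbc:hive2://localhost:10000/default";   // assumed host, port, and database
    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement();
         // "employees" is a hypothetical table used only for illustration
         ResultSet rs = stmt.executeQuery(
             "SELECT department, COUNT(*) FROM employees GROUP BY department")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
      }
    }
  }
}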

Moving data into and out of Hadoop


· Understanding key design considerations for data ingress and egress tools

· Low-level methods for moving data into and out of Hadoop


· Techniques for moving log files and relational and NoSQL data, as
well as data in Kafka, in and out of HDFS

Data movement is one of those things that you aren’t likely to think
too much about until you’re fully committed to using Hadoop on a
project, at which point it becomes this big scary unknown that has to
be tackled. How do you get your log data sitting across thousands of
hosts into Hadoop? What’s the most efficient way to get your data out
of your relational and No/NewSQL systems and into Hadoop? How
do you get Lucene indexes generated in Hadoop out to your servers?
And how can these processes be automated?

Welcome to chapter 5, where the goal is to answer these questions and set you on your path to worry-free data movement. In this chapter you’ll first see how data across a broad spectrum of locations and formats can be moved into Hadoop, and then you’ll see how data can be moved out of Hadoop.

This chapter starts by highlighting key data-movement properties, so that as you go through the rest of this chapter you can evaluate the fit of the various tools. It goes on to look at low-level and high-level tools that can be used to move your data. We’ll start with some simple techniques, such as using the command line and Java for ingress,[1] but we’ll quickly move on to more advanced techniques like using NFS and DistCp.
1
Ingress and egress refer to data movement into and out of a system,
respectively.

Once the low-level tooling is out of the way, we’ll survey higher-level
tools that have simplified the process of ferrying data into Hadoop.
We’ll look at how you can automate the movement of log files with
Flume, and how Sqoop can be used to move relational data. So as not
to ignore some of the emerging data systems, you’ll also be
introduced to methods that can be employed to move data from
HBase and Kafka into Hadoop.

We’ll cover a lot of ground in this chapter, and it’s likely that you’ll
have specific types of data you need to work with. If this is the case,
feel free to jump directly to the section that provides the details you
need.

Let’s start things off with a look at key ingress and egress system
considerations.

5.1. Key elements of data movement

Moving large quantities of data in and out of Hadoop offers logistical challenges that include consistency guarantees and resource impacts on data sources and destinations. Before we dive into the techniques, however, we need to discuss the design elements you should be aware of when working with data movement.

Idempotence

An idempotent operation produces the same result no matter how many times it’s executed. In a relational database, inserts typically aren’t idempotent, because executing them multiple times doesn’t produce the same resulting database state. In contrast, updates often are idempotent, because they’ll produce the same end result.

Any time data is being written, idempotence should be a consideration, and data ingress and egress in Hadoop are no different. How well do distributed log collection frameworks deal with data retransmissions? How do you ensure idempotent behavior in a MapReduce job where multiple tasks are inserting into a database in parallel? We’ll examine and answer these questions in this chapter.
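
As one possible illustration (not from the original text), the JDBC fragment below turns a plain insert into an idempotent "upsert" keyed on a primary key, so that a retried or duplicated task leaves the database in the same state. The connection URL, table, columns, and the ON DUPLICATE KEY UPDATE syntax (MySQL-specific) are all assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class IdempotentWrite {
  public static void main(String[] args) throws Exception {
    // Hypothetical database and schema; the point is the write pattern, not the target
    String jdbcUrl = "jdbc:mysql://dbhost:3306/analytics";
    try (Connection con = DriverManager.getConnection(jdbcUrl, "user", "secret");
         PreparedStatement ps = con.prepareStatement(
             "INSERT INTO events (event_id, payload) VALUES (?, ?) "
             + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)")) {  // MySQL upsert syntax
      ps.setLong(1, 42L);                  // deterministic key derived from the source record
      ps.setString(2, "example payload");
      ps.executeUpdate();                  // running this twice yields the same final row
    }
  }
}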

Aggregation

The data aggregation process combines multiple data elements. In the context of data ingress, this can be useful because moving large quantities of small files into HDFS potentially translates into NameNode memory woes, as well as slow MapReduce execution times. Having the ability to aggregate files or data together mitigates this problem and is a feature to consider.

Data format transformation


The data format transformation process converts one data format into
another. Often your source data isn’t in a format that’s ideal for
processing in tools such as MapReduce. If your source data is in
multiline XML or JSON form, for example, you may want to consider
a preprocessing step. This would convert the data into a form that can
be split, such as one JSON or XML element per line, or convert it into
a format such as Avro. Chapter 3 contains more details on these data
formats.
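
As a sketch of such a preprocessing step (assuming the Jackson library and an input file holding a multiline JSON array of records; neither is prescribed by the text), the program below rewrites each record as one compact JSON object per line so the output can be split cleanly.

import java.io.BufferedWriter;
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonToLines {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();
    JsonNode records = mapper.readTree(new File(args[0]));   // multiline JSON array (assumed)
    try (BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
      for (JsonNode record : records) {
        out.write(mapper.writeValueAsString(record));        // one compact object per line
        out.newLine();
      }
    }
  }
}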

Compression

Compression not only helps by reducing the footprint of data at rest, but also has I/O advantages when reading and writing data.

Availability and recoverability

Recoverability allows an ingress or egress tool to retry in the event of a failed operation. Because it’s unlikely that any data source, sink, or Hadoop itself can be 100% available, it’s important that an ingress or egress action be retried in the event of failure.

Reliable data transfer and data validation

In the context of data transportation, checking for correctness is how you verify that no data corruption occurred while the data was in transit. When you work with heterogeneous systems such as Hadoop data ingress and egress, the fact that data is being transported across different hosts, networks, and protocols only increases the potential for problems during data transfer. A common method for checking the correctness of raw data, such as that on storage devices, is the Cyclic Redundancy Check (CRC), which is what HDFS uses internally to maintain block-level integrity.

In addition, it’s possible that there are problems in the source data
itself due to bugs in the software generating the data. Performing
these checks at ingress time allows you to do a one-time check,
instead of dealing with all the downstream consumers of the data that
would have to be updated to handle errors in the data.
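
The fragment below is only an illustration of the checksum idea using java.util.zip.CRC32 from the Java standard library; it is not HDFS’s internal mechanism. Computing a checksum before and after a transfer and comparing the two values is one way to detect corruption in transit.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.CRC32;
import java.util.zip.CheckedInputStream;

public class FileCrc32 {
  public static void main(String[] args) throws Exception {
    CRC32 crc = new CRC32();
    try (CheckedInputStream in = new CheckedInputStream(
             Files.newInputStream(Paths.get(args[0])), crc)) {
      byte[] buffer = new byte[8192];
      while (in.read(buffer) != -1) {
        // streaming through the file updates the running CRC32 value
      }
    }
    System.out.println("CRC32(" + args[0] + ") = " + crc.getValue());
  }
}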
Resource consumption and performance

Resource consumption and performance are measures of system resource utilization and system efficiency, respectively. Ingress and egress tools don’t typically impose significant load (resource consumption) on a system, unless you have appreciable data volumes. For performance, the questions to ask include whether the tool performs ingress and egress activities in parallel, and if so, what mechanisms it provides to tune the amount of parallelism. For example, if your data source is a production database and you’re using MapReduce to ingest that data, don’t use a large number of concurrent map tasks to import data.

Monitoring

Monitoring ensures that functions are performing as expected in automated systems. For data ingress and egress, monitoring breaks down into two elements: ensuring that the processes involved in ingress and egress are alive, and validating that source and destination data are being produced as expected. Monitoring should also include verifying that the data volumes being moved are at expected levels; unexpected drops or spikes in your data will alert you to potential system issues or bugs in your software.

Speculative execution

MapReduce has a feature called speculative execution that launches duplicate tasks near the end of a job for tasks that are still executing. This helps prevent slow hardware from impacting job execution times. But if you’re using a map task to perform inserts into a relational database, for example, you should be aware that you could have two parallel processes inserting the same data.[2]
2
Map- and reduce-side speculative execution can be disabled via the
mapreduce.map.speculative and mapreduce.reduce.speculative
configurables in Hadoop 2.
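
For example, a job that writes to a relational database could disable speculation during job setup using exactly those properties. The sketch below is illustrative and omits the rest of the driver (input and output formats, paths, mapper and reducer classes).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NonSpeculativeJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Prevent duplicate map/reduce attempts from inserting the same rows twice
    conf.setBoolean("mapreduce.map.speculative", false);
    conf.setBoolean("mapreduce.reduce.speculative", false);
    Job job = Job.getInstance(conf, "export-to-db");   // job name is illustrative
    System.out.println("map speculation enabled: "
        + job.getConfiguration().getBoolean("mapreduce.map.speculative", true));
  }
}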

On to the techniques. Let’s start with how you can leverage Hadoop’s
built-in ingress mechanisms.
5.2. Moving data into Hadoop

The first step in working with data in Hadoop is to make it available to Hadoop. There are two primary methods that can be used to move data into Hadoop: writing external data at the HDFS level (a data push), or reading external data at the MapReduce level (more like a pull). Reading data in MapReduce has advantages in the ease with which the operation can be parallelized and made fault tolerant. Not all data is accessible from MapReduce, however, such as in the case of log files, which is where other systems need to be relied on for transportation, including HDFS for the final data hop.

In this section we’ll look at methods for moving source data into
Hadoop. I’ll use the design considerations in the previous section as
the criteria for examining and understanding the different tools.

We’ll get things started with a look at some low-level methods you
can use to move data into Hadoop.

5.2.1. Roll your own ingest

Hadoop comes bundled with a number of methods to get your data into HDFS. This section will examine various ways that these built-in tools can be used for your data movement needs. The first and potentially easiest tool you can use is the HDFS command line.

Picking the right ingest tool for the job

The low-level tools in this section work well for one-off file
movement activities, or when working with legacy data sources and
destinations that are file-based. But moving data in this way is quickly being made obsolete by the availability of tools such as Flume and Kafka (covered later in this chapter), which offer automated data-movement pipelines.

Kafka is a much better platform for getting data from A to B (and B can be a Hadoop cluster) than the old-school “let’s copy files around!” approach. With Kafka, you only need to pump your data into Kafka, and you have the ability to consume the data in real time (such as via Storm) or in offline/batch jobs (such as via Camus).

File-based ingestion flows are, to me at least, a relic of the past (because everybody knows how scp works :-P), and they primarily exist for legacy reasons: the upstream data sources may have existing tools to create file snapshots (such as dump tools for the database), and there’s no infrastructure to migrate or move the data into a real-time messaging system such as Kafka.
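
As a small, hedged sketch of the “pump your data into Kafka” idea (the broker address, topic name, and message contents are all assumptions, not from the text), a producer can look like this:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LogLineProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092");   // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    try (Producer<String, String> producer = new KafkaProducer<>(props)) {
      // each log line becomes one message on the (illustrative) "weblogs" topic
      producer.send(new ProducerRecord<>("weblogs", "host-01", "GET /index.html 200"));
    }
  }
}

The same topic can then be consumed in real time or drained into HDFS by a batch consumer, which is the decoupling the paragraph above describes.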

Technique 33 Using the CLI to load files

If you have a manual activity that you need to perform, such as moving the examples bundled with this book into HDFS, then the HDFS command-line interface (CLI) is the tool for you. It’ll allow you to perform most of the operations that you’re used to performing on a regular Linux filesystem. In this section we’ll focus on copying data from a local filesystem into HDFS.

Problem

You want to copy files into HDFS using the shell.

Solution

The HDFS command-line interface can be used for one-off moves, or it can be incorporated into scripts for a series of moves.

Discussion

Copying a file from local disk to HDFS is done with the hadoop
command:

$ hadoop fs -put local-file.txt hdfs-file.txt

The behavior of the Hadoop -put command differs from the Linux cp
command—in Linux if the destination already exists, it is overwritten;
in Hadoop the copy fails with an error:
put: `hdfs-file.txt': File exists

The -f option must be added to force the file to be overwritten:

$ hadoop fs -put -f local-file.txt hdfs-file.txt

Much like with the Linux cp command, multiple files can be copied
using the same command. In this case, the final argument must be the
directory in HDFS into which the local files are copied:

$ hadoop fs -put local-file1.txt local-file2.txt /hdfs/dest/
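
Because this chapter also mentions using Java for ingress, here is a minimal FileSystem-API equivalent of the -put command shown above. It is a sketch: it assumes the Hadoop client configuration is on the classpath, and the file names simply mirror the CLI example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads fs.defaultFS from core-site.xml
    FileSystem fs = FileSystem.get(conf);
    // Equivalent of "hadoop fs -put -f local-file.txt hdfs-file.txt":
    // delSrc=false keeps the local copy, overwrite=true behaves like the -f option
    fs.copyFromLocalFile(false, true, new Path("local-file.txt"), new Path("hdfs-file.txt"));
    fs.close();
  }
}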
