Big Data Unit-2 PPT part1

The document provides a comprehensive overview of the history and architecture of Apache Hadoop, an open-source framework for storing and processing large datasets. It details the evolution of Hadoop from its origins in Apache Nutch, the introduction of the Hadoop Distributed File System (HDFS), and the role of YARN in resource management. Additionally, it discusses data formats used in Hadoop and the concepts of scaling up versus scaling out in data storage solutions.


HISTORY
• In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open-source web crawler software project.
• While working on Apache Nutch, they found that storing the rapidly growing volume of crawled data was becoming very costly. This storage problem became one of the important reasons for the emergence of Hadoop.
• In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.

• In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
• In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). Hadoop was originally built around this Nutch Distributed File System.
• The name 'Hadoop' came from the toy elephant of one of Doug Cutting's sons.
• In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System).
History
Apache Hadoop
⚫ As the amount of data grew rapidly in the early 2000s, there were issues with storing such vast volumes of data.
⚫ Doug Cutting came up with the idea to develop Hadoop in 2002.
⚫ Hadoop was created by Doug Cutting and Mike Cafarella and was developed at the Apache Software Foundation in 2005.
⚫ It is written in Java.
⚫ Hadoop is a Java-based Apache open-source framework that allows you to store and process large data in a parallel and distributed manner.
⚫ Hadoop has units like HDFS to store Big Data, MapReduce to process Big Data, and YARN to manage the cluster's resources.
Is Hadoop a Database?
⚫ Hadoop is not a database, but rather an open-source software framework specifically built to handle large volumes of structured, semi-structured and unstructured data.
⚫ Hadoop is an open-source software framework for storing and processing large datasets ranging in size from gigabytes to petabytes.

How does Hadoop work?

⚫ It is quite expensive to build bigger servers with heavy configurations that handle large-scale processing, but as an alternative, you can tie together many commodity computers, each with a single CPU.
⚫ The clustered machines can read the dataset in parallel and provide a much higher throughput.
⚫ It is also cheaper than one high-end server.
⚫ So this is the first motivational factor behind using Hadoop: it runs across clustered, low-cost machines.
Features

⚫ Hadoop is not only a storage system but a platform for data storage as well as processing.
⚫ It is scalable (we can add more nodes on the fly).
⚫ It is fault tolerant (even if a node goes down, its data is processed by another node).
⚫ It efficiently processes large volumes of data on a cluster of commodity hardware.
Why use Apache Hadoop?
⚫ Hadoop works in a master-slave fashion. There is a master node and there are n numbers of slave nodes.
⚫ The master manages, maintains and monitors the slaves.
⚫ The master stores the metadata (data about data) while the slaves are the nodes which store the data.
⚫ Data is stored in a distributed way across the cluster.
⚫ The client connects with the master node.
Components of Hadoop

Hadoop Distributed File System

⚫ In Hadoop, data resides in a distributed file system which is called the Hadoop Distributed File System.
⚫ The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware.
⚫ Hadoop Distributed File System (HDFS) is the storage unit of Hadoop.
⚫ HDFS splits files into blocks and distributes them across the various nodes of large clusters.
⚫ HDFS is designed for storing very large data files, running on clusters of commodity hardware.
⚫ It is fault tolerant, scalable, and extremely simple to expand.
⚫ Hadoop HDFS has a Master/Slave architecture in which the Master is the NameNode and the Slaves are the DataNodes.
⚫ The HDFS architecture consists of a single NameNode; all the other nodes are DataNodes.
HDFS Architecture

HDFS NameNode
⚫ It is also known as the Master node.
⚫ The HDFS NameNode stores metadata, i.e. the number of data blocks, their replicas and other details.
⚫ This metadata is kept in memory on the master for faster retrieval of data.
⚫ The NameNode maintains and manages the slave nodes, and assigns tasks to them.
⚫ It should be deployed on reliable hardware as it is the centerpiece of HDFS.
Functions of NameNode
⚫ Manages the file system namespace.
⚫ Regulates clients' access to files.
⚫ It also executes file system operations such as naming, closing and opening files/directories.
⚫ All DataNodes send a Heartbeat and block report to the NameNode in the Hadoop cluster.
⚫ The NameNode is also responsible for taking care of the Replication Factor of all the blocks.
Files present in the NameNode metadata are:
FsImage –
⚫ It is an "image file"; FsImage stands for File System Image.
⚫ It contains the complete directory structure (namespace) of HDFS, with details about the location of the data blocks and which blocks are stored on which node.
⚫ It is stored as a file in the NameNode's local file system.
⚫ FsImage is a point-in-time snapshot of HDFS's namespace.
⚫ The last snapshot is what is actually stored in FsImage.
EditLogs –
⚫ The EditLog is a transaction log that records the changes in the HDFS file system or any action performed on the HDFS cluster, such as addition of a new block, replication, deletion etc.
⚫ The edit log records every change made since the last snapshot.
⚫ In short, it records the changes since the last FsImage was created.
⚫ It contains all the recent modifications made to the file system on top of the most recent FsImage.
⚫ The NameNode receives a create/update/delete request from the client. After that, the request is first recorded in the EditLog.
HDFS DataNode
⚫ It is also known as the Slave node.
⚫ In the Hadoop HDFS architecture, the DataNode stores the actual data in HDFS.
⚫ It performs read and write operations as per the request of the client.
⚫ DataNodes can be deployed on commodity hardware.
Functions of DataNode
⚫ Block replica creation, deletion and replication according to the instructions of the NameNode.
⚫ The DataNode manages the data storage of the system.
⚫ DataNodes send a heartbeat to the NameNode. By default, this frequency is set to 3 seconds: every 3 seconds, each DataNode sends a heartbeat signal to the NameNode.
Blocks
⚫ HDFS in Apache Hadoop splits huge files into small chunks known as Blocks.
⚫ These are the smallest units of data in a filesystem.
⚫ We (client and admin) do not have any control over the block, such as the block location.
Block size of HDFS

A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
Replication Management
⚫ Block replication provides fault tolerance: if one copy is not accessible or is corrupted, we can read the data from another copy.
⚫ The number of copies or replicas of each block of a file is the replication factor. The default replication factor is 3, which is again configurable. So, each block is replicated three times and stored on different DataNodes.
⚫ The NameNode receives a block report from each DataNode periodically to maintain the replication factor.
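As a quick illustration of how the block size and replication factor translate into physical storage, the short Python sketch below computes how many blocks and block replicas a file occupies. The 1 GB file size is an invented example; 128 MB and a replication factor of 3 are the defaults mentioned above.

```python
import math

def hdfs_block_usage(file_size_mb, block_size_mb=128, replication_factor=3):
    """Estimate how many HDFS blocks and block replicas a file occupies."""
    num_blocks = math.ceil(file_size_mb / block_size_mb)  # the file is chopped into block-sized chunks
    total_replicas = num_blocks * replication_factor      # each block is stored replication_factor times
    raw_storage_mb = file_size_mb * replication_factor    # approximate raw disk space consumed
    return num_blocks, total_replicas, raw_storage_mb

# Example: a 1 GB (1024 MB) file with the default 128 MB blocks and replication factor 3.
blocks, replicas, raw_mb = hdfs_block_usage(1024)
print(blocks, replicas, raw_mb)  # -> 8 blocks, 24 block replicas, 3072 MB of raw storage
```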
Secondary NameNode
⚫ The Secondary NameNode downloads the FsImage and EditLogs from the NameNode.
⚫ It then merges the EditLogs with the FsImage (File System Image).
⚫ It keeps the edit log size within a limit.
⚫ It stores the modified FsImage in persistent storage.
⚫ This copy can be used in the case of a NameNode failure.
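To make the FsImage/EditLog relationship concrete, here is a deliberately simplified toy model in Python (this is not HDFS code): the namespace snapshot is a dictionary, the edit log is a list of recorded operations, and the checkpoint replays the edits onto the snapshot, which is conceptually what the Secondary NameNode's merge does.

```python
# Toy model of an HDFS checkpoint -- illustrative only, not real HDFS code.
fsimage = {"/data/file1": {"blocks": 3}, "/data/file2": {"blocks": 1}}  # last namespace snapshot
edit_log = [                                                            # changes recorded since that snapshot
    ("create", "/data/file3", {"blocks": 2}),
    ("delete", "/data/file2", None),
]

def checkpoint(fsimage, edit_log):
    """Replay the edit log onto the FsImage and return a new snapshot plus an empty log."""
    merged = dict(fsimage)
    for op, path, meta in edit_log:
        if op == "create":
            merged[path] = meta
        elif op == "delete":
            merged.pop(path, None)
    return merged, []

new_fsimage, new_edit_log = checkpoint(fsimage, edit_log)
print(new_fsimage)  # {'/data/file1': {'blocks': 3}, '/data/file3': {'blocks': 2}}
```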
Rack Awareness
A rack is a collection of around 40-50 DataNodes connected using the same network switch. If the network switch goes down, the whole rack will be unavailable. A large Hadoop cluster is deployed across multiple racks.
Rack Awareness
⚫ In the HDFS architecture, the NameNode makes sure that all the replicas are not stored on the same rack or a single rack.
⚫ It follows the Rack Awareness Algorithm to reduce latency and improve fault tolerance.
⚫ We know that the default replication factor is 3.
⚫ Rack Awareness is important to improve:
 Data availability and reliability.
 The performance of the cluster.
 Network bandwidth.
Why did we have YARN?
(Yet Another Resource Negotiator)
⚫ In the Hadoop 1.x version, the only way to process data was through MapReduce (MapReduce is a processing framework, or a program written in Java, on Hadoop).
⚫ YARN was introduced in the Hadoop 2.x version.
⚫ In Hadoop 1.x, the JobTracker was a single point of failure and the cluster only supported MapReduce jobs.
⚫ In Hadoop 2.x we got YARN, which brought High Availability and allowed the cluster to run workloads other than MapReduce.
In YARN, there are at least three actors:
⚫ the Job Submitter (the client)
⚫ the Resource Manager (the master)
⚫ the Node Manager (the slave)
The application startup process is the following:
⚫ 1. a client submits an application to the Resource Manager
⚫ 2. the Resource Manager allocates a container
⚫ 3. the Resource Manager contacts the related Node Manager
⚫ 4. the Node Manager launches the container
⚫ 5. the container executes the Application Master
YARN Infrastructure (Yet Another Resource Negotiator)
⚫ It is the framework responsible for providing the computational resources (e.g., CPUs, memory, etc.) needed for application executions.
Two important elements are:
⚫ The Resource Manager (one per cluster) is the master. It knows where the slaves are located (Rack Awareness) and how many resources they have.
⚫ It runs several services; the most important is the Resource Scheduler, which decides how to assign the resources.
⚫ The Node Manager (many per cluster) is the slave of the infrastructure.
⚫ When it starts, it announces itself to the Resource Manager.
⚫ Periodically, the Node Manager sends a heartbeat to the Resource Manager.
Container
A container can be understood as a logical reservation of resources that will be utilized by the task running in that container.
Application Manager: accepts job submissions and provides the service to relaunch the Application Master in case of failure.
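To make the Resource Scheduler's job concrete, here is a deliberately simplified toy sketch in Python (not YARN code; the node names and resource numbers are invented): it picks a Node Manager with enough free memory and vcores for a requested container and reserves that capacity.

```python
# Toy model of container scheduling -- illustrative only, not real YARN code.
nodes = {
    "node1": {"free_mem_mb": 4096, "free_vcores": 2},
    "node2": {"free_mem_mb": 1024, "free_vcores": 1},
}

def schedule_container(nodes, mem_mb, vcores):
    """Return a node that can host the requested container and reserve its resources, or None."""
    for name, free in nodes.items():
        if free["free_mem_mb"] >= mem_mb and free["free_vcores"] >= vcores:
            free["free_mem_mb"] -= mem_mb    # the reservation is the 'container'
            free["free_vcores"] -= vcores
            return name
    return None

print(schedule_container(nodes, mem_mb=2048, vcores=1))  # -> node1
print(schedule_container(nodes, mem_mb=2048, vcores=2))  # -> None (no node has enough capacity left)
```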
DATA FORMAT
⚫ A data/file format defines how information is stored in HDFS.
⚫ Hadoop does not have a default file format, and the choice of a format depends on its use. Choosing the correct file format is one of the crucial steps in big-data projects.
⚫ The big problem in the performance of applications that use HDFS is the information search time and the writing time.
⚫ Managing the processing and storage of large volumes of information is very complex; that is why a suitable data format is required.
Advantages of Using Appropriate File Formats:
1. Faster reads
2. Faster writes
3. Splittable file support
4. Schema evolution can be supported
5. Advanced compression can be achieved
Some of the most commonly used formats in the Hadoop ecosystem are:
⚫ Text/CSV (comma-separated values)
⚫ SequenceFile
⚫ Avro
⚫ Parquet
⚫ RCFile (Record Columnar File)
⚫ ORC (Optimized Row Columnar)
● Text/CSV:
⚫ CSV (comma-separated values) files are comma-delimited files, where data is stored in a row-based file format.
⚫ They are mostly used for exchanging tabular data, where each header/value is delimited using a comma (","), pipe ("|"), etc. Generally, the first row contains the header names.
● SequenceFile:
The SequenceFile format stores the data in binary format; this format accepts compression but does not store metadata.
● Avro:
Avro is a row-based storage format. This format includes the definition of the schema of your data in JSON format.
Avro allows block compression along with splittability, making it a good choice for most cases when using Hadoop.
● Parquet:
Parquet is a column-based binary storage format that can store nested data structures.
This format is very efficient in terms of disk input/output operations when only the necessary columns are specified.
● RCFile (Record Columnar File):
RCFile is a columnar format that divides data into groups of rows, and inside each group the data is stored in columns.
● ORC (Optimized Row Columnar):
ORC is considered an evolution of the RCFile format and has all its benefits alongside some improvements, such as better compression.
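As a small illustration of the row-format versus column-format trade-off described above, the sketch below writes a tiny table to Parquet and reads back a single column. It assumes the optional pyarrow library is installed; the file name, column names and values are invented for the example.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A tiny example table (column names and values are made up).
table = pa.table({
    "user_id": [1, 2, 3],
    "country": ["IN", "US", "DE"],
    "clicks":  [10, 25, 7],
})

# Write it out in the column-oriented Parquet format.
pq.write_table(table, "events.parquet")

# Because Parquet stores each column separately, a reader can pull only the
# columns it needs instead of scanning whole rows (unlike Text/CSV or Avro).
clicks_only = pq.read_table("events.parquet", columns=["clicks"])
print(clicks_only.to_pydict())  # {'clicks': [10, 25, 7]}
```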
Scaling Up vs Scaling Out

There are two commonly used types of data scaling:
1. Scaling Up
2. Scaling Out

Both approaches are used to increase storage capacity.
Scaling Up vs Scaling Out
Scaling up / vertical scaling: when we add more resources to a single machine as the load increases. For example, you need 20 GB of RAM but currently your server has 10 GB of RAM, so you add extra RAM to the same server to meet the need.
Scaling out / horizontal scaling: when you add more machines to match the resource needs. So if I already have a machine with 10 GB, I will add an extra machine with 10 GB of RAM.
Scaling up, or vertical scaling:
⚫ It involves obtaining a faster server with more powerful processors and more memory.
⚫ For many platforms, it may only provide a short-term fix, especially if continued growth is expected.
⚫ Cloud computing providers, such as Microsoft Azure and Google Cloud, allow you to scale up your virtual machine with a few clicks.
Scaling Up
Scaling up / vertical scaling: when we add more resources to a single machine as the load increases.
⚫ When running massive data centers, you may face the need to increase your machine's capacity to run larger workloads. In this case, apply vertical scaling, or scale-up.
⚫ Scale-up is a simple method of increasing your computing capacity by adding additional resources, such as a central processing unit (CPU) and dynamic random-access memory (DRAM), to on-premises servers, or improving the performance of your disk by changing it to a faster one.
Scaling Out
⚫ When you add more machines to match the resource needs, it is called horizontal scaling.
⚫ A few storage hardware units and controllers/nodes/machines/servers are added in order to increase capacity.
⚫ It involves adding servers for parallel computing. The scale-out technique is a long-term solution, as more and more servers may be added when needed.
⚫ This approach is popular among companies such as Amazon, Uber and Netflix that want to provide customers all over the world with the same user experience. (Instead of buying one powerful machine, horizontal scaling means adding several simpler, low-cost machines.)
Hadoop uses the scale-out feature to improve performance
⚫ Incremental, scale-out architecture.
⚫ It is designed for distributed, parallel processing and storage, making it more efficient and cost-effective for handling large datasets compared to scaling up with expensive high-performance servers.
⚫ If more storage is needed, additional nodes can be added to the cluster, making it scalable horizontally.
⚫ Adding a new server to the cluster adds more storage as well as more CPU, so the system never slows down as it grows.
Analyzing Data with Hadoop
• Big Data is a term used to refer to a huge collection of data that comprises both structured data found in traditional databases and unstructured data like text documents, video and audio.
• Big Data is not merely data but also a collection of various tools, techniques, frameworks and platforms.
Analyzing Data with Hadoop
Prerequisites for using Hadoop
• Linux-based operating systems like Ubuntu or Debian are preferred for setting up Hadoop.
• Although Hadoop's internal framework is written in Java, it supports multiple programming languages for data processing.
Analysing Data with Hadoop
There are four main libraries in Hadoop.
1. Hadoop Common: provides utilities used by all other modules in Hadoop.
2. Hadoop Distributed File System (HDFS): stores data and maintains records over various machines or clusters. It also allows the data to be stored in an accessible format.
3. Hadoop MapReduce: works as a parallel framework for scheduling and processing the data.
4. Hadoop YARN: an acronym for Yet Another Resource Negotiator. It is the improved resource-management layer that replaced the JobTracker of classic MapReduce and is used for managing processes running over Hadoop.
Analysing Data with Hadoop
Other packages that can support Hadoop are listed below.
Apache Oozie: a scheduling system that manages processes taking place in Hadoop.
Apache Pig: a platform to run programs made on Hadoop.
Cloudera Impala: an SQL query engine for data stored in Hadoop. Originally it was created by the software organisation Cloudera, but was later released as open source software.
Apache HBase: a non-relational database for Hadoop.
Apache Hive: a data warehouse used for summarisation, querying and the analysis of data.
Analysing Data with Hadoop
Apache Sqoop: used to transfer data between Hadoop and structured data sources.
Apache Flume: a tool used to move streaming data into HDFS.
Cassandra: a scalable, distributed database system.
Hadoop Ecosystem

Hadoop Ecosystem
⚫ Kafka and Flume
Flume and Kafka can both act as the event backbone for real-time event processing.
Kafka can process and monitor data in distributed systems, whereas Flume gathers data from distributed systems to land the data in a centralized data store.
⚫ Oozie
Apache Oozie is a Java Web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
Hive and Pig
Hive uses HQL (Hive Query Language), which is similar to SQL.
Hive and Pig are popular choices that provide SQL-like and procedural data-flow-like languages, respectively.
Pig Hadoop is an abstraction over MapReduce. It is a tool/platform used to analyze larger sets of data by representing them as data flows.
HBase
⚫ It is also a popular way to store and analyze data in HDFS.
⚫ It is a column-oriented database, and unlike MapReduce, provides random read and write access to data with low latency.
⚫ MapReduce jobs can read and write data in HBase tables.
HADOOP STREAMING
⚫ "Streaming is a technique for transferring data so that it can be processed at a defined frame rate and as a continuous stream/flow of data."
⚫ It is a feature that comes with a Hadoop distribution that allows developers or programmers to write the Map-Reduce program using different programming languages like Ruby, Perl, Python, C++, etc.
⚫ Hadoop Streaming supports the execution of Java, as well as non-Java, programmed MapReduce jobs over the Hadoop cluster.
HADOOP STREAMING
⚫ We can use any language that can read from the standard input (STDIN), like keyboard input, and write using the standard output (STDOUT).
⚫ You have Apache Flume, which can collect all of your streaming data and write it into HDFS in batches (say every hour) without any loss of records.
⚫ Apache Kafka is an event streaming platform which combines messaging, storage and processing of data.
⚫ The streaming job reads input data stored in HDFS and converts it into key-value pairs.
⚫ We have an Input Reader which is responsible for reading the input data and producing the list of key-value pairs. The format depends on the input type (text files, CSV, JSON).
⚫ The input key-value pairs are passed to an external Mapper program (e.g., a Python script).
⚫ This script reads from STDIN (Standard Input) and processes the data. The Mapper script outputs intermediate key-value pairs via STDOUT (Standard Output).
⚫ Intermediate key-value pairs (Shuffle & Sort, handled by Hadoop): Hadoop automatically groups, sorts and shuffles the intermediate key-value pairs. All occurrences of the same key are combined before going to the Reducer.
⚫ The grouped key-value pairs are sent to an external Reducer program. This script reads from STDIN, aggregates the values, and writes the results to STDOUT.
⚫ The final output from the Reducer is written back to HDFS in a structured format (CSV, JSON, or plain text). A minimal word-count mapper/reducer pair of this kind is sketched below.
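Below is a minimal sketch of such a mapper and reducer for word counting, written in Python. The file names mapper.py and reducer.py, and any paths, are placeholders for this example.

```python
#!/usr/bin/env python3
# mapper.py -- reads lines from STDIN and emits "word<TAB>1" pairs on STDOUT
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- receives the sorted "word<TAB>count" pairs on STDIN and sums counts per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

You can test the pair locally without Hadoop with: cat input.txt | python3 mapper.py | sort | python3 reducer.py. On a cluster, the pair would be launched through the Hadoop Streaming jar with options such as -input, -output, -mapper and -reducer; the exact jar location and flags depend on the Hadoop version and installation.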
HADOOP PIPES
⚫ Hadoop Pipes is a C++ API for Hadoop MapReduce, which allows developers to write MapReduce programs in C++ instead of Java while still leveraging Hadoop's distributed computing capabilities.
⚫ Unlike Streaming, which uses standard input and output to communicate with the map and reduce code, Pipes uses sockets as the channel over which the task tracker communicates with the process running the C++ map or reduce code.
When to Use Hadoop Pipes?
• When performance is critical (C++ is faster than Java for
some tasks).
• If you have existing C++ libraries that you want to integrate
with Hadoop.
• For CPU-intensive tasks, where C++ can be more efficient
than Java.
However, since Hadoop primarily runs on the JVM, using C++
requires additional configuration and maintenance.
HADOOP Streaming and Pipes

⚫ Hadoop Streaming is an API to MapReduce used to write non-Java map and reduce functions.
⚫ Hadoop Pipes is the C++ interface to MapReduce.
Read and write data in HDFS
⚫ You can execute various reading and writing operations, such as creating a directory, providing permissions, copying files, updating files, deleting, etc.
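As a minimal sketch of driving these operations from a script (assuming a configured Hadoop client so that the standard hdfs dfs commands are available on the PATH; the paths and permission bits below are arbitrary examples):

```python
import subprocess

def hdfs(*args):
    """Run an 'hdfs dfs' shell command and raise an error if it fails."""
    subprocess.run(["hdfs", "dfs", *args], check=True)

# The HDFS paths below are placeholders for the example.
hdfs("-mkdir", "-p", "/user/demo/input")             # create a directory
hdfs("-put", "local_data.txt", "/user/demo/input")   # copy a local file into HDFS
hdfs("-chmod", "755", "/user/demo/input")            # change permissions
hdfs("-cat", "/user/demo/input/local_data.txt")      # read the file back to stdout
hdfs("-rm", "-r", "/user/demo/input")                # delete the directory recursively
```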
File Sizes, Block Sizes, and Block Abstraction in HDFS
File Size in HDFS
HDFS is designed to store very large files (terabytes to
petabytes in size).
Since HDFS is optimized for large-scale data processing,
small files are not efficient due to the overhead in metadata
management.
Files stored in HDFS are divided into fixed-size blocks,
which are then distributed across different nodes in the
cluster.

Block Size in HDFS
In traditional file systems, block sizes are typically 4 KB or 8 KB, but in HDFS, blocks are much larger (default: 128 MB, often configured to 256 MB).
The large block size helps reduce the overhead of metadata stored in the NameNode.
Blocks are replicated across multiple nodes (default: 3 replicas) to ensure fault tolerance and high availability.
The block size can be configured when a file is written to HDFS.
Block Abstraction in HDFS
When a file is stored in HDFS, it is divided into multiple blocks, but users interact with the file as a single entity.
Physical Storage: each block is stored independently on different nodes in the cluster.
Replication: blocks are replicated across multiple nodes based on the replication factor (default: 3) for data reliability.
Fault Tolerance: if a node fails, the missing block(s) can be retrieved from another node containing a replica.
Distributed Processing: since HDFS integrates with MapReduce and other big data frameworks, the block abstraction helps optimize parallel processing by allowing different nodes to process different blocks of a file simultaneously.
Read and write data in HDFS
HDFS operations to read a file
⚫ To read any file from HDFS, you have to interact with the NameNode, as it stores the metadata about the DataNodes.
⚫ The user gets a token from the NameNode that specifies the address where the data is stored.
⚫ You can put a read request to the NameNode for a particular block location through the distributed file system client.
⚫ The NameNode will then check your privilege to access the DataNode and allow you to read the address block if the access is valid.
Read & Write Operations in
HDFS
⚫ You can execute various reading, writing
operations such as creating a directory,
providing permissions, copying files,
updating files, deleting, etc.
⚫ HDFS Operations to Read the file
⚫ To read any file from the HDFS, you have to
interact with the NameNode as it stores
the metadata about the DataNodes.
⚫ The user gets a token from the NameNode
and that specifies the address where the data is
stored.
⚫ You can put a read request to NameNode for a
particular block location through distributed file
systems.
⚫ The NameNode will then check your privilege to
access the DataNode and allows you to read the
address block if the access is valid. 74
Challenges of HDFS
Analysing Data with Hadoop
⚫ To increase developer productivity, several higher-level languages and APIs have been created that abstract away the low-level details of the MapReduce programming model.
⚫ There are several choices available for writing data analysis jobs:
The Hive and Pig projects
HBase