Big Data Unit-2 PPT part1
HISTORY
• In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It is an open-source web crawler software project.
• While working on Apache Nutch, they found that storing the crawled big data was very costly, and this cost became a serious constraint on the project. This problem became one of the important reasons for the emergence of Hadoop.
• In 2003, Google introduced a file system known as GFS (Google File System). It is a proprietary distributed file system developed to provide efficient access to data.
• In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
• In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). Hadoop was originally built on this Nutch Distributed File System.
• The name ‘Hadoop’ came from Doug Cutting’s son, who had given that name to his toy elephant.
• In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System).
History
Apache Hadoop
⚫ As the amount of data grew rapidly in the early 2000s, there were issues with storing such vast volumes of data.
⚫ Doug Cutting came up with the idea to develop Hadoop in 2002.
⚫ Hadoop was created by Doug Cutting and Mike Cafarella in 2005.
⚫ Hadoop was developed at the Apache Software Foundation in 2005.
⚫ It is written in Java.
⚫ Hadoop is a Java-based Apache open-source framework that allows storing and processing large data in a parallel and distributed manner.
⚫ Hadoop has units like HDFS (to store Big Data), MapReduce (to process Big Data), and YARN (to manage cluster resources).
Is Hadoop a Database?
⚫ Hadoop is not a database, but rather an open-source software framework specifically built to handle large volumes of structured, semi-structured, and unstructured data.
⚫ Hadoop is an open-source software framework for storing and processing large datasets ranging in size from gigabytes to petabytes.
How does Hadoop work?
Hadoop Distributed File System
HDFS NameNode
⚫ It is also known as the Master node.
⚫ The HDFS NameNode stores metadata, i.e. the number of data blocks, replicas, and other details.
⚫ This metadata is kept in memory on the master for faster retrieval of data.
⚫ The NameNode maintains and manages the slave nodes, and assigns tasks to them.
⚫ It should be deployed on reliable hardware, as it is the centerpiece of the HDFS architecture.
Functions of NameNode
⚫ Manages the file system namespace.
⚫ Regulates clients’ access to files.
⚫ It also executes file system operations such as naming, closing, and opening files/directories.
⚫ All DataNodes send a heartbeat and block report to the NameNode in the Hadoop cluster.
⚫ The NameNode is also responsible for taking care of the Replication Factor of all the blocks (a small metadata-query sketch follows this list).
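As an illustration of the metadata the NameNode serves, here is a minimal sketch (not from the slides) using the Hadoop Java FileSystem API. The path /data/sample.txt is a hypothetical example, and a running cluster configured through core-site.xml/hdfs-site.xml is assumed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataDemo {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster configured in core-site.xml / hdfs-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path used only for illustration
        Path file = new Path("/data/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // Replication factor and block size are part of the file's metadata
        System.out.println("Replication: " + status.getReplication());
        System.out.println("Block size : " + status.getBlockSize());

        // Block locations: which DataNodes hold which block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("Offset " + b.getOffset() + " -> " + String.join(",", b.getHosts()));
        }
        fs.close();
    }
}
```

Everything printed here comes from the NameNode's in-memory metadata; the file contents on the DataNodes are never read.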
Files present in the NameNode metadata are:
FsImage –
⚫ It is an “Image file”.
⚫ FsImage stands for File System Image.
⚫ It contains the complete directory structure (namespace) of HDFS, with details about the mapping of files to their data blocks.
⚫ It is stored as a file in the NameNode’s local file system.
⚫ FsImage is a point-in-time snapshot of HDFS’s namespace.
⚫ The last snapshot is actually stored in FsImage.
EditLogs –
⚫ EditLogs is a transaction log that records the changes in the HDFS file system, or any action performed on the HDFS cluster, such as addition of a new block, replication, deletion, etc.
⚫ The edit log records every change since the last snapshot.
⚫ In short, it records the changes since the last FsImage was created.
⚫ It contains all the recent modifications made to the file system on top of the most recent FsImage.
⚫ When the NameNode receives a create/update/delete request from the client, that request is first recorded in the EditLog before the in-memory namespace is updated.
HDFS DataNode
⚫ It is also known as the Slave node.
⚫ In the Hadoop HDFS architecture, the DataNode stores the actual data in HDFS.
⚫ It performs read and write operations as per the request of the client.
⚫ DataNodes can be deployed on commodity hardware.
Functions of DataNode
⚫ Block replica creation, deletion, and replication according to the instructions of the NameNode.
⚫ The DataNode manages the data storage of the system.
⚫ DataNodes send heartbeats to the NameNode.
⚫ By default, this frequency is set to 3 seconds.
⚫ Every 3 seconds, each DataNode sends a heartbeat signal to the NameNode (a configuration-reading sketch follows this list).
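The heartbeat and block-report intervals are ordinary HDFS configuration properties. The sketch below is a standalone illustration (not part of the slides) that reads their effective values with the Hadoop Configuration API, assuming hdfs-site.xml is on the classpath.

```java
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class HeartbeatConfigDemo {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // dfs.heartbeat.interval: how often a DataNode pings the NameNode (default 3 s)
        long heartbeatSecs = conf.getTimeDuration("dfs.heartbeat.interval", 3, TimeUnit.SECONDS);
        // dfs.blockreport.intervalMsec: how often a full block report is sent (default 6 h)
        long blockReportMs = conf.getLong("dfs.blockreport.intervalMsec", 21600000L);
        System.out.println("Heartbeat every " + heartbeatSecs + " s, block report every "
                + blockReportMs + " ms");
    }
}
```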
Blocks
⚫ HDFS in Apache Hadoop splits huge files into small chunks known as Blocks.
⚫ These are the smallest units of data in a filesystem.
⚫ We (client and admin) do not have any control over the block, such as the block location.
Block size of HDFS
Secondary NameNode:
⚫ The Secondary NameNode downloads the FsImage and EditLogs from the NameNode.
⚫ It then merges the EditLogs with the FsImage (FileSystem Image).
⚫ It keeps the edit log size within a limit.
⚫ It stores the modified FsImage in persistent storage.
⚫ And we can use it in the case of NameNode failure.
Rack Awareness
A Rack is a collection of around 40-50 DataNodes connected using the same network switch.
If the network switch goes down, the whole rack will be unavailable. A large Hadoop cluster is deployed across multiple racks.
Rack Awareness
⚫ In the HDFS architecture, the NameNode makes sure that all the replicas are not stored on the same rack or a single rack.
⚫ It follows the Rack Awareness Algorithm to reduce latency and improve fault tolerance.
⚫ We know that the default replication factor is 3.
⚫ Rack Awareness is important to improve:
  Data high availability and reliability.
  The performance of the cluster.
  Network bandwidth utilization.
Why did we have YARN?
(Yet Another Resource Negotiator)
⚫ In the Hadoop 1.x version, the only way to process data was through MapReduce. (MapReduce is a processing framework, or a program written in Java, on Hadoop.)
⚫ YARN was introduced in the Hadoop 2.x version.
⚫ In Hadoop 1.x, the JobTracker was a single point of failure, and the cluster only supported MapReduce jobs.
⚫ In Hadoop 2.x we got YARN, which brought High Availability of the cluster’s resource management and support for processing models beyond MapReduce.
In YARN, there are at least three actors:
⚫ the Job Submitter (the client)
⚫ the Resource Manager (the master)
⚫ the Node Manager (the slave)
The application startup process is the following (a client-side code sketch follows this list):
⚫ 1. a client submits an application to the Resource Manager
⚫ 2. the Resource Manager allocates a container
⚫ 3. the Resource Manager contacts the related Node Manager
⚫ 4. the Node Manager launches the container
⚫ 5. the container executes the Application Master
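A hedged sketch of step 1 from the client's side, using the YarnClient API. The application name, the dummy "sleep 60" command, and the 512 MB / 1 vcore request are placeholders; a real client would package and ship an actual Application Master instead.

```java
import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // 1. The client talks to the Resource Manager through YarnClient
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for a new application id
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");                    // placeholder name

        // Container that will run the Application Master (here just a dummy command)
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("sleep 60"));
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(512, 1));         // 512 MB, 1 vcore for the AM

        // Steps 2-5 (allocate a container, contact the Node Manager, launch the AM)
        // happen inside YARN after this call.
        ApplicationId appId = yarnClient.submitApplication(ctx);
        System.out.println("Submitted application " + appId);

        yarnClient.stop();
    }
}
```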
YARN Infrastructure (Yet Another Resource Negotiator)
⚫ It is the framework responsible for providing the computational resources (e.g., CPUs, memory, etc.) needed for application executions.
Two important elements are:
⚫ The Resource Manager (one per cluster) is the master. It knows where the slaves are located (Rack Awareness) and how many resources they have.
⚫ It runs several services; the most important is the Resource Scheduler, which decides how to assign resources to applications.
⚫ The Node Manager (many per cluster) is the slave of the infrastructure.
⚫ When it starts, it announces itself to the Resource Manager.
⚫ Periodically, the Node Manager sends a heartbeat to the Resource Manager.
Container
A container can be understood as a logical reservation of resources that will be utilized by the task running in that container.
Application Manager: accepts job submissions and provides the service to relaunch the Application Master in case of failure.
DATA FORMAT
⚫ A data/file format defines how information is stored in HDFS.
⚫ Hadoop does not have a default file format, and the choice of a format depends on its use. Choosing the correct file format is one of the crucial steps in big-data projects.
⚫ The big problem for the performance of applications that use HDFS is the information search time and the writing time.
⚫ Managing the processing and storage of large volumes of information is very complex; that’s why a certain data format is required.
Advantages of Using Appropriate File Formats:
1. Faster read
2. Faster write
3. Splittable file support
4. Schema evolution can be supported
5. Advanced compression can be achieved
Some of the most commonly used formats in the Hadoop ecosystem are:
⚫ Text/CSV (comma-separated values)
⚫ SequenceFile
⚫ Avro
⚫ Parquet
⚫ RCFile (Record Columnar File)
⚫ ORC (Optimized Row Columnar)
● Text/CSV:
⚫ CSV (comma-separated values)
⚫ CSV files are comma-delimited files, where data is stored in a row-based file format.
⚫ They are mostly used for exchanging tabular data, where each header/value is delimited using a comma (“,”), a pipe (“|”), etc. Generally, the first row contains the header names.
● SequenceFile:
⚫ A flat, binary file format that stores data as key-value pairs; it is splittable and supports compression.
● Parquet:
Parquet is a column-based binary storage format that can store nested data structures.
This format is very efficient in terms of disk input/output operations when only the necessary columns are read.
● RCFile (Record Columnar File):
RCFile is a columnar format that divides data into groups of rows, and inside each group the data is stored in columns.
● ORC (Optimized Row Columnar):
ORC is considered an evolution of the RCFile format and has all its benefits alongside some improvements such as better compression and more efficient reads.
Scaling Up Vs Scaling Out
Scaling up / Vertical scaling: when we add more resources to a single machine as the load increases. For example, you need 20 GB of RAM but currently your server has 10 GB of RAM, so you add extra RAM to the same server to meet the need.
Horizontal scaling / Scaling out: when you add more machines to match the resource needs. So if I already have a machine with 10 GB, I’ll add an extra machine with 10 GB of RAM.
Scaling up, or vertical scaling:
It involves obtaining a faster server with more powerful processors and more memory.
Scaling Up / Vertical Scaling: when we add more resources to a single machine when the load increases.
⚫ When running massive data centers, you may face the need to increase your machine’s capacity to run larger workloads. In this case, apply vertical scaling, or scale-up.
⚫ Scale-up is a simple method of increasing your computing capacity by adding additional resources, such as a central processing unit (CPU) and dynamic random-access memory (DRAM), to on-premises servers, or improving the performance of your disk by changing it to a faster one.
Scaling Out
⚫ When you add more machines to match the resource needs, it is called horizontal scaling.
⚫ A few storage hardware units and controllers/nodes/machines/servers would be added in order to increase capacity.
⚫ It involves adding servers for parallel computing. The scale-out technique is a long-term solution, as more and more servers may be added when needed.
⚫ This approach is popular among companies such as Amazon, Uber, and Netflix that want to provide customers all over the world with the same user experience. (Instead of buying one powerful machine, horizontal scaling means adding several simpler, commodity machines.)
Hadoop uses the scale-out approach to improve performance.
⚫ Incremental, scale-out architecture.
Analyzing Data with Hadoop
Prerequisites for using Hadoop
• Linux-based operating systems like Ubuntu or Debian are preferred for setting up Hadoop.
Analysing Data with Hadoop
There are four main libraries in Hadoop.
1. Hadoop Common: This provides utilities used by all other modules in Hadoop.
2. Hadoop Distributed File System (HDFS): This stores data and maintains records over various machines or clusters. It also allows the data to be stored in an accessible format.
3. Hadoop MapReduce: This works as a parallel framework for scheduling and processing the data.
4. Hadoop YARN: This is an acronym for Yet Another Resource Negotiator. It manages cluster resources for the processes running over Hadoop, taking over the resource-management role that MapReduce’s JobTracker played in Hadoop 1.x.
Hadoop Ecosystem
⚫ Kafka and Flume
Flume and Kafka can both act as the event backbone for real-time event processing.
Kafka can process and monitor data in distributed systems, whereas Flume gathers data from distributed systems to land it in a centralized data store.
⚫ Oozie
Apache Oozie is a Java web application used to schedule Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work.
Hive and Pig
Hive uses HQL (Hive Query Language), which is similar to SQL.
They are popular choices that provide SQL-like and procedural data-flow-like languages, respectively.
Pig Hadoop is an abstraction over MapReduce. It is a tool/platform used to analyze larger sets of data by representing them as data flows.
HBase
⚫ It is also a popular way to store and analyze data in HDFS.
⚫ It is a column-oriented database, and unlike MapReduce, it provides random read and write access to data with low latency.
⚫ MapReduce jobs can read and write data in HBase tables (a short client sketch follows this list).
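A brief sketch of HBase's low-latency random reads and writes through the HBase Java client. The table name "demo", column family "cf", and row key are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomAccessDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("demo"))) {

            // Random write: put one cell into row "row1"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value1"));
            table.put(put);

            // Random read: fetch that row back with low latency
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            System.out.println("Read back: " + Bytes.toString(value));
        }
    }
}
```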
HADOOP STREAMING
⚫ “Streaming is a technique for transferring data so that it can be processed at a steady rate as a continuous stream/flow of data.”
⚫ It is a feature that comes with a Hadoop distribution and allows developers or programmers to write Map-Reduce programs using different programming languages like Ruby, Perl, Python, C++, etc.
⚫ Hadoop Streaming supports the execution of Java as well as non-Java MapReduce jobs over the Hadoop cluster.
HADOOP STREAMING
⚫ We can use any language that can read from standard input (STDIN), like keyboard input, and write using standard output (STDOUT).
⚫ You have Apache Flume, which can collect all of your streaming data and write it into HDFS in batches of records (say, every hour) without any loss.
⚫ Apache Kafka is an event streaming platform which combines messaging, storage, and processing of data.
⚫ Reads input data stored in HDFS and converts it into key-value pairs.
⚫ We have an Input Reader which is responsible for reading the input data and producing the list of key-value pairs. The format depends on the input type (text files, CSV, JSON).
⚫ The input key-value pairs are passed to an external Mapper program (e.g., a Python script).
⚫ This script reads from STDIN (Standard Input) and processes the data. The Mapper script outputs intermediate key-value pairs via STDOUT (Standard Output).
⚫ Intermediate key-value pairs (Shuffle & Sort, handled by Hadoop): Hadoop automatically groups, sorts, and shuffles the intermediate key-value pairs. All occurrences of the same key are combined before going to the Reducer.
⚫ The grouped key-value pairs are sent to an external Reducer program. This script reads from STDIN, aggregates the values, and writes the results to STDOUT.
⚫ The final output from the Reducer is written back to HDFS in a structured format (CSV, JSON, or plain text). (A minimal streaming mapper sketch in Java follows.)
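Because the streaming contract is simply "read records from STDIN, emit tab-separated key-value pairs on STDOUT", any executable can serve as the Mapper. Below is a minimal word-count-style mapper sketched in Java (class name arbitrary); it would be passed to the hadoop-streaming jar through its -mapper option, and a matching reducer would read the grouped pairs from STDIN and sum the counts.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class StreamingWordCountMapper {
    public static void main(String[] args) throws Exception {
        // Hadoop Streaming feeds each input record to this process via STDIN
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            // Emit one intermediate key-value pair per word: "<word>\t1"
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    System.out.println(word + "\t1");   // tab separates key and value
                }
            }
        }
        // Hadoop's shuffle & sort then groups these pairs by key for the reducer
    }
}
```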
HADOOP PIPES
⚫ Hadoop Pipes is the C++ interface to Hadoop MapReduce; unlike Streaming, which uses standard input and output, Pipes uses sockets as the channel between the task tracker and the C++ map/reduce code.
Streaming and Piping
Read and write data in HDFS
⚫ You can execute various read and write operations, such as creating a directory, providing permissions, copying files, updating files, deleting files, etc.
File Sizes, Block Sizes, and Block Abstraction in HDFS
File Size in HDFS
HDFS is designed to store very large files (terabytes to petabytes in size).
Since HDFS is optimized for large-scale data processing, small files are not efficient due to the overhead in metadata management.
Files stored in HDFS are divided into fixed-size blocks, which are then distributed across different nodes in the cluster.
Block Size in HDFS
In traditional file systems, block sizes are typically 4 KB or 8 KB, but in HDFS, blocks are much larger (default 128 MB, often configured to 256 MB).
The large block size helps reduce the overhead of metadata stored in the NameNode.
Blocks are replicated across multiple nodes (default: 3 replicas) to ensure fault tolerance and high availability.
The block size can be configured when a file is written to HDFS (a write-time example follows).
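A small sketch of how a writer can override the block size and replication factor per file through the Java FileSystem API. The 256 MB block size and the path /data/big-output.bin are illustrative values, not defaults mandated by the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/data/big-output.bin");   // illustrative path
        long blockSize = 256L * 1024 * 1024;            // 256 MB instead of the default
        short replication = 3;                          // default replication factor
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);

        // FileSystem.create lets the writer override block size and replication per file
        try (FSDataOutputStream out =
                 fs.create(file, true, bufferSize, replication, blockSize)) {
            out.writeBytes("HDFS splits this file into 256 MB blocks as it grows.\n");
        }
        fs.close();
    }
}
```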
Block Abstraction in HDFS
When a file is stored in HDFS, it is divided into multiple blocks, but users interact with the file as a single entity.
Physical Storage: Each block is stored independently on different nodes in the cluster.
Replication: Blocks are replicated across multiple nodes based on the replication factor (default: 3) for data reliability.
Fault Tolerance: If a node fails, the missing block(s) can be retrieved from another node containing a replica.
Distributed Processing: Since HDFS integrates with MapReduce and other big data frameworks, the block abstraction helps optimize parallel processing by allowing different nodes to process different blocks of a file simultaneously.
Read and write data in HDFS
HDFS Operations to Read a File
⚫ To read any file from HDFS, you have to interact with the NameNode, as it stores the metadata about the DataNodes.
⚫ The user gets a token from the NameNode that specifies the address where the data is stored.
⚫ You can put a read request to the NameNode for a particular block location through the distributed file system client.
⚫ The NameNode will then check your privilege to access the DataNode and allows you to read the address block if the access is valid (a code sketch follows this list).
Read & Write Operations in
HDFS
⚫ You can execute various reading, writing
operations such as creating a directory,
providing permissions, copying files,
updating files, deleting, etc.
⚫ HDFS Operations to Read the file
⚫ To read any file from the HDFS, you have to
interact with the NameNode as it stores
the metadata about the DataNodes.
⚫ The user gets a token from the NameNode
and that specifies the address where the data is
stored.
⚫ You can put a read request to NameNode for a
particular block location through distributed file
systems.
⚫ The NameNode will then check your privilege to
access the DataNode and allows you to read the
address block if the access is valid. 74
Challenges of HDFS
Analysing Data with Hadoop
⚫ To increase developer productivity, several higher-level languages and APIs have been created that abstract away the low-level details of the MapReduce programming model.
⚫ There are several choices available for writing data analysis jobs:
  The Hive and Pig projects
  HBase