Chapter 2
BIG DATA PROCESSING
Big Data technologies
• Big data technologies are essential for offering more precise analysis, which can
lead to more concrete decision making, resulting in better operational
efficiencies, cost reductions, and reduced risks for the business.
• To harness the power of big data, you need an infrastructure that can
handle and process huge volumes of structured and unstructured data in real
time and can preserve data privacy and security.
• Today, the various architectures and papers contributed by developers
across the world have culminated in several open-source projects under the
Apache Software Foundation and the NoSQL movement.
• These technologies include Big Data processing platforms such as Hadoop,
Hive, HBase, Cassandra, and MapReduce.
• NoSQL platforms include MongoDB, Neo4j, Riak, Amazon DynamoDB,
MemcachedDB, BerkeleyDB, Voldemort, and many more.
Distributed Data Processing
• Distributed data processing has been in
existence since the late 1970s.
• The primary concept was to replicate the
DBMS in a master–slave configuration and
process data across multiple instances.
• Each slave would engage in a two-phase
commit with its master in a query processing
situation.
Why did distributed data processing fail to meet the
requirements in the relational data processing
architecture?
• Complex architectures for consistency management
• Latencies across the system
• Slow networks
• Infrastructure cost
• Complex data processing and transformation
requirements
Client–Server Data Processing
Benefits:
• Centralization of administration, security, and setup.
• Backup and recovery of data are inexpensive, since outages at the server
or a client can be restored centrally.
• The infrastructure can be scaled by adding more server or client capacity,
although the scalability is not linear.
• Accessibility of the server from heterogeneous platforms locally or remotely.
• Clients can use servers for different types of processing.
Limitations:
• The server is the central point of failure.
• Very limited scalability.
• Performance can degrade with network congestion.
• When too many clients access a single server, data cannot be processed quickly.
Big Data Processing Requirements
Volume:
• Size of data to be processed is large—it needs
to be broken into manageable chunks.
• Data needs to be processed in parallel across
multiple systems.
• Data needs to be processed across several
program modules simultaneously.
Velocity:
• Data needs to be processed at streaming speeds during data collection.
• Data needs to be processed from multiple acquisition points.
Variety:
• Data of different formats needs to be processed.
• Data of different types needs to be processed.
• Data of different structures needs to be processed.
• Data from different regions needs to be processed.
Ambiguity:
• Big Data is ambiguous by nature due to the lack of relevant metadata and context in
many cases. An example is the use of M and F in a sentence—it can mean,
respectively, Monday and Friday, male and female, or mother and father.
• Big Data that is within the corporation also exhibits this ambiguity to a lesser degree.
For example, employment agreements have standard and custom sections and the
latter is ambiguous without the right context.
Complexity:
• The complexity of Big Data requires many algorithms to process data quickly and
efficiently.
• Several types of data need multipass processing and scalability is extremely
important.
Google File System
https://youtu.be/eRgFNW4QFDc
• Developed by Google to meet Google’s growing
data processing requirements.
• It is a scalable distributed file system.
• GFS is tailored to Google’s data and storage
requirements, such as the search engine, which
produces large amounts of data that need to be stored.
• The main purpose behind the design of GFS is to
meet Google’s huge cluster requirements without
placing extra load on applications.
• Google organized the GFS into clusters of computers.
• Each cluster might contain hundreds or even thousands
of machines. Within GFS clusters there are three kinds
of entities: clients, master servers and chunkservers.
• "client" refers to any entity that makes a file request.
• Requests can range from retrieving and manipulating
existing files to creating new files on the system.
• Clients can be other computers or computer
applications.
• You can think of clients as the customers of the GFS.
GFS Architecture
Master Server
• The master server acts as the coordinator for the cluster.
• The master's duties include maintaining an operation log, which keeps track of
the activities of the master's cluster.
• The operation log helps keep service interruptions to a minimum -- if the
master server crashes, a replacement server that has monitored the operation
log can take its place.
• The master server also keeps track of metadata, which is the information that
describes chunks.
• The metadata tells the master server to which files the chunks belong and
where they fit within the overall file.
• Upon startup, the master polls all the chunkservers in its cluster.
• The chunkservers respond by telling the master server the contents of their
inventories.
• From that moment on, the master server keeps track of the location of chunks
within the cluster.
• There's only one active master server per cluster at any one time (though each
cluster has multiple copies of the master server in case of a hardware failure).
Chunkservers
• Chunkservers are the workhorses of the GFS.
• They're responsible for storing the 64-MB file chunks.
• The chunkservers don't send chunks to the master
server.
• Instead, they send requested chunks directly to the
client.
• The GFS copies every chunk multiple times and stores it
on different chunkservers.
• Each copy is called a replica.
• By default, the GFS makes three replicas per chunk, but
users can change the setting and make more or fewer
replicas if desired.
A GFS cluster:
• A single master
• Multiple chunk servers (workers or slaves) per
master
• Accessed by multiple clients
• Running on commodity Linux machines
A file:
• Represented as fixed-sized chunks
• Labeled with 64-bit unique global IDs
• Stored at chunk servers and three-way mirrored
across chunk servers
• In the GFS cluster, input data files are divided into
chunks (64 MB is the standard chunk size), each
assigned its unique 64-bit handle, and stored on
local chunk server systems as files.
• To ensure fault tolerance and scalability, each chunk
is replicated at least once on another server, and the
default design is to create three copies of a chunk.
• The role of the master is to communicate to clients
which chunk servers have which chunks and their
metadata information.
• Clients’ tasks then interact directly with chunk
servers for all subsequent operations, and use the
master only in a minimal fashion.
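To make the chunking arithmetic concrete, here is a purely illustrative sketch in Java; GFS is internal to Google and exposes no public API, so the class and method names below are invented for this example.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: shows how a file of a given size maps onto fixed-size
// 64 MB chunks, each identified by a (stand-in) 64-bit handle.
public class ChunkingSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB standard chunk size

    public static List<Long> assignChunkHandles(long fileSizeBytes) {
        // Number of chunks = ceiling(fileSize / 64 MB)
        long numChunks = (fileSizeBytes + CHUNK_SIZE - 1) / CHUNK_SIZE;
        List<Long> handles = new ArrayList<>();
        Random rng = new Random();
        for (long i = 0; i < numChunks; i++) {
            handles.add(rng.nextLong());  // stand-in for a globally unique 64-bit handle
        }
        return handles;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;  // a 200 MB file
        System.out.println("Chunks needed: " + assignChunkHandles(fileSize).size());  // 4
    }
}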
• Another important issue to understand in the GFS architecture is the single point
of failure (SPOF) of the master node and all the metadata that keeps track of the
chunks and their state.
• To avoid this situation, GFS was designed to have the master keep data in
memory for speed, keep a log on the master’s local disk, and replicate the disk
across remote nodes.
• This way if there is a crash in the master node, a shadow can be up and running
almost instantly.
• The master stores three types of metadata:
1. File and chunk names or namespaces.
2. Mapping from files to chunks (i.e., the chunks that make up each file).
3. Locations of each chunk’s replicas. The replica locations for each chunk are
stored on the local chunk server apart from being replicated, and the
information of the replications is provided to the master at startup or when a
chunk server is added to a cluster.
• Since the master controls the chunk placement, it always updates metadata as
new chunks get written.
• To recover from any corruption, GFS appends data as it is
available rather than updates an existing data set; this provides the
ability to recover from corruption or failure quickly.
• When a corruption is detected, with a combination of frequent
checkpoints, snapshots, and replicas, data is recovered with
minimal chance of data loss.
The GFS architecture has the following
strengths:
● Availability:
1. Triple replication–based redundancy (or more if you choose).
2. Chunk replication.
3. Rapid failovers for any master failure.
4. Automatic replication management.
● Performance:
1. The biggest workload for GFS is reads on large data sets, which, based on the architecture
discussion, will be a nonissue.
2. There are minimal writes to the chunks directly, thus providing auto availability.
● Management:
1. GFS manages itself through multiple failure modes.
2. Automatic load balancing.
3. Storage management and pooling.
4. Chunk management.
5. Failover management.
● Cost:
1. Cost is not a constraint due to the use of commodity hardware and Linux platforms.
Hadoop
• The Hadoop framework application works in an
environment that provides distributed storage and
computation across clusters of computers.
• Hadoop is designed to scale up from a single server to
thousands of machines, each offering local
computation and storage.
• Hadoop is an open source, Java-based programming
framework that supports the processing and storage
of extremely large data sets in a distributed
computing environment.
Hadoop Architecture
Components of Hadoop
MapReduce
• MapReduce is a parallel programming model for
writing distributed applications devised at Google
• For efficient processing of large amounts of data
(multiterabyte datasets), on large clusters
(thousands of nodes) of commodity hardware in a
reliable, fault tolerant manner.
• The MapReduce program runs on Hadoop which is
an Apache open source framework.
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is
based on the Google File System (GFS) and
provides a distributed file system that is
designed to run on commodity hardware.
• It is highly fault tolerant and is designed to be
deployed on low cost hardware.
• It provides high throughput access to
application data and is suitable for applications
having large datasets.
• Apart from the above-mentioned two core
components, the Hadoop framework also includes
the following two modules:
• Hadoop Common:
These are Java libraries and utilities required by
other Hadoop modules.
• Hadoop YARN:
This is a framework for job scheduling and cluster
resource management.
How does it work?
• Data is initially divided into directories and files. Files are
divided into uniform-sized blocks of 64 MB or 128 MB
(preferably 128 MB).
• These files are then distributed across various cluster nodes
for further processing.
• HDFS, being on top of the local file system, supervises the
processing.
• Blocks are replicated for handling hardware failure
• Checking that the code was executed successfully
• Performing the sort that takes place between the map and
reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job
HDFS
• Hadoop Distributed File System is a block-structured file system
where each file is divided into blocks of a pre-determined size.
• These blocks are stored across a cluster of one or several machines.
• Apache Hadoop HDFS Architecture follows a Master/Slave
Architecture, where a cluster comprises a single NameNode
(Master node) and all the other nodes are DataNodes (Slave nodes).
• The HDFS architecture was designed to solve two known problems
experienced by the early developers of large-scale data processing.
• The first problem was the ability to break down the files across
multiple systems and process each piece of the file independent of
the other pieces and finally consolidate all the outputs in a single
result set.
• The second problem was fault tolerance, both at the file
processing level and at the overall system level, in distributed data
processing systems.
Namenode and Datanodes
 Master/slave architecture
 HDFS cluster consists of a single Namenode, a master server that
manages the file system namespace and regulates access to files by
clients.
 There are a number of DataNodes, usually one per node in a cluster.
 The DataNodes manage storage attached to the nodes that they run on.
 HDFS exposes a file system namespace and allows user data to be
stored in files.
 A file is split into one or more blocks, and the set of blocks is stored in
DataNodes.
 DataNodes serve read and write requests and perform block creation,
deletion, and replication upon instruction from the Namenode.
HDFS Architecture
[Figure: HDFS architecture — clients issue metadata operations (name, replicas, e.g., /home/foo/data) to the Namenode; block reads and writes go directly to Datanodes on Rack 1 and Rack 2, with blocks replicated across racks.]
File System Namespace
• Hierarchical file system with directories and files
• Create, remove, move, rename etc.
• The Namenode maintains the file system namespace.
• Any metadata changes to the file system are
recorded by the Namenode.
• An application can specify the number of replicas
of a file to keep: the replication factor of the file.
This information is stored in the Namenode.
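The namespace operations listed above (create, remove, move, rename) correspond to calls on Hadoop's org.apache.hadoop.fs.FileSystem API. A minimal sketch, assuming a reachable HDFS cluster; the paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // handle to the configured file system

        Path dir = new Path("/user/demo/reports");       // placeholder path
        fs.mkdirs(dir);                                  // create a directory
        fs.rename(dir, new Path("/user/demo/archive"));  // move/rename within the namespace
        fs.delete(new Path("/user/demo/archive"), true); // remove (recursive)

        fs.close();
    }
}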
Data Replication
 HDFS is designed to store very large files across
machines in a large cluster.
 Each file is a sequence of blocks.
 All blocks in the file except the last are of the same size.
 Blocks are replicated for fault tolerance.
 Block size and replicas are configurable per file.
 The Namenode receives a Heartbeat and a BlockReport
from each DataNode in the cluster.
 BlockReport contains all the blocks on a Datanode.
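Because block size and replication are configurable per file, a client can set them when creating a file and change the replication factor later. A hedged sketch using the public FileSystem API; the path, replication values, and block size shown are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/events.log");    // illustrative path
        // Create the file with an explicit replication factor (2) and block size (128 MB).
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 2, 128L * 1024 * 1024);
        out.writeUTF("sample record");
        out.close();

        // Later, raise the replication factor of the existing file to 3.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}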
Datanode
• A Datanode stores data in files in its local file system.
• A Datanode has no knowledge about the HDFS filesystem.
• It stores each block of HDFS data in a separate file.
• Datanode does not create all files in the same directory.
• A typical HDFS cluster can have thousands of DataNodes and tens of
thousands of HDFS clients per cluster, since each DataNode may execute
multiple application tasks simultaneously.
• The DataNodes are responsible for managing read and write requests
from the file system’s clients, maintaining blocks, and performing
replication as directed by the NameNode.
• The size of the data file equals the actual length of the block. This means
if a block is half full it needs only half of the space of a full block on the
local drive, thereby optimizing storage for compactness; unlike in a regular
file system, no extra space is consumed by the block.
• Image : An image represents the metadata of the namespace (inodes and
lists of blocks).
• On startup, the NameNode pins the entire namespace image in memory.
The in-memory persistence enables the NameNode to service multiple
client requests concurrently.
• Journal : The journal represents the modification log of the image in the
local host’s native file system.
• During normal operations, each client transaction is recorded in the journal,
and the journal file is flushed and synced before the acknowledgment is
sent to the client. The NameNode upon startup or from a recovery can
replay this journal.
• Checkpoint : To enable recovery, the persistent record of the image is
also stored in the local host’s native files system and is called a
checkpoint.
• Once the system starts up, the NameNode never modifies or updates the
checkpoint file.
• A new checkpoint file can be created during the next startup, on a restart,
or on demand when requested by the administrator or by the
CheckpointNode.
Checkpoint Node and Backup Node
• There are two roles that a NameNode can be
designated to perform apart from servicing client
requests and managing Data Nodes.
• These roles are specified during startup and can
be the Checkpoint Node or the Backup Node.
Checkpoint Node
• The Checkpoint Node serves as a journal-capture
architecture to create a recovery mechanism for the
NameNode.
• The Checkpoint Node combines the existing
checkpoint and journal to create a new checkpoint
and an empty journal at specific intervals.
• It returns the new checkpoint to the NameNode.
• The Checkpoint Node runs on a different host from
the NameNode since it has the same memory
requirements as the NameNode.
• This mechanism provides protection against NameNode failures.
Backup Node
• The Backup Node can be considered as a read-only NameNode.
• It contains all file system metadata information except for block
locations.
• It accepts a stream of namespace transactions from the active
NameNode and saves them to its own storage directories, and
applies these transactions to its own namespace image in its
memory.
• If the NameNode fails, the Backup Node’s image in memory and the
checkpoint on disk are a record of the latest namespace state and
can be used to create a checkpoint for recovery.
• Creating a checkpoint from a Backup Node is very efficient as it
processes the entire image in its own disk and memory.
• A Backup Node can perform all operations of the regular
NameNode that do not involve modification of the namespace or
management of block locations.
MapReduce
• The key features of MapReduce that make it the
interface on Hadoop or Cassandra include:
1. Automatic parallelization
2. Automatic distribution
3. Fault-tolerance
4. Status and monitoring tools
5. Easy abstraction for programmers
6. Programming language flexibility
7. Extensibility
MapReduce Programming Model
• MapReduce is based on functional programming models largely from Lisp.
Typically, the users will implement two functions:
Map (in_key, in_value) -> (out_key, intermediate_value) list
• The Map function written by the user will receive an input pair of keys and values,
and after the computation cycles, will produce a set of intermediate key-value
pairs.
• Library functions are then used to group together all intermediate values
associated with an intermediate key I and pass them to the Reduce function.
Reduce (out_key, intermediate_value list) -> out_value list
• The Reduce function written by the user will accept an intermediate key I, and the
set of values for the key.
• It will merge together these values to form a possibly smaller set of values.
• Reducer outputs are just zero or one output value per invocation.
• The intermediate values are supplied to the Reduce function via an iterator.
• The iterator allows us to handle large lists of values that cannot fit in
memory or be processed in a single pass.
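To make the Map and Reduce signatures concrete, the canonical word-count pattern is sketched below using Hadoop's org.apache.hadoop.mapreduce API. Treat it as an illustrative sketch (the lowercasing of tokens is an assumption for this example, not part of the model itself):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(in_key, in_value) -> list of (out_key, intermediate_value):
// for each word in a line, emit (word, 1).
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());   // normalize case (assumption)
            context.write(word, ONE);
        }
    }
}

// Reduce(out_key, list of intermediate_values) -> out_value:
// sum the counts supplied by the framework's iterator for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}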
MapReduce Architecture
• The main components of this architecture include:
• Mapper—maps input key-value pairs to a set of
intermediate key-value pairs.
• For an input pair, the mapper can map to zero or
many output pairs. By default, the mapper spawns
one map task for each input split.
• Reducer—performs a number of tasks:
Sort and group mapper outputs.
Shuffle partitions.
Perform secondary sorting as necessary.
Manage overrides specified by users for grouping
and partitioning.
• Reporter—is used to report progress, set application-level status messages,
update any user set counters, and indicate long running tasks or jobs are alive.
• Combiner—an optional performance booster that can be specified to perform
local aggregation of the intermediate outputs to manage the amount of data
transferred from the Mapper to the Reducer.
• Partitioner—controls the partitioning of the keys of the intermediate map
outputs. The key (or a subset of the key) is used to derive the partition and default
partitions are created by a hash function. The total number of partitions will be
the same as the number of reduce tasks for the job.
• Output collector—collects the output of Mappers and Reducers.
• Job configuration—is the primary user interface to manage MapReduce jobs.
• It is typically used to specify the Mapper, Combiner, Partitioner, Reducer,
InputFormat, OutputFormat, and OutputCommitter for every job.
• It also indicates the set of input files and where the output files should be written.
Optionally used to specify other advanced options for the job such as the
comparator to be used, files to be put in the DistributedCache, and compression on
intermediate and/or final job outputs.
• It is also used to specify debugging via user-provided scripts, whether job tasks can be
executed in a speculative manner, the maximum number of attempts per task for
any possible failure, and the percentage of task failures that can be tolerated by the
job overall.
• Output committer—is used to manage the commit for jobs and tasks in
MapReduce. Key tasks executed are:
• Set up the job during initialization. For example, create the intermediate directory
for the job during the initialization of the job.
• Clean up the job after the job completion. For example, remove the temporary
output directory after the job completion.
• Set up any task temporary output.
• Check whether a task needs a commit. This will avoid overheads on unnecessary
commits.
• Commit the task output on completion.
• On failure, discard the task commit and clean up all intermediate results, memory
release, and other user-specified tasks.
• Job input:
• Specifies the input format for a Map/Reduce job.
• Validate the input specification of the job.
• Split up the input file(s) into logical instances to be assigned to an individual
Mapper.
• Provide input records from the logical splits for processing by the Mapper.
• Memory management, JVM reuse, and compression are managed with the job
configuration set of classes.
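The Mapper, Combiner, Partitioner, Reducer, and input/output paths described above are wired together through the job configuration in a driver class. A minimal sketch, reusing the word-count classes from the earlier sketch; the paths and the number of reduce tasks are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);       // map phase
        job.setCombinerClass(IntSumReducer.class);       // optional local aggregation
        job.setPartitionerClass(HashPartitioner.class);  // default hash partitioning, shown explicitly
        job.setReducerClass(IntSumReducer.class);        // reduce phase
        job.setNumReduceTasks(2);                        // number of partitions == reduce tasks

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // placeholder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}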
Example
Input: Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN.
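Assuming the intent is a case-insensitive word count (as in the mapper sketch earlier, which lowercases each token), the map phase emits an intermediate (word, 1) pair for each of the 18 tokens, the framework groups the pairs by key during the shuffle and sort, and the reduce phase sums the counts to produce: bus 7, car 7, train 4.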
Anatomy of File Write and Read
• HDFS has a master and slave kind of architecture.
• Namenode acts as master and Datanodes as worker.
• All the metadata information is with namenode and the
original data is stored on the datanodes.
• Keeping all this in mind, the steps below describe how data
flows between a client interacting with HDFS, i.e., the
Namenode and the Datanodes.
• The following steps are involved in reading the file from HDFS:
Let’s suppose a client (an HDFS client) wants to read a file from HDFS.
Step 1: First the Client will open the file by giving a call to open() method on
FileSystem object, which for HDFS is an instance of DistributedFileSystem class.
Step 2: DistributedFileSystem calls the Namenode, using RPC (Remote Procedure
Call), to determine the locations of the blocks for the first few blocks of the file.
For each block, the NameNode returns the addresses of all the DataNodes that
have a copy of that block. The client will interact with the respective DataNodes to read
the file. The NameNode also provides a token to the client, which it shows to the DataNode
for authentication.
• The DistributedFileSystem returns an FSDataInputStream (an input
stream that supports file seeks) to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the
datanode and namenode I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored
the DataNode addresses for the first few blocks in the file, then connects to the
first closest DataNode for the first block in the file.
• Step 4: Data is streamed from the DataNode back to the
client, which calls read() repeatedly on the stream.
• Step 5: When the end of the block is reached,
DFSInputStream will close the connection to the
DataNode, then find the best DataNode for the next
block. This happens transparently to the client, which
from its point of view is just reading a continuous
stream.
• Step 6: Blocks are read in order, with the
DFSInputStream opening new connections to datanodes
as the client reads through the stream. It will also call the
namenode to retrieve the datanode locations for the next
batch of blocks as needed. When the client has finished
reading, it calls close() on the FSDataInputStream.
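From the client's point of view, the open()/read()/close() sequence above reduces to a few lines against the FileSystem API. A minimal sketch; the path is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);            // DistributedFileSystem for an HDFS URI

        // Step 1: open() returns an FSDataInputStream wrapping a DFSInputStream.
        FSDataInputStream in = fs.open(new Path("/user/demo/input/data.txt")); // placeholder path
        try {
            // Steps 3-6: read() streams block data from the closest datanodes, in order.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);                     // close() on the stream
        }
        fs.close();
    }
}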
Now we will look at what happens when you write a File in
HDFS.
• The DistributedFileSystem object makes an RPC call to the namenode to create a new file in
the filesystem namespace, with no blocks associated with it.
• The namenode performs various checks, such as (a) whether the client has the required
permissions to create the file and (b) whether the file already exists. If a check fails, it
throws an IOException to the client.
• Once the file is registered with the namenode, the client gets an FSDataOutputStream,
which in turn wraps a DFSOutputStream object for the client to start writing data to.
DFSOutputStream handles communication with the datanodes and the namenode.
• As the client writes data, DFSOutputStream splits it into packets and writes them to its
internal data queue; it also maintains an acknowledgement queue.
• The data queue is then consumed by a DataStreamer process, which is responsible for
asking the namenode to allocate new blocks by picking a list of suitable datanodes to
store the replicas.
• The list of datanodes forms a pipeline; assuming a
replication factor of three, there will be three nodes
in the pipeline.
• The DataStreamer streams the packets to the first
datanode in the pipeline, which stores each packet
and forwards it to the second datanode in the
pipeline. Similarly, the second node stores the
packet and forwards it to the next (last) datanode
in the pipeline.
• Once every datanode in the pipeline has
acknowledged a packet, the packet is removed from
the acknowledgement queue.
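All of the packet queuing and pipelining described above is hidden behind create() and write() on the client side. A minimal sketch of the client code; the path and content are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() issues the RPC to the namenode and returns an FSDataOutputStream
        // that wraps a DFSOutputStream managing the packet/ack queues and the pipeline.
        Path file = new Path("/user/demo/output/log.txt");   // placeholder path
        FSDataOutputStream out = fs.create(file);
        try {
            out.writeBytes("first record\n");   // split into packets and streamed to the pipeline
            out.hflush();                       // optionally make the data visible to readers
        } finally {
            out.close();                        // flushes remaining packets and completes the file
        }
        fs.close();
    }
}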
• Now, what happens when one of the machines in
the pipeline running a datanode process fails?
Hadoop has built-in functionality to handle this
scenario. If a datanode fails while data is being
written to it, the following actions are taken, all of
which are transparent to the client writing the data.
• First, the pipeline is closed, and any packets in the ack queue are added to the front
of the data queue so that datanodes that are downstream from the failed node will
not miss any packets.
• The current block on the good datanodes is given a new identity, which is
communicated to the namenode, so that the partial block on the failed datanode will
be deleted if the failed datanode recovers later on.
• The failed datanode is removed from the pipeline, and the remainder of the block’s
data is written to the two good datanodes in the pipeline.
• The namenode notices that the block is under-replicated, and it arranges for a
further replica to be created on another node. Subsequent blocks are then treated as
normal.
It’s possible, but unlikely, that multiple datanodes fail while a block is being written.
• As long as dfs.replication.min replicas (which defaults to one) are written, the write
will succeed, and the block will be asynchronously replicated across the cluster until
its target replication factor is reached (dfs.replication, which defaults to three).
Ad

More Related Content

Similar to Chaptor 2- Big Data Processing in big data technologies (20)

Google File System
Google File SystemGoogle File System
Google File System
DreamJobs1
 
Lecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptxLecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptx
Amanuelmergia
 
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
raghdooosh
 
Gfs sosp2003
Gfs sosp2003Gfs sosp2003
Gfs sosp2003
睿琦 崔
 
Gfs
GfsGfs
Gfs
Shahbaz Sidhu
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Gfs google-file-system-13331
Gfs google-file-system-13331Gfs google-file-system-13331
Gfs google-file-system-13331
Fengchang Xie
 
Designing your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with PostgresDesigning your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with Postgres
Ozgun Erdogan
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
Taras Matyashovsky
 
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyunit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
0710harish
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Mohit Tare
 
Advanced Topics on Database - Unit-1 AU17
Advanced Topics on Database - Unit-1 AU17Advanced Topics on Database - Unit-1 AU17
Advanced Topics on Database - Unit-1 AU17
LOGANATHANK24
 
Big Data for QAs
Big Data for QAsBig Data for QAs
Big Data for QAs
Ahmed Misbah
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
musrath mohammad
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
Lokesh Ramaswamy
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering students
rishavkumar1402
 
Operating system memory management
Operating system memory managementOperating system memory management
Operating system memory management
rprajat007
 
High performance computing
High performance computingHigh performance computing
High performance computing
punjab engineering college, chandigarh
 
Training Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed cachingTraining Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed caching
OutSystems
 
lecture-13.pptx
lecture-13.pptxlecture-13.pptx
lecture-13.pptx
laiba29012
 
Google File System
Google File SystemGoogle File System
Google File System
DreamJobs1
 
Lecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptxLecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptx
Amanuelmergia
 
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
raghdooosh
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Gfs google-file-system-13331
Gfs google-file-system-13331Gfs google-file-system-13331
Gfs google-file-system-13331
Fengchang Xie
 
Designing your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with PostgresDesigning your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with Postgres
Ozgun Erdogan
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
Taras Matyashovsky
 
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyunit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
0710harish
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Mohit Tare
 
Advanced Topics on Database - Unit-1 AU17
Advanced Topics on Database - Unit-1 AU17Advanced Topics on Database - Unit-1 AU17
Advanced Topics on Database - Unit-1 AU17
LOGANATHANK24
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering students
rishavkumar1402
 
Operating system memory management
Operating system memory managementOperating system memory management
Operating system memory management
rprajat007
 
Training Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed cachingTraining Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed caching
OutSystems
 
lecture-13.pptx
lecture-13.pptxlecture-13.pptx
lecture-13.pptx
laiba29012
 

Recently uploaded (20)

Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Journal of Soft Computing in Civil Engineering
 
railway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forgingrailway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forging
Javad Kadkhodapour
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Journal of Soft Computing in Civil Engineering
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
railway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forgingrailway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forging
Javad Kadkhodapour
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
Ad

Chaptor 2- Big Data Processing in big data technologies

  • 1. Chapter 2 BIG DATA PROCESSING
  • 2. Big Data technologies • Big data technologies are essential for offering more precise analysis, which may lead to more tangible decision making resulting in better operational efficiencies, cost reductions, and reduced risks for the business. • To control the power of big data you would need an infrastructure that can handle and process huge volumes of structured and unstructured data in real time and can preserve data privacy and security. • Today, the various architectures and papers that were contributed by these and other developers across the world have culminated into several open-source projects under the Apache Software Foundation and the NoSQL movement. • All of these technologies have been identified as Big Data processing platforms, including Hadoop, Hive, HBase, Cassandra, and Map Reduce. • NoSQL platforms include MongoDB, Neo4J, Riak, Amazon DynamoDB, MemcachedDB, BerkleyDB, Voldemort, and many more.
  • 3. Distributed Data Processing • Distributed data processing has been in existence since the late 1970s. • The primary concept was to replicate the DBMS in a master–slave configuration and process data across multiple instances • Each slave would engage in a two-phase commit with its master in a query processing situation.
  • 4. Why did distributed data processing fail to meet the requirements in the relational data processing architecture? • Complex architectures for consistency management • Latencies across the system • Slow networks • Infrastructure cost • Complex data processing and transformation requirements
  • 5. Client–Server Data Processing Benefits: • Centralization of administration, security, and setup. • Back-up and recovery of data is inexpensive, as outages can occur at the server or a client and can be restored. • Scalability of infrastructure by adding more server capacity or client capacity can be accomplished. The scalability is not linear. • Accessibility of the server from heterogeneous platforms locally or remotely. • Clients can use servers for different types of processing. Limitations: • The server is the central point of failure. • Very limited scalability. • Performance can degrade with network congestion. • Too many clients accessing a single server cannot process data in a quick time.
  • 6. Big Data Processing Requirements Volume: • Size of data to be processed is large—it needs to be broken into manageable chunks. • Data needs to be processed in parallel across multiple systems. • Data needs to be processed across several program modules simultaneously.
  • 7. Velocity: • Data needs to be processed at streaming speeds during data collection. • Data needs to be processed for multiple acquisition points. Variety: • Data of different formats needs to be processed. • Data of different types needs to be processed. • Data of different structures needs to be processed. • Data from different regions needs to be processed. Ambiguity: • Big Data is ambiguous by nature due to the lack of relevant metadata and context in many cases. An example is the use of M and F in a sentence—it can mean, respectively, Monday and Friday, male and female, or mother and father. • Big Data that is within the corporation also exhibits this ambiguity to a lesser degree. For example, employment agreements have standard and custom sections and the latter is ambiguous without the right context. Complexity: • Big Data complexity needs to use many algorithms to process data quickly and efficiently. • Several types of data need multipass processing and scalability is extremely important.
  • 8. Google File System https://youtu.be/eRgFNW4QFDc • Developed by Google to hold Google’s increasing data processing requirements. • It is one of the scalable distributed file system. • GFS is improved to hold the data used by Google and their requirement of storage, like search engine, which produce large amounts of data that needs to be stored. • The main purpose behind the design of GFS is to hold the Google’s huge cluster requirements without making extra load on applications.
  • 9. • Google organized the GFS into clusters of computers. • Each cluster might contain hundreds or even thousands of machines. Within GFS clusters there are three kinds of entities: clients, master servers and chunkservers. • "client" refers to any entity that makes a file request. • Requests can range from retrieving and manipulating existing files to creating new files on the system. • Clients can be other computers or computer applications. • You can think of clients as the customers of the GFS.
  • 16. Master Server • The master server acts as the coordinator for the cluster. • The master's duties include maintaining an operation log, which keeps track of the activities of the master's cluster. • The operation log helps keep service interruptions to a minimum -- if the master server crashes, a replacement server that has monitored the operation log can take its place. • The master server also keeps track of metadata, which is the information that describes chunks. • The metadata tells the master server to which files the chunks belong and where they fit within the overall file. • Upon startup, the master polls all the chunkservers in its cluster. • The chunkservers respond by telling the master server the contents of their inventories. • From that moment on, the master server keeps track of the location of chunks within the cluster. • There's only one active master server per cluster at any one time (though each cluster has multiple copies of the master server in case of a hardware failure)
  • 17. Chunkservers • Chunkservers are the workhorses of the GFS. • They're responsible for storing the 64-MB file chunks. • The chunkservers don't send chunks to the master server. • Instead, they send requested chunks directly to the client. • The GFS copies every chunk multiple times and stores it on different chunkservers. • Each copy is called a replica. • By default, the GFS makes three replicas per chunk, but users can change the setting and make more or fewer replicas if desired.
  • 18. A GFS cluster: • A single master • Multiple chunk servers (workers or slaves) per master • Accessed by multiple clients • Running on commodity Linux machines A file: • Represented as fixed-sized chunks • Labeled with 64-bit unique global IDs • Stored at chunk servers and three-way mirrored across chunk servers
  • 19. • In the GFS cluster, input data files are divided into chunks (64 MB is the standard chunk size), each assigned its unique 64-bit handle, and stored on local chunk server systems as files. • To ensure fault tolerance and scalability, each chunk is replicated at least once on another server, and the default design is to create three copies of a chunk. • The role of the master is to communicate to clients which chunk servers have which chunks and their metadata information. • Clients’ tasks then interact directly with chunk servers for all subsequent operations, and use the master only in a minimal fashion.
  • 20. • Another important issue to understand in the GFS architecture is the single point of failure (SPOF) of the master node and all the metadata that keeps track of the chunks and their state. • To avoid this situation, GFS was designed to have the master keep data in memory for speed, keep a log on the master’s local disk, and replicate the disk across remote nodes. • This way if there is a crash in the master node, a shadow can be up and running almost instantly. • The master stores three types of metadata: 1. File and chunk names or namespaces. 2. Mapping from files to chunks (i.e., the chunks that make up each file). 3. Locations of each chunk’s replicas. The replica locations for each chunk are stored on the local chunk server apart from being replicated, and the information of the replications is provided to the master at startup or when a chunk server is added to a cluster. • Since the master controls the chunk placement, it always updates metadata as new chunks get written.
  • 21. • To recover from any corruption, GFS appends data as it is available rather than updates an existing data set; this provides the ability to recover from corruption or failure quickly. • When a corruption is detected, with a combination of frequent checkpoints, snapshots, and replicas, data is recovered with minimal chance of data loss.
  • 22. The GFS architecture has the following strengths: ● Availability: 1. Triple replication–based redundancy (or more if you choose). 2. Chunk replication. 3. Rapid failovers for any master failure. 4. Automatic replication management. ● Performance: 1. The biggest workload for GFS is read-on large data sets, which based on the architecture discussion, will be a nonissue. 2. There are minimal writes to the chunks directly, thus providing auto availability. ● Management: 1. GFS manages itself through multiple failure modes. 2. Automatic load balancing. 3. Storage management and pooling. 4. Chunk management. 5. Failover management. ● Cost: 1. Is not a constraint due to use of commodity hardware and Linux platforms.
  • 23. Hadoop • The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. • Hadoop is designed to scale up from single server to thousands of machines, each offering local computation and storage. • Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment.
  • 26. Map Reduce • MapReduce is a parallel programming model for writing distributed applications devised at Google • For efficient processing of large amounts of data (multiterabyte datasets), on large clusters (thousands of nodes) of commodity hardware in a reliable, fault tolerant manner. • The MapReduce program runs on Hadoop which is an Apache open source framework.
  • 27. Hadoop Distributed File System • The Hadoop Distributed File System (HDFS)is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. • It is highly fault tolerant and is designed to be deployed on low cost hardware. • It provides high throughput access to application data and is suitable for applications having large datasets.
  • 28. • Apart from the above mentioned two core components, Hadoop framework also includes the following two modules: • Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. • Hadoop YARN: • This is a framework for job scheduling and cluster resource management.
  • 29. How it works? • Data is initially divided into directories and files. Files are divided into uniform sized blocks of 128M and 64M(preferably 128M) • These files are then distributed across various cluster nodes for further processing. • HDFS being on top of the local file system supervises the processing • Blocks are replicated for handling hardware failure • Checking that the code was executed successfully • Performing the sort that takes place between the map and reduce stages. • Sending the sorted data to a certain computer. • Writing the debugging logs for each job
  • 30. HDFS • Hadoop Distributed File System is a block-structured file system where each file is divided into blocks of a pre-determined size. • These blocks are stored across a cluster of one or several machines. • Apache Hadoop HDFS Architecture follows a Master/Slave Architecture, where a cluster comprises of a single NameNode (Master node) and all the other nodes are DataNodes (Slave nodes). • The HDFS architecture was designed to solve two known problems experienced by the early developers of large-scale data processing. • The first problem was the ability to break down the files across multiple systems and process each piece of the file independent of the other pieces and finally consolidate all the outputs in a single result set. • The second problem was the fault tolerance both at the file processing level and the overall system level in the distributed data processing systems.
  • 32. Namenode and Datanodes  Master/slave architecture  HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.  There are a number of DataNodes usually one per node in a cluster.  The DataNodes manage storage attached to the nodes that they run on.  HDFS exposes a file system namespace and allows user data to be stored in files.  A file is split into one or more blocks and set of blocks are stored in DataNodes.  DataNodes: serves read, write requests, performs block creation, deletion, and replication upon instruction from Namenode. 01/06/2025 32
  • 33. HDFS Architecture 01/06/2025 33 Namenode B replication Rack1 Rack2 Client Blocks Datanodes Datanodes Client Write Read Metadata ops Metadata(Name, replicas..) (/home/foo/data,6. .. Block ops
  • 34. File System Namespace 01/06/2025 34 • Hierarchical file system with directories and files • Create, remove, move, rename etc. • Namenode maintains the file system • Any meta information changes to the file system recorded by the Namenode. • An application can specify the number of replicas of the file needed: replication factor of the file. This information is stored in the Namenode.
  • 35. Data Replication 01/06/2025 35  HDFS is designed to store very large files across machines in a large cluster.  Each file is a sequence of blocks.  All blocks in the file except the last are of the same size.  Blocks are replicated for fault tolerance.  Block size and replicas are configurable per file.  The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.  BlockReport contains all the blocks on a Datanode.
  • 36. Datanode 01/06/2025 36 • A Datanode stores data in files in its local file system. • Datanode has no knowledge about HDFS filesystem • It stores each block of HDFS data in a separate file. • Datanode does not create all files in the same directory. • A typical HDFS cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, since each DataNode may execute multiple application tasks simultaneously. • The DataNodes are responsible for managing read and write requests from the file system’s clients, and block maintenance and perform replication as directed by the NameNode. • The size of the data file equals the actual length of the block. This means if a block is half full it needs only half of the space of the full block on the local drive, thereby optimizing storage space for compactness, and there is no extra space consumed on the block unlike a regular file system.
37.
• Image: the image represents the metadata of the namespace (inodes and lists of blocks).
• On startup, the NameNode pins the entire namespace image in memory. This in-memory copy enables the NameNode to service multiple client requests concurrently.
• Journal: the journal is the modification log of the image, kept in the local host's native file system.
• During normal operation, each client transaction is recorded in the journal, and the journal file is flushed and synced before the acknowledgment is sent to the client. The NameNode can replay this journal on startup or during recovery.
• Checkpoint: to enable recovery, a persistent record of the image is also stored in the local host's native file system and is called a checkpoint.
• Once the system starts up, the NameNode never modifies or updates the checkpoint file.
• A new checkpoint file can be created during the next startup, on a restart, or on demand when requested by the administrator or by the CheckpointNode.
38. Checkpoint Node and Backup Node
• Apart from servicing client requests and managing DataNodes, a NameNode can be designated to perform one of two additional roles.
• These roles are specified during startup and can be the Checkpoint Node or the Backup Node.
39. Checkpoint Node
• The Checkpoint Node serves as a journal-capture mechanism that creates a recovery path for the NameNode.
• The Checkpoint Node combines the existing checkpoint and journal to create a new checkpoint and an empty journal at regular intervals.
• It returns the new checkpoint to the NameNode.
• The Checkpoint Node runs on a different host from the NameNode, since it has the same memory requirements as the NameNode.
• This mechanism protects the namespace metadata and provides a recovery path if the NameNode fails.
40. Backup Node
• The Backup Node can be considered a read-only NameNode.
• It contains all file system metadata information except for block locations.
• It accepts a stream of namespace transactions from the active NameNode, saves them to its own storage directories, and applies the transactions to its own namespace image in memory.
• If the NameNode fails, the Backup Node's image in memory and the checkpoint on disk are a record of the latest namespace state and can be used to create a checkpoint for recovery.
• Creating a checkpoint from a Backup Node is very efficient, as it already has the entire image in its own memory and on its own disk.
• A Backup Node can perform all operations of the regular NameNode that do not involve modification of the namespace or management of block locations.
41. MapReduce
• The key features of MapReduce that make it the processing interface on Hadoop or Cassandra include:
1. Automatic parallelization
2. Automatic distribution
3. Fault tolerance
4. Status and monitoring tools
5. Easy abstraction for programmers
6. Programming-language flexibility
7. Extensibility
42. MapReduce Programming Model
• MapReduce is based on functional programming models, largely drawn from Lisp. Typically, the user implements two functions:

Map(in_key, in_value) -> (out_key, intermediate_value) list

• The Map function written by the user receives an input key-value pair and, after its computation, produces a set of intermediate key-value pairs.
• Library functions then group together all intermediate values associated with the same intermediate key I and pass them to the Reduce function.

Reduce(out_key, intermediate_value list) -> out_value list

• The Reduce function written by the user accepts an intermediate key I and the set of values for that key.
• It merges these values together to form a possibly smaller set of values.
• Typically, zero or one output value is produced per Reduce invocation.
• The intermediate values are supplied to the Reduce function via an iterator.
• The iterator allows the function to handle lists of values that are too large to fit in memory or to process in a single pass.
44.
• The main components of this architecture include:
• Mapper: maps input key-value pairs to a set of intermediate key-value pairs. For a given input pair, the Mapper can emit zero or many output pairs. By default, one map task is spawned for each input split.
• Reducer: performs a number of tasks: sorting and grouping the Mapper outputs, shuffling partitions, performing secondary sorting as necessary, and managing any overrides specified by users for grouping and partitioning.
45.
• Reporter: used to report progress, set application-level status messages, update any user-defined counters, and indicate that long-running tasks or jobs are still alive.
• Combiner: an optional performance booster that can be specified to perform local aggregation of the intermediate outputs, reducing the amount of data transferred from the Mapper to the Reducer.
• Partitioner: controls the partitioning of the keys of the intermediate map outputs (see the sketch below). The key (or a subset of the key) is used to derive the partition, and default partitions are created by a hash function. The total number of partitions is the same as the number of reduce tasks for the job.
• Output collector: collects the output of Mappers and Reducers.
• Job configuration: the primary user interface for managing MapReduce jobs.
• It is typically used to specify the Mapper, Combiner, Partitioner, Reducer, InputFormat, OutputFormat, and OutputCommitter for every job.
• It also indicates the set of input files and where the output files should be written. Optionally, it specifies other advanced options for the job, such as the comparator to be used, files to be placed in the DistributedCache, and compression of intermediate and/or final job outputs.
• It is also used for debugging via user-provided scripts, for controlling whether job tasks can be executed speculatively, the maximum number of attempts per task for any possible failure, and the percentage of task failures the job can tolerate overall.
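As a hypothetical illustration of the Partitioner contract, the sketch below routes keys to reduce tasks by the hash of their first letter, mirroring what the default hash partitioner does with the whole key. The class name and partitioning rule are illustrative only.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: all keys starting with the same letter
// are sent to the same reduce task.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.getLength() == 0 || numReduceTasks == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        // Mask off the sign bit before taking the modulus, as the default hash partitioner does.
        return (first & Integer.MAX_VALUE) % numReduceTasks;
    }
}

In a job driver this would be registered with job.setPartitionerClass(FirstLetterPartitioner.class); it only matters when the job runs more than one reduce task.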
46.
• Output committer: manages the commit of jobs and tasks in MapReduce. The key tasks it executes are:
• Set up the job during initialization; for example, create the intermediate directory for the job.
• Clean up the job after completion; for example, remove the temporary output directory after the job completes.
• Set up any temporary output for a task.
• Check whether a task needs a commit, which avoids the overhead of unnecessary commits.
• Commit the task output on completion.
• On failure, discard the task commit, clean up all intermediate results, release memory, and run any other user-specified tasks.
• Job input:
• Specifies the input format for a MapReduce job.
• Validates the input specification of the job.
• Splits the input file(s) into logical instances, each of which is assigned to an individual Mapper.
• Provides input records from the logical splits for processing by the Mapper.
• Memory management, JVM reuse, and compression are managed with the job configuration set of classes.
48. Example
Input: Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
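A minimal word-count sketch for this input, written against the standard Hadoop MapReduce API, is shown below. It lower-cases each token so that Bus, bus, and BUS count as the same word; the class names and input/output paths are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every comma- or space-separated token, lower-cased.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("[,\\s]+")) {
                if (!token.isEmpty()) {
                    word.set(token.toLowerCase());
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

For the input above, this would produce bus = 7, car = 7, train = 4.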
50. Anatomy of a File Write and Read
• HDFS has a master/slave architecture.
• The NameNode acts as the master and the DataNodes as workers.
• All the metadata is kept by the NameNode, and the actual data is stored on the DataNodes.
• With this in mind, the accompanying figure illustrates how data flows when a client interacts with HDFS, i.e., with the NameNode and the DataNodes.
51.
• The following steps are involved in reading a file from HDFS. Suppose a client (an HDFS client) wants to read a file from HDFS.
• Step 1: The client opens the file by calling the open() method on a FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.
• Step 2: DistributedFileSystem calls the NameNode, using RPC (Remote Procedure Call), to determine the locations of the first few blocks of the file. For each block, the NameNode returns the addresses of all the DataNodes that have a copy of that block. The client will interact with the respective DataNodes to read the file. The NameNode also provides a token to the client, which the client presents to the DataNodes for authentication.
• DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the DataNode and NameNode I/O.
• Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the DataNode addresses for the first few blocks of the file, connects to the closest DataNode holding the first block of the file.
52.
• Step 4: Data is streamed from the DataNode back to the client, which calls read() repeatedly on the stream.
• Step 5: When the end of a block is reached, DFSInputStream closes the connection to that DataNode and then finds the best DataNode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
• Step 6: Blocks are read in order, with DFSInputStream opening new connections to DataNodes as the client reads through the stream. It will also call the NameNode to retrieve the DataNode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
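From application code, this whole sequence is hidden behind the FileSystem API. The sketch below is a minimal read example; the file path is a placeholder.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.txt"); // hypothetical path
        // open() triggers the NameNode RPC; read() pulls blocks from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream file contents to stdout
        }
    }
}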
54. Now let us look at what happens when you write a file to HDFS.
• The DistributedFileSystem object makes an RPC call to the NameNode to create a new file in the file system namespace, with no blocks associated with it.
• The NameNode performs various checks: (a) whether the client has the required permissions to create the file, and (b) whether the file already exists. If either check fails, it throws an IOException to the client.
• Once the file is registered with the NameNode, the client receives an FSDataOutputStream object, which in turn wraps a DFSOutputStream object, for the client to start writing data to. DFSOutputStream handles communication with the DataNodes and the NameNode.
• As the client writes data, DFSOutputStream splits it into packets and writes them to its internal queue, the data queue; it also maintains an acknowledgement queue.
• The data queue is consumed by a DataStreamer process, which is responsible for asking the NameNode to allocate new blocks by picking a list of suitable DataNodes to store the replicas.
55.
• The list of DataNodes forms a pipeline; assuming a replication factor of three, there will be three nodes in the pipeline.
• The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline. Similarly, the second node stores the packet and forwards it to the next (and last) DataNode in the pipeline.
• Once every DataNode in the pipeline has acknowledged a packet, that packet is removed from the acknowledgement queue.
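As with reads, the packet pipeline is hidden from application code. The sketch below is a minimal write example using the FileSystem API, again with a placeholder path.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/output.txt"); // hypothetical path
        // create() registers the file with the NameNode; writes are split into
        // packets and pushed through the DataNode pipeline behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) { // true = overwrite if it exists
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        // Closing the stream flushes remaining packets and waits for acknowledgements.
    }
}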
56.
• What happens when one of the machines in the pipeline running a DataNode process fails? Hadoop has built-in functionality to handle this scenario. If a DataNode fails while data is being written to it, the following actions are taken, all transparent to the client writing the data.
57.
• First, the pipeline is closed, and any packets in the acknowledgement queue are added to the front of the data queue so that DataNodes downstream from the failed node do not miss any packets.
• The current block on the good DataNodes is given a new identity, which is communicated to the NameNode, so that the partial block on the failed DataNode can be deleted if that DataNode recovers later.
• The failed DataNode is removed from the pipeline, and the remainder of the block's data is written to the two good DataNodes in the pipeline.
• The NameNode notices that the block is under-replicated and arranges for a further replica to be created on another node. Subsequent blocks are then treated as normal.
• It is possible, but unlikely, for multiple DataNodes to fail while a block is being written. As long as dfs.replication.min replicas (default: one) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three).