Chapter 2
BIG DATA PROCESSING
Big Data technologies
• Big data technologies are essential for offering more precise analysis, which can
lead to more concrete decision making, resulting in better operational
efficiencies, cost reductions, and reduced risks for the business.
• To harness the power of big data, you need an infrastructure that can
handle and process huge volumes of structured and unstructured data in real
time and can preserve data privacy and security.
• Today, the various architectures and papers contributed by developers
across the world have culminated in several open-source projects under the
Apache Software Foundation and the NoSQL movement.
• These technologies include Big Data processing platforms such as Hadoop,
Hive, HBase, Cassandra, and MapReduce.
• NoSQL platforms include MongoDB, Neo4j, Riak, Amazon DynamoDB,
MemcachedDB, BerkeleyDB, Voldemort, and many more.
Distributed Data Processing
• Distributed data processing has been in
existence since the late 1970s.
• The primary concept was to replicate the
DBMS in a master–slave configuration and
process data across multiple instances.
• Each slave would engage in a two-phase
commit with its master in a query processing
situation.
Why did distributed data processing fail to meet the
requirements in the relational data processing
architecture?
• Complex architectures for consistency management
• Latencies across the system
• Slow networks
• Infrastructure cost
• Complex data processing and transformation
requirements
Client–Server Data Processing
Benefits:
• Centralization of administration, security, and setup.
• Backup and recovery of data are inexpensive, since outages at the server
or a client can be restored centrally.
• The infrastructure can be scaled by adding more server or client capacity,
although the scalability is not linear.
• Accessibility of the server from heterogeneous platforms locally or remotely.
• Clients can use servers for different types of processing.
Limitations:
• The server is the central point of failure.
• Very limited scalability.
• Performance can degrade with network congestion.
• When too many clients access a single server, data cannot be processed quickly.
Big Data Processing Requirements
Volume:
• Size of data to be processed is large—it needs
to be broken into manageable chunks.
• Data needs to be processed in parallel across
multiple systems.
• Data needs to be processed across several
program modules simultaneously.
Velocity:
• Data needs to be processed at streaming speeds during data collection.
• Data needs to be processed from multiple acquisition points.
Variety:
• Data of different formats needs to be processed.
• Data of different types needs to be processed.
• Data of different structures needs to be processed.
• Data from different regions needs to be processed.
Ambiguity:
• Big Data is ambiguous by nature due to the lack of relevant metadata and context in
many cases. An example is the use of M and F in a sentence—it can mean,
respectively, Monday and Friday, male and female, or mother and father.
• Big Data that is within the corporation also exhibits this ambiguity to a lesser degree.
For example, employment agreements have standard and custom sections and the
latter is ambiguous without the right context.
Complexity:
• The complexity of Big Data requires many algorithms to process data quickly and
efficiently.
• Several types of data need multipass processing and scalability is extremely
important.
Google File System
https://youtu.be/eRgFNW4QFDc
• Developed by Google to meet Google’s growing
data processing requirements.
• It is a scalable distributed file system.
• GFS is tailored to Google’s data and storage
requirements, such as the search engine, which
produces large amounts of data that need to be stored.
• The main purpose behind the design of GFS is to
meet Google’s huge cluster requirements without
placing extra load on applications.
• Google organized the GFS into clusters of computers.
• Each cluster might contain hundreds or even thousands
of machines. Within GFS clusters there are three kinds
of entities: clients, master servers and chunkservers.
• "client" refers to any entity that makes a file request.
• Requests can range from retrieving and manipulating
existing files to creating new files on the system.
• Clients can be other computers or computer
applications.
• You can think of clients as the customers of the GFS.
GFS Architecture
Master Server
• The master server acts as the coordinator for the cluster.
• The master's duties include maintaining an operation log, which keeps track of
the activities of the master's cluster.
• The operation log helps keep service interruptions to a minimum -- if the
master server crashes, a replacement server that has monitored the operation
log can take its place.
• The master server also keeps track of metadata, which is the information that
describes chunks.
• The metadata tells the master server to which files the chunks belong and
where they fit within the overall file.
• Upon startup, the master polls all the chunkservers in its cluster.
• The chunkservers respond by telling the master server the contents of their
inventories.
• From that moment on, the master server keeps track of the location of chunks
within the cluster.
• There's only one active master server per cluster at any one time (though each
cluster has multiple copies of the master server in case of a hardware failure).
Chunkservers
• Chunkservers are the workhorses of the GFS.
• They're responsible for storing the 64-MB file chunks.
• The chunkservers don't send chunks to the master
server.
• Instead, they send requested chunks directly to the
client.
• The GFS copies every chunk multiple times and stores it
on different chunkservers.
• Each copy is called a replica.
• By default, the GFS makes three replicas per chunk, but
users can change the setting and make more or fewer
replicas if desired.
A GFS cluster:
• A single master
• Multiple chunk servers (workers or slaves) per
master
• Accessed by multiple clients
• Running on commodity Linux machines
A file:
• Represented as fixed-sized chunks
• Labeled with 64-bit unique global IDs
• Stored at chunk servers and three-way mirrored
across chunk servers
• In the GFS cluster, input data files are divided into
chunks (64 MB is the standard chunk size), each
assigned its unique 64-bit handle, and stored on
local chunk server systems as files.
• To ensure fault tolerance and scalability, each chunk
is replicated at least once on another server, and the
default design is to create three copies of a chunk.
• The role of the master is to communicate to clients
which chunk servers have which chunks and their
metadata information.
• Clients’ tasks then interact directly with chunk
servers for all subsequent operations, and use the
master only in a minimal fashion.
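To make the chunking arithmetic concrete, here is a purely illustrative sketch in Java; GFS is internal to Google and exposes no public API, so the class and method names below are invented for this example.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustrative only: shows how a file of a given size maps onto fixed-size
// 64 MB chunks, each identified by a (stand-in) 64-bit handle.
public class ChunkingSketch {
    static final long CHUNK_SIZE = 64L * 1024 * 1024;  // 64 MB standard chunk size

    public static List<Long> assignChunkHandles(long fileSizeBytes) {
        // Number of chunks = ceiling(fileSize / 64 MB)
        long numChunks = (fileSizeBytes + CHUNK_SIZE - 1) / CHUNK_SIZE;
        List<Long> handles = new ArrayList<>();
        Random rng = new Random();
        for (long i = 0; i < numChunks; i++) {
            handles.add(rng.nextLong());  // stand-in for a globally unique 64-bit handle
        }
        return handles;
    }

    public static void main(String[] args) {
        long fileSize = 200L * 1024 * 1024;  // a 200 MB file
        System.out.println("Chunks needed: " + assignChunkHandles(fileSize).size());  // 4
    }
}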
• Another important issue to understand in the GFS architecture is the single point
of failure (SPOF) of the master node and all the metadata that keeps track of the
chunks and their state.
• To avoid this situation, GFS was designed to have the master keep data in
memory for speed, keep a log on the master’s local disk, and replicate the disk
across remote nodes.
• This way if there is a crash in the master node, a shadow can be up and running
almost instantly.
• The master stores three types of metadata:
1. File and chunk names or namespaces.
2. Mapping from files to chunks (i.e., the chunks that make up each file).
3. Locations of each chunk’s replicas. The replica locations for each chunk are
stored on the local chunk server apart from being replicated, and the
information of the replications is provided to the master at startup or when a
chunk server is added to a cluster.
• Since the master controls the chunk placement, it always updates metadata as
new chunks get written.
• To recover from any corruption, GFS appends data as it is
available rather than updates an existing data set; this provides the
ability to recover from corruption or failure quickly.
• When a corruption is detected, with a combination of frequent
checkpoints, snapshots, and replicas, data is recovered with
minimal chance of data loss.
The GFS architecture has the following
strengths:
● Availability:
1. Triple replication–based redundancy (or more if you choose).
2. Chunk replication.
3. Rapid failovers for any master failure.
4. Automatic replication management.
● Performance:
1. The biggest workload for GFS is reads on large data sets, which, based on the architecture
discussion, will be a nonissue.
2. There are minimal writes to the chunks directly, thus providing auto availability.
● Management:
1. GFS manages itself through multiple failure modes.
2. Automatic load balancing.
3. Storage management and pooling.
4. Chunk management.
5. Failover management.
● Cost:
1. Cost is not a constraint due to the use of commodity hardware and Linux platforms.
Hadoop
• The Hadoop framework application works in an
environment that provides distributed storage and
computation across clusters of computers.
• Hadoop is designed to scale up from a single server to
thousands of machines, each offering local
computation and storage.
• Hadoop is an open source, Java-based programming
framework that supports the processing and storage
of extremely large data sets in a distributed
computing environment.
Hadoop Architecture
Components of Hadoop
MapReduce
• MapReduce is a parallel programming model for
writing distributed applications devised at Google
• For efficient processing of large amounts of data
(multiterabyte datasets), on large clusters
(thousands of nodes) of commodity hardware in a
reliable, fault tolerant manner.
• The MapReduce program runs on Hadoop which is
an Apache open source framework.
Hadoop Distributed File System
• The Hadoop Distributed File System (HDFS) is
based on the Google File System (GFS) and
provides a distributed file system that is
designed to run on commodity hardware.
• It is highly fault tolerant and is designed to be
deployed on low cost hardware.
• It provides high throughput access to
application data and is suitable for applications
having large datasets.
• Apart from the above-mentioned two core
components, the Hadoop framework also includes
the following two modules:
• Hadoop Common:
These are Java libraries and utilities required by
other Hadoop modules.
• Hadoop YARN:
This is a framework for job scheduling and cluster
resource management.
How does it work?
• Data is initially divided into directories and files. Files are
divided into uniform-sized blocks of 64 MB or 128 MB
(preferably 128 MB).
• These files are then distributed across various cluster nodes
for further processing.
• HDFS, being on top of the local file system, supervises the
processing.
• Blocks are replicated for handling hardware failure
• Checking that the code was executed successfully
• Performing the sort that takes place between the map and
reduce stages.
• Sending the sorted data to a certain computer.
• Writing the debugging logs for each job
HDFS
• Hadoop Distributed File System is a block-structured file system
where each file is divided into blocks of a pre-determined size.
• These blocks are stored across a cluster of one or several machines.
• Apache Hadoop HDFS Architecture follows a Master/Slave
Architecture, where a cluster comprises a single NameNode
(Master node) and all the other nodes are DataNodes (Slave nodes).
• The HDFS architecture was designed to solve two known problems
experienced by the early developers of large-scale data processing.
• The first problem was the ability to break down the files across
multiple systems and process each piece of the file independent of
the other pieces and finally consolidate all the outputs in a single
result set.
• The second problem was fault tolerance, both at the file
processing level and at the overall system level, in distributed data
processing systems.
Namenode and Datanodes
 Master/slave architecture
 HDFS cluster consists of a single Namenode, a master server that
manages the file system namespace and regulates access to files by
clients.
 There are a number of DataNodes, usually one per node in a cluster.
 The DataNodes manage storage attached to the nodes that they run on.
 HDFS exposes a file system namespace and allows user data to be
stored in files.
 A file is split into one or more blocks, and the set of blocks is stored in
DataNodes.
 DataNodes serve read and write requests and perform block creation,
deletion, and replication upon instruction from the Namenode.
HDFS Architecture
[Figure: HDFS architecture — clients issue metadata operations (name, replicas, e.g., /home/foo/data) to the Namenode; block reads and writes go directly to Datanodes on Rack 1 and Rack 2, with blocks replicated across racks.]
File System Namespace
• Hierarchical file system with directories and files
• Create, remove, move, rename etc.
• The Namenode maintains the file system namespace.
• Any metadata changes to the file system are
recorded by the Namenode.
• An application can specify the number of replicas
of a file to keep: the replication factor of the file.
This information is stored in the Namenode.
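The namespace operations listed above (create, remove, move, rename) correspond to calls on Hadoop's org.apache.hadoop.fs.FileSystem API. A minimal sketch, assuming a reachable HDFS cluster; the paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOps {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // handle to the configured file system

        Path dir = new Path("/user/demo/reports");       // placeholder path
        fs.mkdirs(dir);                                  // create a directory
        fs.rename(dir, new Path("/user/demo/archive"));  // move/rename within the namespace
        fs.delete(new Path("/user/demo/archive"), true); // remove (recursive)

        fs.close();
    }
}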
Data Replication
 HDFS is designed to store very large files across
machines in a large cluster.
 Each file is a sequence of blocks.
 All blocks in the file except the last are of the same size.
 Blocks are replicated for fault tolerance.
 Block size and replicas are configurable per file.
 The Namenode receives a Heartbeat and a BlockReport
from each DataNode in the cluster.
 BlockReport contains all the blocks on a Datanode.
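Because block size and replication are configurable per file, a client can set them when creating a file and change the replication factor later. A hedged sketch using the public FileSystem API; the path, replication values, and block size shown are illustrative only:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/demo/events.log");    // illustrative path
        // Create the file with an explicit replication factor (2) and block size (128 MB).
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 2, 128L * 1024 * 1024);
        out.writeUTF("sample record");
        out.close();

        // Later, raise the replication factor of the existing file to 3.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}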
Datanode
• A Datanode stores data in files in its local file system.
• A Datanode has no knowledge about the HDFS filesystem.
• It stores each block of HDFS data in a separate file.
• Datanode does not create all files in the same directory.
• A typical HDFS cluster can have thousands of DataNodes and tens of
thousands of HDFS clients per cluster, since each DataNode may execute
multiple application tasks simultaneously.
• The DataNodes are responsible for managing read and write requests
from the file system’s clients, maintaining blocks, and performing
replication as directed by the NameNode.
• The size of the data file equals the actual length of the block. This means
if a block is half full it needs only half of the space of a full block on the
local drive, thereby optimizing storage for compactness; unlike in a regular
file system, no extra space is consumed by the block.
• Image : An image represents the metadata of the namespace (inodes and
lists of blocks).
• On startup, the NameNode pins the entire namespace image in memory.
The in-memory persistence enables the NameNode to service multiple
client requests concurrently.
• Journal : The journal represents the modification log of the image in the
local host’s native file system.
• During normal operations, each client transaction is recorded in the journal,
and the journal file is flushed and synced before the acknowledgment is
sent to the client. The NameNode upon startup or from a recovery can
replay this journal.
• Checkpoint : To enable recovery, the persistent record of the image is
also stored in the local host’s native files system and is called a
checkpoint.
• Once the system starts up, the NameNode never modifies or updates the
checkpoint file.
• A new checkpoint file can be created during the next startup, on a restart,
or on demand when requested by the administrator or by the
CheckpointNode.
Checkpoint Node and Backup Node
• There are two roles that a NameNode can be
designated to perform apart from servicing client
requests and managing Data Nodes.
• These roles are specified during startup and can
be the Checkpoint Node or the Backup Node.
Checkpoint Node
• The Checkpoint Node serves as a journal-capture
architecture to create a recovery mechanism for the
NameNode.
• The Checkpoint Node combines the existing
checkpoint and journal to create a new checkpoint
and an empty journal at specific intervals.
• It returns the new checkpoint to the NameNode.
• The Checkpoint Node runs on a different host from
the NameNode since it has the same memory
requirements as the NameNode.
• This mechanism provides protection against NameNode failures.
Backup Node
• The Backup Node can be considered as a read-only NameNode.
• It contains all file system metadata information except for block
locations.
• It accepts a stream of namespace transactions from the active
NameNode and saves them to its own storage directories, and
applies these transactions to its own namespace image in its
memory.
• If the NameNode fails, the Backup Node’s image in memory and the
checkpoint on disk are a record of the latest namespace state and
can be used to create a checkpoint for recovery.
• Creating a checkpoint from a Backup Node is very efficient as it
processes the entire image in its own disk and memory.
• A Backup Node can perform all operations of the regular
NameNode that do not involve modification of the namespace or
management of block locations.
MapReduce
• The key features of MapReduce that make it the
interface on Hadoop or Cassandra include:
1. Automatic parallelization
2. Automatic distribution
3. Fault-tolerance
4. Status and monitoring tools
5. Easy abstraction for programmers
6. Programming language flexibility
7. Extensibility
MapReduce Programming Model
• MapReduce is based on functional programming models largely from Lisp.
Typically, the users will implement two functions:
Map (in_key, in_value) -> (out_key, intermediate_value) list
• The Map function written by the user will receive an input pair of keys and values,
and after the computation cycles, will produce a set of intermediate key-value
pairs.
• Library functions are then used to group together all intermediate values
associated with an intermediate key I and pass them to the Reduce function.
Reduce (out_key, intermediate_value list) -> out_value list
• The Reduce function written by the user will accept an intermediate key I, and the
set of values for the key.
• It will merge together these values to form a possibly smaller set of values.
• Reducer outputs are just zero or one output value per invocation.
• The intermediate values are supplied to the Reduce function via an iterator.
• The iterator allows us to handle large lists of values that cannot fit in
memory or be processed in a single pass.
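To make the Map and Reduce signatures concrete, the canonical word-count pattern is sketched below using Hadoop's org.apache.hadoop.mapreduce API. Treat it as an illustrative sketch (the lowercasing of tokens is an assumption for this example, not part of the model itself):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(in_key, in_value) -> list of (out_key, intermediate_value):
// for each word in a line, emit (word, 1).
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken().toLowerCase());   // normalize case (assumption)
            context.write(word, ONE);
        }
    }
}

// Reduce(out_key, list of intermediate_values) -> out_value:
// sum the counts supplied by the framework's iterator for each word.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}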
MapReduce Architecture
• The main components of this architecture include:
• Mapper—maps input key-value pairs to a set of
intermediate key-value pairs.
• For an input pair, the mapper can map to zero or
many output pairs. By default, the mapper spawns
one map task for each input split.
• Reducer—performs a number of tasks:
Sort and group mapper outputs.
Shuffle partitions.
Perform secondary sorting as necessary.
Manage overrides specified by users for grouping
and partitioning.
• Reporter—is used to report progress, set application-level status messages,
update any user set counters, and indicate long running tasks or jobs are alive.
• Combiner—an optional performance booster that can be specified to perform
local aggregation of the intermediate outputs to manage the amount of data
transferred from the Mapper to the Reducer.
• Partitioner—controls the partitioning of the keys of the intermediate map
outputs. The key (or a subset of the key) is used to derive the partition and default
partitions are created by a hash function. The total number of partitions will be
the same as the number of reduce tasks for the job.
• Output collector—collects the output of Mappers and Reducers.
• Job configuration—is the primary user interface to manage MapReduce jobs.
• It is typically used to specify the Mapper, Combiner, Partitioner, Reducer,
InputFormat, OutputFormat, and OutputCommitter for every job.
• It also indicates the set of input files and where the output files should be written.
Optionally used to specify other advanced options for the job such as the
comparator to be used, files to be put in the DistributedCache, and compression on
intermediate and/or final job outputs.
• It is also used to specify debugging via user-provided scripts, whether job tasks can be
executed in a speculative manner, the maximum number of attempts per task for
any possible failure, and the percentage of task failures that can be tolerated by the
job overall.
• Output committer—is used to manage the commit for jobs and tasks in
MapReduce. Key tasks executed are:
• Set up the job during initialization. For example, create the intermediate directory
for the job during the initialization of the job.
• Clean up the job after the job completion. For example, remove the temporary
output directory after the job completion.
• Set up any task temporary output.
• Check whether a task needs a commit. This will avoid overheads on unnecessary
commits.
• Commit the task output on completion.
• On failure, discard the task commit and clean up all intermediate results, memory
release, and other user-specified tasks.
• Job input:
• Specifies the input format for a Map/Reduce job.
• Validate the input specification of the job.
• Split up the input file(s) into logical instances to be assigned to an individual
Mapper.
• Provide input records from the logical splits for processing by the Mapper.
• Memory management, JVM reuse, and compression are managed with the job
configuration set of classes.
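The Mapper, Combiner, Partitioner, Reducer, and input/output paths described above are wired together through the job configuration in a driver class. A minimal sketch, reusing the word-count classes from the earlier sketch; the paths and the number of reduce tasks are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);       // map phase
        job.setCombinerClass(IntSumReducer.class);       // optional local aggregation
        job.setPartitionerClass(HashPartitioner.class);  // default hash partitioning, shown explicitly
        job.setReducerClass(IntSumReducer.class);        // reduce phase
        job.setNumReduceTasks(2);                        // number of partitions == reduce tasks

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // placeholder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}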
Example
Input: Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN.
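Assuming the intent is a case-insensitive word count (as in the mapper sketch earlier, which lowercases each token), the map phase emits an intermediate (word, 1) pair for each of the 18 tokens, the framework groups the pairs by key during the shuffle and sort, and the reduce phase sums the counts to produce: bus 7, car 7, train 4.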
Anatomy of File Write and Read
• HDFS has a master and slave kind of architecture.
• Namenode acts as master and Datanodes as worker.
• All the metadata information is with namenode and the
original data is stored on the datanodes.
• Keeping all this in mind, the steps below describe how data
flows between a client interacting with HDFS, i.e., the
Namenode and the Datanodes.
• The following steps are involved in reading the file from HDFS:
Let’s suppose a client (an HDFS client) wants to read a file from HDFS.
Step 1: First the Client will open the file by giving a call to open() method on
FileSystem object, which for HDFS is an instance of DistributedFileSystem class.
Step 2: DistributedFileSystem calls the Namenode, using RPC (Remote Procedure
Call), to determine the locations of the blocks for the first few blocks of the file.
For each block, the NameNode returns the addresses of all the DataNodes that
have a copy of that block. The client will interact with the respective DataNodes to read
the file. The NameNode also provides a token to the client, which it shows to the DataNode
for authentication.
• The DistributedFileSystem returns an FSDataInputStream (an input
stream that supports file seeks) to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the
datanode and namenode I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored
the DataNode addresses for the first few blocks in the file, then connects to the
first closest DataNode for the first block in the file.
• Step 4: Data is streamed from the DataNode back to the
client, which calls read() repeatedly on the stream.
• Step 5: When the end of the block is reached,
DFSInputStream will close the connection to the
DataNode, then find the best DataNode for the next
block. This happens transparently to the client, which
from its point of view is just reading a continuous
stream.
• Step 6: Blocks are read in order, with the
DFSInputStream opening new connections to datanodes
as the client reads through the stream. It will also call the
namenode to retrieve the datanode locations for the next
batch of blocks as needed. When the client has finished
reading, it calls close() on the FSDataInputStream.
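From the client's point of view, the open()/read()/close() sequence above reduces to a few lines against the FileSystem API. A minimal sketch; the path is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);            // DistributedFileSystem for an HDFS URI

        // Step 1: open() returns an FSDataInputStream wrapping a DFSInputStream.
        FSDataInputStream in = fs.open(new Path("/user/demo/input/data.txt")); // placeholder path
        try {
            // Steps 3-6: read() streams block data from the closest datanodes, in order.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);                     // close() on the stream
        }
        fs.close();
    }
}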
Now we will look at what happens when you write a File in
HDFS.
• The DistributedFileSystem object makes an RPC call to the namenode to create a new file in
the filesystem namespace, with no blocks associated with it.
• The namenode performs various checks, such as (a) whether the client has the required
permissions to create the file and (b) whether the file already exists. If a check fails, it
throws an IOException to the client.
• Once the file is registered with the namenode, the client gets an FSDataOutputStream,
which in turn wraps a DFSOutputStream object for the client to start writing data to.
DFSOutputStream handles communication with the datanodes and the namenode.
• As the client writes data, DFSOutputStream splits it into packets and writes them to its
internal data queue; it also maintains an acknowledgement queue.
• The data queue is then consumed by a DataStreamer process, which is responsible for
asking the namenode to allocate new blocks by picking a list of suitable datanodes to
store the replicas.
• The list of datanodes forms a pipeline; assuming a
replication factor of three, there will be three nodes
in the pipeline.
• The DataStreamer streams the packets to the first
datanode in the pipeline, which stores each packet
and forwards it to the second datanode in the
pipeline. Similarly, the second node stores the
packet and forwards it to the next (last) datanode
in the pipeline.
• Once every datanode in the pipeline has
acknowledged a packet, the packet is removed from
the acknowledgement queue.
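All of the packet queuing and pipelining described above is hidden behind create() and write() on the client side. A minimal sketch of the client code; the path and content are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() issues the RPC to the namenode and returns an FSDataOutputStream
        // that wraps a DFSOutputStream managing the packet/ack queues and the pipeline.
        Path file = new Path("/user/demo/output/log.txt");   // placeholder path
        FSDataOutputStream out = fs.create(file);
        try {
            out.writeBytes("first record\n");   // split into packets and streamed to the pipeline
            out.hflush();                       // optionally make the data visible to readers
        } finally {
            out.close();                        // flushes remaining packets and completes the file
        }
        fs.close();
    }
}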
• Now, what happens when one of the machines in
the pipeline running a datanode process fails?
Hadoop has built-in functionality to handle this
scenario. If a datanode fails while data is being
written to it, the following actions are taken, all of
which are transparent to the client writing the data.
• First, the pipeline is closed, and any packets in the ack queue are added to the front
of the data queue so that datanodes that are downstream from the failed node will
not miss any packets.
• The current block on the good datanodes is given a new identity, which is
communicated to the namenode, so that the partial block on the failed datanode will
be deleted if the failed datanode recovers later on.
• The failed datanode is removed from the pipeline, and the remainder of the block’s
data is written to the two good datanodes in the pipeline.
• The namenode notices that the block is under-replicated, and it arranges for a
further replica to be created on another node. Subsequent blocks are then treated as
normal.
It’s possible, but unlikely, that multiple datanodes fail while a block is being written.
• As long as dfs.replication.min replicas (which defaults to one) are written, the write
will succeed, and the block will be asynchronously replicated across the cluster until
its target replication factor is reached (dfs.replication, which defaults to three).
Ad

More Related Content

Similar to Chaptor 2- Big Data Processing in big data technologies (20)

Google File System
Google File SystemGoogle File System
Google File System
DreamJobs1
 
Lecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptxLecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptx
Amanuelmergia
 
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
raghdooosh
 
Gfs sosp2003
Gfs sosp2003Gfs sosp2003
Gfs sosp2003
睿琦 崔
 
Gfs
GfsGfs
Gfs
Shahbaz Sidhu
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Gfs google-file-system-13331
Gfs google-file-system-13331Gfs google-file-system-13331
Gfs google-file-system-13331
Fengchang Xie
 
Designing your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with PostgresDesigning your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with Postgres
Ozgun Erdogan
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
Taras Matyashovsky
 
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyunit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
0710harish
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Mohit Tare
 
Advanced Topics on Database - Unit-1 AU17
Advanced Topics on Database - Unit-1 AU17Advanced Topics on Database - Unit-1 AU17
Advanced Topics on Database - Unit-1 AU17
LOGANATHANK24
 
Big Data for QAs
Big Data for QAsBig Data for QAs
Big Data for QAs
Ahmed Misbah
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
musrath mohammad
 
2. hadoop fundamentals
2. hadoop fundamentals2. hadoop fundamentals
2. hadoop fundamentals
Lokesh Ramaswamy
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering students
rishavkumar1402
 
Operating system memory management
Operating system memory managementOperating system memory management
Operating system memory management
rprajat007
 
High performance computing
High performance computingHigh performance computing
High performance computing
punjab engineering college, chandigarh
 
Training Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed cachingTraining Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed caching
OutSystems
 
lecture-13.pptx
lecture-13.pptxlecture-13.pptx
lecture-13.pptx
laiba29012
 
Google File System
Google File SystemGoogle File System
Google File System
DreamJobs1
 
Lecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptxLecture-7 Main Memroy.pptx
Lecture-7 Main Memroy.pptx
Amanuelmergia
 
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
Big Data Storage Concepts from the "Big Data concepts Technology and Architec...
raghdooosh
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Gfs google-file-system-13331
Gfs google-file-system-13331Gfs google-file-system-13331
Gfs google-file-system-13331
Fengchang Xie
 
Designing your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with PostgresDesigning your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with Postgres
Ozgun Erdogan
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
Taras Matyashovsky
 
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyunit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
unit 2 - book ppt.pptxtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
0710harish
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Mohit Tare
 
Advanced Topics on Database - Unit-1 AU17
Advanced Topics on Database - Unit-1 AU17Advanced Topics on Database - Unit-1 AU17
Advanced Topics on Database - Unit-1 AU17
LOGANATHANK24
 
MOD-2 presentation on engineering students
MOD-2 presentation on engineering studentsMOD-2 presentation on engineering students
MOD-2 presentation on engineering students
rishavkumar1402
 
Operating system memory management
Operating system memory managementOperating system memory management
Operating system memory management
rprajat007
 
Training Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed cachingTraining Webinar: Enterprise application performance with distributed caching
Training Webinar: Enterprise application performance with distributed caching
OutSystems
 
lecture-13.pptx
lecture-13.pptxlecture-13.pptx
lecture-13.pptx
laiba29012
 

Recently uploaded (20)

Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Development of MLR, ANN and ANFIS Models for Estimation of PCUs at Different ...
Journal of Soft Computing in Civil Engineering
 
railway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forgingrailway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forging
Javad Kadkhodapour
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Structural Response of Reinforced Self-Compacting Concrete Deep Beam Using Fi...
Journal of Soft Computing in Civil Engineering
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
Oil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdfOil-gas_Unconventional oil and gass_reseviours.pdf
Oil-gas_Unconventional oil and gass_reseviours.pdf
M7md3li2
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
railway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forgingrailway wheels, descaling after reheating and before forging
railway wheels, descaling after reheating and before forging
Javad Kadkhodapour
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITYADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ADVXAI IN MALWARE ANALYSIS FRAMEWORK: BALANCING EXPLAINABILITY WITH SECURITY
ijscai
 
introduction to machine learining for beginers
introduction to machine learining for beginersintroduction to machine learining for beginers
introduction to machine learining for beginers
JoydebSheet
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E..."Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
"Boiler Feed Pump (BFP): Working, Applications, Advantages, and Limitations E...
Infopitaara
 
15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...15th International Conference on Computer Science, Engineering and Applicatio...
15th International Conference on Computer Science, Engineering and Applicatio...
IJCSES Journal
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
DSP and MV the Color image processing.ppt
DSP and MV the  Color image processing.pptDSP and MV the  Color image processing.ppt
DSP and MV the Color image processing.ppt
HafizAhamed8
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G..."Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
"Feed Water Heaters in Thermal Power Plants: Types, Working, and Efficiency G...
Infopitaara
 
AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)AI-assisted Software Testing (3-hours tutorial)
AI-assisted Software Testing (3-hours tutorial)
Vəhid Gəruslu
 
Compiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptxCompiler Design Unit1 PPT Phases of Compiler.pptx
Compiler Design Unit1 PPT Phases of Compiler.pptx
RushaliDeshmukh2
 
Ad

Chaptor 2- Big Data Processing in big data technologies

  • 1. Chapter 2 BIG DATA PROCESSING
  • 2. Big Data technologies • Big data technologies are essential for offering more precise analysis, which may lead to more tangible decision making resulting in better operational efficiencies, cost reductions, and reduced risks for the business. • To control the power of big data you would need an infrastructure that can handle and process huge volumes of structured and unstructured data in real time and can preserve data privacy and security. • Today, the various architectures and papers that were contributed by these and other developers across the world have culminated into several open-source projects under the Apache Software Foundation and the NoSQL movement. • All of these technologies have been identified as Big Data processing platforms, including Hadoop, Hive, HBase, Cassandra, and Map Reduce. • NoSQL platforms include MongoDB, Neo4J, Riak, Amazon DynamoDB, MemcachedDB, BerkleyDB, Voldemort, and many more.
  • 3. Distributed Data Processing • Distributed data processing has been in existence since the late 1970s. • The primary concept was to replicate the DBMS in a master–slave configuration and process data across multiple instances • Each slave would engage in a two-phase commit with its master in a query processing situation.
  • 4. Why did distributed data processing fail to meet the requirements in the relational data processing architecture? • Complex architectures for consistency management • Latencies across the system • Slow networks • Infrastructure cost • Complex data processing and transformation requirements
  • 5. Client–Server Data Processing Benefits: • Centralization of administration, security, and setup. • Back-up and recovery of data is inexpensive, as outages can occur at the server or a client and can be restored. • Scalability of infrastructure by adding more server capacity or client capacity can be accomplished. The scalability is not linear. • Accessibility of the server from heterogeneous platforms locally or remotely. • Clients can use servers for different types of processing. Limitations: • The server is the central point of failure. • Very limited scalability. • Performance can degrade with network congestion. • Too many clients accessing a single server cannot process data in a quick time.
  • 6. Big Data Processing Requirements Volume: • Size of data to be processed is large—it needs to be broken into manageable chunks. • Data needs to be processed in parallel across multiple systems. • Data needs to be processed across several program modules simultaneously.
  • 7. Velocity: • Data needs to be processed at streaming speeds during data collection. • Data needs to be processed for multiple acquisition points. Variety: • Data of different formats needs to be processed. • Data of different types needs to be processed. • Data of different structures needs to be processed. • Data from different regions needs to be processed. Ambiguity: • Big Data is ambiguous by nature due to the lack of relevant metadata and context in many cases. An example is the use of M and F in a sentence—it can mean, respectively, Monday and Friday, male and female, or mother and father. • Big Data that is within the corporation also exhibits this ambiguity to a lesser degree. For example, employment agreements have standard and custom sections and the latter is ambiguous without the right context. Complexity: • Big Data complexity needs to use many algorithms to process data quickly and efficiently. • Several types of data need multipass processing and scalability is extremely important.
  • 8. Google File System https://youtu.be/eRgFNW4QFDc • Developed by Google to hold Google’s increasing data processing requirements. • It is one of the scalable distributed file system. • GFS is improved to hold the data used by Google and their requirement of storage, like search engine, which produce large amounts of data that needs to be stored. • The main purpose behind the design of GFS is to hold the Google’s huge cluster requirements without making extra load on applications.
  • 9. • Google organized the GFS into clusters of computers. • Each cluster might contain hundreds or even thousands of machines. Within GFS clusters there are three kinds of entities: clients, master servers and chunkservers. • "client" refers to any entity that makes a file request. • Requests can range from retrieving and manipulating existing files to creating new files on the system. • Clients can be other computers or computer applications. • You can think of clients as the customers of the GFS.
  • 16. Master Server • The master server acts as the coordinator for the cluster. • The master's duties include maintaining an operation log, which keeps track of the activities of the master's cluster. • The operation log helps keep service interruptions to a minimum -- if the master server crashes, a replacement server that has monitored the operation log can take its place. • The master server also keeps track of metadata, which is the information that describes chunks. • The metadata tells the master server to which files the chunks belong and where they fit within the overall file. • Upon startup, the master polls all the chunkservers in its cluster. • The chunkservers respond by telling the master server the contents of their inventories. • From that moment on, the master server keeps track of the location of chunks within the cluster. • There's only one active master server per cluster at any one time (though each cluster has multiple copies of the master server in case of a hardware failure)
  • 17. Chunkservers • Chunkservers are the workhorses of the GFS. • They're responsible for storing the 64-MB file chunks. • The chunkservers don't send chunks to the master server. • Instead, they send requested chunks directly to the client. • The GFS copies every chunk multiple times and stores it on different chunkservers. • Each copy is called a replica. • By default, the GFS makes three replicas per chunk, but users can change the setting and make more or fewer replicas if desired.
  • 18. A GFS cluster: • A single master • Multiple chunk servers (workers or slaves) per master • Accessed by multiple clients • Running on commodity Linux machines A file: • Represented as fixed-sized chunks • Labeled with 64-bit unique global IDs • Stored at chunk servers and three-way mirrored across chunk servers
  • 19. • In the GFS cluster, input data files are divided into chunks (64 MB is the standard chunk size), each assigned its unique 64-bit handle, and stored on local chunk server systems as files. • To ensure fault tolerance and scalability, each chunk is replicated at least once on another server, and the default design is to create three copies of a chunk. • The role of the master is to communicate to clients which chunk servers have which chunks and their metadata information. • Clients’ tasks then interact directly with chunk servers for all subsequent operations, and use the master only in a minimal fashion.
  • 20. • Another important issue to understand in the GFS architecture is the single point of failure (SPOF) of the master node and all the metadata that keeps track of the chunks and their state. • To avoid this situation, GFS was designed to have the master keep data in memory for speed, keep a log on the master’s local disk, and replicate the disk across remote nodes. • This way if there is a crash in the master node, a shadow can be up and running almost instantly. • The master stores three types of metadata: 1. File and chunk names or namespaces. 2. Mapping from files to chunks (i.e., the chunks that make up each file). 3. Locations of each chunk’s replicas. The replica locations for each chunk are stored on the local chunk server apart from being replicated, and the information of the replications is provided to the master at startup or when a chunk server is added to a cluster. • Since the master controls the chunk placement, it always updates metadata as new chunks get written.
  • 21. • To recover from any corruption, GFS appends data as it is available rather than updates an existing data set; this provides the ability to recover from corruption or failure quickly. • When a corruption is detected, with a combination of frequent checkpoints, snapshots, and replicas, data is recovered with minimal chance of data loss.
  • 22. The GFS architecture has the following strengths: ● Availability: 1. Triple replication–based redundancy (or more if you choose). 2. Chunk replication. 3. Rapid failovers for any master failure. 4. Automatic replication management. ● Performance: 1. The biggest workload for GFS is read-on large data sets, which based on the architecture discussion, will be a nonissue. 2. There are minimal writes to the chunks directly, thus providing auto availability. ● Management: 1. GFS manages itself through multiple failure modes. 2. Automatic load balancing. 3. Storage management and pooling. 4. Chunk management. 5. Failover management. ● Cost: 1. Is not a constraint due to use of commodity hardware and Linux platforms.
  • 23. Hadoop • The Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers. • Hadoop is designed to scale up from single server to thousands of machines, each offering local computation and storage. • Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment.
  • 26. Map Reduce • MapReduce is a parallel programming model for writing distributed applications devised at Google • For efficient processing of large amounts of data (multiterabyte datasets), on large clusters (thousands of nodes) of commodity hardware in a reliable, fault tolerant manner. • The MapReduce program runs on Hadoop which is an Apache open source framework.
  • 27. Hadoop Distributed File System • The Hadoop Distributed File System (HDFS)is based on the Google File System (GFS) and provides a distributed file system that is designed to run on commodity hardware. • It is highly fault tolerant and is designed to be deployed on low cost hardware. • It provides high throughput access to application data and is suitable for applications having large datasets.
  • 28. • Apart from the above mentioned two core components, Hadoop framework also includes the following two modules: • Hadoop Common: These are Java libraries and utilities required by other Hadoop modules. • Hadoop YARN: • This is a framework for job scheduling and cluster resource management.
  • 29. How it works? • Data is initially divided into directories and files. Files are divided into uniform sized blocks of 128M and 64M(preferably 128M) • These files are then distributed across various cluster nodes for further processing. • HDFS being on top of the local file system supervises the processing • Blocks are replicated for handling hardware failure • Checking that the code was executed successfully • Performing the sort that takes place between the map and reduce stages. • Sending the sorted data to a certain computer. • Writing the debugging logs for each job
  • 30. HDFS • Hadoop Distributed File System is a block-structured file system where each file is divided into blocks of a pre-determined size. • These blocks are stored across a cluster of one or several machines. • Apache Hadoop HDFS Architecture follows a Master/Slave Architecture, where a cluster comprises of a single NameNode (Master node) and all the other nodes are DataNodes (Slave nodes). • The HDFS architecture was designed to solve two known problems experienced by the early developers of large-scale data processing. • The first problem was the ability to break down the files across multiple systems and process each piece of the file independent of the other pieces and finally consolidate all the outputs in a single result set. • The second problem was the fault tolerance both at the file processing level and the overall system level in the distributed data processing systems.
  • 32. Namenode and Datanodes  Master/slave architecture  HDFS cluster consists of a single Namenode, a master server that manages the file system namespace and regulates access to files by clients.  There are a number of DataNodes usually one per node in a cluster.  The DataNodes manage storage attached to the nodes that they run on.  HDFS exposes a file system namespace and allows user data to be stored in files.  A file is split into one or more blocks and set of blocks are stored in DataNodes.  DataNodes: serves read, write requests, performs block creation, deletion, and replication upon instruction from Namenode. 01/06/2025 32
  • 33. HDFS Architecture 01/06/2025 33 Namenode B replication Rack1 Rack2 Client Blocks Datanodes Datanodes Client Write Read Metadata ops Metadata(Name, replicas..) (/home/foo/data,6. .. Block ops
  • 34. File System Namespace 01/06/2025 34 • Hierarchical file system with directories and files • Create, remove, move, rename etc. • Namenode maintains the file system • Any meta information changes to the file system recorded by the Namenode. • An application can specify the number of replicas of the file needed: replication factor of the file. This information is stored in the Namenode.
  • 35. Data Replication 01/06/2025 35  HDFS is designed to store very large files across machines in a large cluster.  Each file is a sequence of blocks.  All blocks in the file except the last are of the same size.  Blocks are replicated for fault tolerance.  Block size and replicas are configurable per file.  The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster.  BlockReport contains all the blocks on a Datanode.
  • 36. Datanode 01/06/2025 36 • A Datanode stores data in files in its local file system. • Datanode has no knowledge about HDFS filesystem • It stores each block of HDFS data in a separate file. • Datanode does not create all files in the same directory. • A typical HDFS cluster can have thousands of DataNodes and tens of thousands of HDFS clients per cluster, since each DataNode may execute multiple application tasks simultaneously. • The DataNodes are responsible for managing read and write requests from the file system’s clients, and block maintenance and perform replication as directed by the NameNode. • The size of the data file equals the actual length of the block. This means if a block is half full it needs only half of the space of the full block on the local drive, thereby optimizing storage space for compactness, and there is no extra space consumed on the block unlike a regular file system.
37.
• Image: the image represents the metadata of the namespace (inodes and lists of blocks).
• On startup, the NameNode pins the entire namespace image in memory. This in-memory copy enables the NameNode to service multiple client requests concurrently.
• Journal: the journal is the modification log of the image, kept in the local host's native file system.
• During normal operation, each client transaction is recorded in the journal, and the journal file is flushed and synced before the acknowledgment is sent to the client. The NameNode can replay this journal on startup or during recovery.
• Checkpoint: to enable recovery, a persistent record of the image is also stored in the local host's native file system and is called a checkpoint.
• Once the system starts up, the NameNode never modifies or updates the checkpoint file.
• A new checkpoint file can be created during the next startup, on a restart, or on demand when requested by the administrator or by the CheckpointNode.
38. Checkpoint Node and Backup Node
• Apart from servicing client requests and managing DataNodes, a NameNode can be designated to perform one of two additional roles.
• These roles are specified during startup and can be the Checkpoint Node or the Backup Node.
39. Checkpoint Node
• The Checkpoint Node serves as a journal-capture mechanism that creates a recovery path for the NameNode.
• The Checkpoint Node combines the existing checkpoint and journal to create a new checkpoint and an empty journal at regular intervals.
• It returns the new checkpoint to the NameNode.
• The Checkpoint Node runs on a different host from the NameNode, since it has the same memory requirements as the NameNode.
• This mechanism protects the namespace metadata and provides a recovery path if the NameNode fails.
40. Backup Node
• The Backup Node can be considered a read-only NameNode.
• It contains all file system metadata information except for block locations.
• It accepts a stream of namespace transactions from the active NameNode, saves them to its own storage directories, and applies the transactions to its own namespace image in memory.
• If the NameNode fails, the Backup Node's image in memory and the checkpoint on disk are a record of the latest namespace state and can be used to create a checkpoint for recovery.
• Creating a checkpoint from a Backup Node is very efficient, as it already has the entire image in its own memory and on its own disk.
• A Backup Node can perform all operations of the regular NameNode that do not involve modification of the namespace or management of block locations.
41. MapReduce
• The key features of MapReduce that make it the processing interface on Hadoop or Cassandra include:
1. Automatic parallelization
2. Automatic distribution
3. Fault tolerance
4. Status and monitoring tools
5. Easy abstraction for programmers
6. Programming-language flexibility
7. Extensibility
42. MapReduce Programming Model
• MapReduce is based on functional programming models, largely drawn from Lisp. Typically, the user implements two functions:

Map(in_key, in_value) -> (out_key, intermediate_value) list

• The Map function written by the user receives an input key-value pair and, after its computation, produces a set of intermediate key-value pairs.
• Library functions then group together all intermediate values associated with the same intermediate key I and pass them to the Reduce function.

Reduce(out_key, intermediate_value list) -> out_value list

• The Reduce function written by the user accepts an intermediate key I and the set of values for that key.
• It merges these values together to form a possibly smaller set of values.
• Typically, zero or one output value is produced per Reduce invocation.
• The intermediate values are supplied to the Reduce function via an iterator.
• The iterator allows the function to handle lists of values that are too large to fit in memory or to process in a single pass.
44.
• The main components of this architecture include:
• Mapper: maps input key-value pairs to a set of intermediate key-value pairs. For a given input pair, the Mapper can emit zero or many output pairs. By default, one map task is spawned for each input split.
• Reducer: performs a number of tasks: sorting and grouping the Mapper outputs, shuffling partitions, performing secondary sorting as necessary, and managing any overrides specified by users for grouping and partitioning.
45.
• Reporter: used to report progress, set application-level status messages, update any user-defined counters, and indicate that long-running tasks or jobs are still alive.
• Combiner: an optional performance booster that can be specified to perform local aggregation of the intermediate outputs, reducing the amount of data transferred from the Mapper to the Reducer.
• Partitioner: controls the partitioning of the keys of the intermediate map outputs (see the sketch below). The key (or a subset of the key) is used to derive the partition, and default partitions are created by a hash function. The total number of partitions is the same as the number of reduce tasks for the job.
• Output collector: collects the output of Mappers and Reducers.
• Job configuration: the primary user interface for managing MapReduce jobs.
• It is typically used to specify the Mapper, Combiner, Partitioner, Reducer, InputFormat, OutputFormat, and OutputCommitter for every job.
• It also indicates the set of input files and where the output files should be written. Optionally, it specifies other advanced options for the job, such as the comparator to be used, files to be placed in the DistributedCache, and compression of intermediate and/or final job outputs.
• It is also used for debugging via user-provided scripts, for controlling whether job tasks can be executed speculatively, the maximum number of attempts per task for any possible failure, and the percentage of task failures the job can tolerate overall.
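As a hypothetical illustration of the Partitioner contract, the sketch below routes keys to reduce tasks by the hash of their first letter, mirroring what the default hash partitioner does with the whole key. The class name and partitioning rule are illustrative only.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: all keys starting with the same letter
// are sent to the same reduce task.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        if (key.getLength() == 0 || numReduceTasks == 0) {
            return 0;
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        // Mask off the sign bit before taking the modulus, as the default hash partitioner does.
        return (first & Integer.MAX_VALUE) % numReduceTasks;
    }
}

In a job driver this would be registered with job.setPartitionerClass(FirstLetterPartitioner.class); it only matters when the job runs more than one reduce task.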
46.
• Output committer: manages the commit of jobs and tasks in MapReduce. The key tasks it executes are:
• Set up the job during initialization; for example, create the intermediate directory for the job.
• Clean up the job after completion; for example, remove the temporary output directory after the job completes.
• Set up any temporary output for a task.
• Check whether a task needs a commit, which avoids the overhead of unnecessary commits.
• Commit the task output on completion.
• On failure, discard the task commit, clean up all intermediate results, release memory, and run any other user-specified tasks.
• Job input:
• Specifies the input format for a MapReduce job.
• Validates the input specification of the job.
• Splits the input file(s) into logical instances, each of which is assigned to an individual Mapper.
• Provides input records from the logical splits for processing by the Mapper.
• Memory management, JVM reuse, and compression are managed with the job configuration set of classes.
48. Example
Input: Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN
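A minimal word-count sketch for this input, written against the standard Hadoop MapReduce API, is shown below. It lower-cases each token so that Bus, bus, and BUS count as the same word; the class names and input/output paths are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every comma- or space-separated token, lower-cased.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("[,\\s]+")) {
                if (!token.isEmpty()) {
                    word.set(token.toLowerCase());
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class); // local aggregation before the shuffle
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. /user/demo/input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. /user/demo/output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

For the input above, this would produce bus = 7, car = 7, train = 4.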
50. Anatomy of a File Write and Read
• HDFS has a master/slave architecture.
• The NameNode acts as the master and the DataNodes as workers.
• All the metadata is kept by the NameNode, and the actual data is stored on the DataNodes.
• With this in mind, the accompanying figure illustrates how data flows when a client interacts with HDFS, i.e., with the NameNode and the DataNodes.
51.
• The following steps are involved in reading a file from HDFS. Suppose a client (an HDFS client) wants to read a file from HDFS.
• Step 1: The client opens the file by calling the open() method on a FileSystem object, which for HDFS is an instance of the DistributedFileSystem class.
• Step 2: DistributedFileSystem calls the NameNode, using RPC (Remote Procedure Call), to determine the locations of the first few blocks of the file. For each block, the NameNode returns the addresses of all the DataNodes that have a copy of that block. The client will interact with the respective DataNodes to read the file. The NameNode also provides a token to the client, which the client presents to the DataNodes for authentication.
• DistributedFileSystem returns an FSDataInputStream (an input stream that supports file seeks) to the client to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the DataNode and NameNode I/O.
• Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the DataNode addresses for the first few blocks of the file, connects to the closest DataNode holding the first block of the file.
52.
• Step 4: Data is streamed from the DataNode back to the client, which calls read() repeatedly on the stream.
• Step 5: When the end of a block is reached, DFSInputStream closes the connection to that DataNode and then finds the best DataNode for the next block. This happens transparently to the client, which from its point of view is just reading a continuous stream.
• Step 6: Blocks are read in order, with DFSInputStream opening new connections to DataNodes as the client reads through the stream. It will also call the NameNode to retrieve the DataNode locations for the next batch of blocks as needed. When the client has finished reading, it calls close() on the FSDataInputStream.
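From application code, this whole sequence is hidden behind the FileSystem API. The sketch below is a minimal read example; the file path is a placeholder.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.txt"); // hypothetical path
        // open() triggers the NameNode RPC; read() pulls blocks from DataNodes.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream file contents to stdout
        }
    }
}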
54. Now let us look at what happens when you write a file to HDFS.
• The DistributedFileSystem object makes an RPC call to the NameNode to create a new file in the file system namespace, with no blocks associated with it.
• The NameNode performs various checks: (a) whether the client has the required permissions to create the file, and (b) whether the file already exists. If either check fails, it throws an IOException to the client.
• Once the file is registered with the NameNode, the client receives an FSDataOutputStream object, which in turn wraps a DFSOutputStream object, for the client to start writing data to. DFSOutputStream handles communication with the DataNodes and the NameNode.
• As the client writes data, DFSOutputStream splits it into packets and writes them to its internal queue, the data queue; it also maintains an acknowledgement queue.
• The data queue is consumed by a DataStreamer process, which is responsible for asking the NameNode to allocate new blocks by picking a list of suitable DataNodes to store the replicas.
55.
• The list of DataNodes forms a pipeline; assuming a replication factor of three, there will be three nodes in the pipeline.
• The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline. Similarly, the second node stores the packet and forwards it to the next (and last) DataNode in the pipeline.
• Once every DataNode in the pipeline has acknowledged a packet, that packet is removed from the acknowledgement queue.
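As with reads, the packet pipeline is hidden from application code. The sketch below is a minimal write example using the FileSystem API, again with a placeholder path.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/output.txt"); // hypothetical path
        // create() registers the file with the NameNode; writes are split into
        // packets and pushed through the DataNode pipeline behind the scenes.
        try (FSDataOutputStream out = fs.create(file, true)) { // true = overwrite if it exists
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }
        // Closing the stream flushes remaining packets and waits for acknowledgements.
    }
}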
56.
• What happens when one of the machines in the pipeline running a DataNode process fails? Hadoop has built-in functionality to handle this scenario. If a DataNode fails while data is being written to it, the following actions are taken, all transparent to the client writing the data.
57.
• First, the pipeline is closed, and any packets in the acknowledgement queue are added to the front of the data queue so that DataNodes downstream from the failed node do not miss any packets.
• The current block on the good DataNodes is given a new identity, which is communicated to the NameNode, so that the partial block on the failed DataNode can be deleted if that DataNode recovers later.
• The failed DataNode is removed from the pipeline, and the remainder of the block's data is written to the two good DataNodes in the pipeline.
• The NameNode notices that the block is under-replicated and arranges for a further replica to be created on another node. Subsequent blocks are then treated as normal.
• It is possible, but unlikely, for multiple DataNodes to fail while a block is being written. As long as dfs.replication.min replicas (default: one) are written, the write will succeed, and the block will be asynchronously replicated across the cluster until its target replication factor is reached (dfs.replication, which defaults to three).