
CS3551- DISTRIBUTED COMPUTING

UNIT I INTRODUCTION

Introduction: Definition – Relation to Computer System Components – Motivation – Message-Passing Systems versus Shared Memory Systems – Primitives for Distributed Communication – Synchronous versus Asynchronous Executions – Design Issues and Challenges; A Model of Distributed Computations: A Distributed Program – A Model of Distributed Executions – Models of Communication Networks – Global State of a Distributed System.

UNIT II LOGICAL TIME AND GLOBAL STATE

Logical Time: Physical Clock Synchronization: NTP – A Framework for a System of Logical Clocks
– Scalar Time – Vector Time; Message Ordering and Group Communication: Message Ordering
Paradigms – Asynchronous Execution with Synchronous Communication – Synchronous Program
Order on Asynchronous System – Group Communication – Causal Order – Total Order; Global
State and Snapshot Recording Algorithms: Introduction – System Model and Definitions – Snapshot
Algorithms for FIFO Channels

UNIT III DISTRIBUTED MUTEX AND DEADLOCK

Distributed Mutual Exclusion Algorithms: Introduction – Preliminaries – Lamport's Algorithm – Ricart–Agrawala's Algorithm – Token-Based Algorithms – Suzuki–Kasami's Broadcast Algorithm; Deadlock Detection in Distributed Systems: Introduction – System Model – Preliminaries – Models of Deadlocks – Chandy–Misra–Haas Algorithm for the AND Model and OR Model.

UNIT IV CONSENSUS AND RECOVERY

Consensus and Agreement Algorithms: Problem Definition – Overview of Results – Agreement in a Failure-Free System (Synchronous and Asynchronous) – Agreement in Synchronous Systems with Failures; Checkpointing and Rollback Recovery: Introduction – Background and Definitions – Issues in Failure Recovery – Checkpoint-based Recovery – Coordinated Checkpointing Algorithm – Algorithm for Asynchronous Checkpointing and Recovery.

UNIT V CLOUD COMPUTING

Definition of Cloud Computing – Characteristics of Cloud – Cloud Deployment Models – Cloud


Service Models – Driving Factors and Challenges of Cloud – Virtualization – Load Balancing –
Scalability and Elasticity – Replication – Monitoring – Cloud Services and Platforms: Compute
Services – Storage Services – Application Services
UNIT I
INTRODUCTION

Computation started out on a single processor; such uni-processor computing is termed
centralized computing.
A distributed system is a collection of independent computers, interconnected via a
network, that are capable of collaborating on a task. Distributed computing is computing
performed in a distributed system.

A distributed system is a collection of independent entities that cooperate to solve a problem
that cannot be solved individually. Distributed computing is widely used thanks to advances
in machines and to faster, cheaper networks. In a distributed system, the entire network is
viewed as one computer: the multiple systems connected to the network appear as a single
system to the user.
Features of Distributed Systems:
No common physical clock - This introduces the element of “distribution” in the system and
gives rise to the inherent asynchrony amongst the processors.
No shared memory - A key feature that requires message-passing for communication. It also
implies the absence of a common physical clock.
Geographical separation – The more geographically distant the processors are, the more
representative the system is of a distributed system.
Autonomy and heterogeneity – The processors are “loosely coupled” in that they can have
different speeds and each can run a different operating system.

Issues in distributed systems


Heterogeneity
Openness
Security
Scalability
Failure handling
Concurrency
Transparency
Quality of service

1.2 Relation to Computer System Components

Fig 1.1: Example of a Distributed System


As shown in Fig 1.1, each computer has a memory-processing unit, and the computers are
connected by a communication network. Each system connected to the distributed network
hosts distributed software, i.e., a middleware layer. The middleware drives the Distributed
System (DS) while preserving its heterogeneity. A computation or run in a distributed
system is the execution of processes to achieve a common goal.

Fig 1.2: Interaction of layers of network

The interaction of the network layers with the operating system and middleware is shown in
Fig 1.2. The middleware contains important library functions that facilitate the operation of
the DS.
A distributed system uses a layered architecture to break down the complexity of system
design. The middleware is the distributed software that drives the distributed system while
providing transparency of heterogeneity at the platform level.

Examples of middleware: the Object Management Group’s (OMG) Common Object Request
Broker Architecture (CORBA) [36], Remote Procedure Call (RPC), and the Message Passing
Interface (MPI).

1.3 Motivation

The following key points act as driving forces behind DS:

Inherently distributed computations: DS can process the computations at geographically


remote locations.
Resource sharing: Hardware, databases, and special libraries can be shared between
systems without each system owning a dedicated copy or replica. This is cost effective and reliable.
Access to geographically remote data and resources: Resources such as centralized
servers can also be accessed from distant locations.
Enhanced reliability: DS provide enhanced reliability, since they run on multiple copies of
resources.
The term reliability comprises:
1. Availability: the resource/service provided should be accessible at all times.
2. Integrity: the value/state of the resource should be correct and consistent.
3. Fault-tolerance: the ability to recover from system failures.
Increased performance/cost ratio: The resource sharing and remote access features of DS
naturally increase the performance/cost ratio.
Scalability: The number of systems operating in a distributed environment can be increased as
the demand increases.

1.4 MESSAGE-PASSING SYSTEMS VERSUS SHARED MEMORY SYSTEMS


Communication between the tasks in multiprocessor systems takes place through two main
modes:

Message passing systems:


• This allows multiple processes to read and write data to the message queue
without being connected to each other.
• Messages are stored on the queue until their recipient retrieves them.
Shared memory systems:
• The shared memory is the memory that can be simultaneously accessed by
multiple processes. This is done so that the processes can communicate with each
other.
• Communication among processors takes place through shared data variables, and
control variables for synchronization among the processors.
• Semaphores and monitors are common synchronization mechanisms on shared
memory systems.
• When shared memory model is implemented in a distributed environment, it is
termed as distributed shared memory.

Emulating message-passing on a shared memory system (MP → SM)


• The shared memory system can be made to act as message passing system. The
shared address space can be partitioned into disjoint parts, one part being
assigned to each processor.
• Send and receive operations are implemented by writing to and reading from the
destination/sender processor’s address space. The read and write operations are
synchronized.
• Specifically, a separate location can be reserved as the mailbox for each ordered
pair of processes.
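The mailbox scheme described above can be sketched as follows. This is a minimal illustration, not the text's own algorithm: threads stand in for processes, a dictionary stands in for the partitioned shared address space, and all names (SharedMemoryMailboxes, send, receive) are made up for the example.

    # Emulating message passing on shared memory: one mailbox slot per ordered
    # pair of processes; reads and writes to a slot are synchronized.
    import threading

    class SharedMemoryMailboxes:
        def __init__(self, n):
            # shared address space, partitioned into one slot per ordered pair
            self.slots = {(i, j): None for i in range(n) for j in range(n)}
            self.cond = threading.Condition()

        def send(self, src, dst, msg):
            with self.cond:
                while self.slots[(src, dst)] is not None:   # wait for a free mailbox
                    self.cond.wait()
                self.slots[(src, dst)] = msg                 # "send" = write to the mailbox
                self.cond.notify_all()

        def receive(self, src, dst):
            with self.cond:
                while self.slots[(src, dst)] is None:        # wait for a message
                    self.cond.wait()
                msg, self.slots[(src, dst)] = self.slots[(src, dst)], None
                self.cond.notify_all()
                return msg                                   # "receive" = read the mailbox

    mb = SharedMemoryMailboxes(2)
    threading.Thread(target=lambda: mb.send(0, 1, "hello")).start()
    print(mb.receive(0, 1))    # prints: hello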

Emulating shared memory on a message-passing system (SM → MP)


• This is also implemented through read and write operations. Each shared
location can be modeled as a separate process. A write to a shared location is
emulated by sending an update message to the corresponding owner process, and a
read from a shared location is emulated by sending a query message to the owner
process.
• This emulation is expensive because the processes have to access memory locations
owned by other processes. The latencies involved in read and write operations may be
high even under shared memory emulation, because the read and write operations are
implemented using network-wide communication. A small sketch of this owner-process
scheme is given below.
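This is a minimal sketch of the owner-process idea, again with threads standing in for processes and made-up names: a write becomes an update message to the owner, and a read becomes a query message that the owner answers.

    import queue
    import threading

    requests = queue.Queue()              # message channel carrying requests to the owner

    def owner():
        value = 0                         # the emulated shared location
        while True:
            op, arg, reply = requests.get()
            if op == "write":
                value = arg               # update message: overwrite the location
            elif op == "read":
                reply.put(value)          # query message: send back the current value
            elif op == "stop":
                break

    threading.Thread(target=owner, daemon=True).start()

    requests.put(("write", 42, None))     # emulated write to the shared location
    reply = queue.Queue()
    requests.put(("read", None, reply))   # emulated read via a query message
    print(reply.get())                    # prints: 42
    requests.put(("stop", None, None))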

1.5 PRIMITIVES FOR DISTRIBUTED COMMUNICATION

Blocking / Non blocking / Synchronous / Asynchronous


• Message send and message receive communication primitives are done through
Send() and Receive(), respectively.
• A Send primitive has two parameters: the destination, and the buffer in the user
space that holds the data to be sent.
• The Receive primitive also has two parameters: the source from which the data is
to be received and the user buffer into which the data is to be received.
There are two ways of sending data when the Send primitive is called:

• Buffered: The standard option copies the data from the user buffer to the kernel
buffer. The data later gets copied from the kernel buffer onto the network. For the
Receive primitive, the buffered option is usually required because the data may
already have arrived when the primitive is invoked, and needs a storage place in
the kernel.
• Unbuffered: The data gets copied directly from the user buffer onto the network.

Blocking primitives
• The primitive commands wait for the message to be delivered; the execution of
the process is blocked.
• The sending process must wait after a send until an acknowledgement is made
by the receiver.
• The receiving process must wait for the expected message from the sending
process.
• A primitive is blocking if control returns to the invoking process only after the
processing for the primitive completes.
Non Blocking primitives
• If send is nonblocking, it returns control to the caller immediately, before the
message is sent.
• The advantage of this scheme is that the sending process can continue computing
in parallel with the message transmission, instead of having the CPU go idle.
• This is a form of asynchronous communication.
• A primitive is non-blocking if control returns back to the invoking process
immediately after invocation, even though the operation has not completed.
• For a non-blocking Send, control returns to the process even before the data
is copied out of the user buffer.
• For a non-blocking Receive, control returns to the process even before the data may have
arrived from the sender.
Synchronous
• A Send or a Receive primitive is synchronous if both the Send() and Receive()
handshake with each other.
• The processing for the Send primitive completes only after the invoking
processor learns that the corresponding Receive primitive has also been invoked
and that the receive operation has completed.
• The processing for the Receive primitive completes when the data to be
received is copied into the receiver’s user buffer.
Asynchronous
• A Send primitive is said to be asynchronous, if control returns back to the
invoking process after the data item to be sent has been copied out of the user-
specified buffer.
• For non-blocking primitives, a return parameter on the primitive call returns a
system-generated handle which can be later used to check the status of
completion of the call.
• The process can check for the completion:
o checking if the handle has been flagged or posted
o issue a Wait with a list of handles as parameters: usually blocks until one
of the parameter handles is posted.
The send and receive primitives can be implemented in four modes:
• Blocking synchronous
• Non- blocking synchronous
• Blocking asynchronous
• Non- blocking asynchronous

Four modes of send operation


Blocking synchronous Send:
• The data gets copied from the user buffer to the kernel buffer and is then sent over
the network.
• After the data is copied to the receiver’s system buffer and a Receive call has been
issued, an acknowledgement back to the sender causes control to return to the
process that invoked the Send operation and completes the Send.
Non-blocking synchronous Send:
• Control returns back to the invoking process as soon as the copy of data from the user
buffer to the kernel buffer is initiated.
• A parameter in the non-blocking call also gets set with the handle of a location that
the user process can later check for the completion of the synchronous send
operation.
• The location gets posted after an acknowledgement returns from the receiver.
• The user process can keep checking for the completion of the non-blocking
synchronous Send by testing the returned handle, or it can invoke the blocking Wait
operation on the returned handle
Blocking asynchronous Send:
• The user process that invokes the Send is blocked until the data is copied from the
user’s buffer to the kernel buffer.
Non-blocking asynchronous Send:
• The user process that invokes the Send is blocked until the transfer of the data from
the user’s buffer to the kernel buffer is initiated.
• Control returns to the user process as soon as this transfer is initiated, and a parameter
in the non-blocking call also gets set with the handle of a location that the user
process can check later using the Wait operation for the completion of the
asynchronous Send.
The asynchronous Send completes when the data has been copied out of the user’s
buffer. The checking for the completion may be necessary if the user wants to reuse the
buffer from which the data was sent.
Modes of receive operation
Blocking Receive:
The Receive call blocks until the data expected arrives and is written in the specified
user buffer. Then control is returned to the user process.
Non-blocking Receive:
• The Receive call will cause the kernel to register the call and return the handle
of a location that the user process can later check for the completion of the
non-blocking Receive operation.
• This location gets posted by the kernel after the expected data arrives and is
copied to the user-specified buffer. The user process can check for the
completion of the non-blocking Receive by invoking the Wait operation on the
returned handle. A sketch of these non-blocking primitives, using the
handle-and-Wait pattern, is given below.
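The handle-and-Wait pattern can be illustrated with the mpi4py library (assumed to be installed; this is one possible usage sketch, run with two ranks, e.g. mpirun -n 2 python example.py).

    # Non-blocking send and receive: each call returns a handle (request object)
    # that the process can later Wait on to check for completion.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        data = {"x": 1}
        req = comm.isend(data, dest=1, tag=11)   # non-blocking send returns a handle
        # ... the sender can keep computing while the message is in flight ...
        req.wait()                               # blocks until the send has completed
    else:
        req = comm.irecv(source=0, tag=11)       # non-blocking receive returns a handle
        # ... do other work ...
        data = req.wait()                        # blocks until the data has arrived
        print("received", data)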

Processor Synchrony

Processor synchrony indicates that all the processors execute in lock-step with their clocks
synchronized. Since exact lock-step is not attainable in a distributed system, the term more
generally means that the processors are synchronized at the granularity of larger steps of
code: a barrier ensures that no processor begins executing the next step of code until all the
processors have completed executing the previous steps of code assigned to them.

Libraries and standards


There exists a wide range of primitives for message-passing. The message-passing interface
(MPI) library and the PVM (parallel virtual machine) library are used largely by the
scientific community
• Message Passing Interface (MPI): This is a standardized and portable message-
passing system to function on a wide variety of parallel computers. MPI primarily
addresses the message-passing parallel programming model: data is moved from the
address space of one process to that of another process through cooperative
operations on each process.
• Parallel Virtual Machine (PVM): It is a software tool for parallel networking of
computers. It is designed to allow a network of heterogeneous Unix and/or Windows
machines to be used as a single distributed parallel processor.
• Remote Procedure Call (RPC): RPC is a common model of request-reply protocol and
a powerful technique for constructing distributed, client-server applications. The called
procedure need not exist in the same address space as the calling procedure; the two
processes may be on the same system, or on different systems connected by a network.
• Remote Method Invocation (RMI): RMI is a mechanism by which a programmer can
write object-oriented programs in which objects on different computers interact across
a distributed network.

• Common Object Request Broker Architecture (CORBA): CORBA describes a


messaging mechanism by which objects distributed over a network can communicate with
each other irrespective of the platform and language used to develop those objects.
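The request-reply style underlying RPC can be illustrated with Python's standard xmlrpc modules; this is only an illustrative sketch (the procedure name add and port 8000 are arbitrary), not CORBA or RMI themselves.

    from xmlrpc.server import SimpleXMLRPCServer
    import threading
    import xmlrpc.client

    def add(a, b):
        return a + b                       # procedure living in the server's address space

    server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
    server.register_function(add, "add")
    threading.Thread(target=server.serve_forever, daemon=True).start()

    # The client invokes the remote procedure as if it were a local call.
    proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
    print(proxy.add(2, 3))                 # prints: 5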
1.6 SYNCHRONOUS VS ASYNCHRONOUS EXECUTIONS
The execution of processes in distributed systems may be synchronous or asynchronous.

Asynchronous Execution:
A communication among processes is considered asynchronous, when every
communicating process can have a different observation of the order of the messages being
exchanged. In an asynchronous execution:
• there is no processor synchrony and there is no bound on the drift rate of processor
clocks
• message delays are finite but unbounded
• no upper bound on the time taken by a process
Fig: Asynchronous execution in message passing system

Synchronous Execution:
A communication among processes is considered synchronous when every process
observes the same order of messages within the system. In a synchronous execution:
• processors are synchronized and the clock drift rate between any two processors is
bounded
• message delivery times are such that they occur in one logical step or round
• upper bound on the time taken by a process to execute a
step.

Emulating an asynchronous system by a synchronous system (A → S)


An asynchronous program can be emulated on a synchronous system fairly trivially as the
synchronous system is a special case of an asynchronous system – all communication
finishes within the same round in which it is initiated.

Emulating a synchronous system by an asynchronous system (S → A)


A synchronous program can be emulated on an asynchronous system using a tool called
synchronizer.

Emulation for a fault free system


Fig 1.15: Emulations in a failure free message passing system
If system A can be emulated by system B, denoted A/B, and if a problem is not solvable in
B, then it is also not solvable in A. If a problem is solvable in A, it is also solvable in B.
Hence, in a sense, all four classes are equivalent in terms of computability in failure-free
systems.

1.7 DESIGN ISSUES AND CHALLENGES IN DISTRIBUTED SYSTEMS


The design of distributed systems has numerous challenges. They can be categorized
into:
• Issues related to system and operating systems design
• Issues related to algorithm design
• Issues arising due to emerging technologies
The above three classes are not mutually exclusive.

1.7.1 Issues related to system and operating systems design


The following are some of the common challenges to be addressed in designing a
distributed system from system perspective:
➢ Communication: This task involves designing suitable communication mechanisms
among the various processes in the networks.
Examples: RPC, RMI

➢ Processes: The main challenges involved are: process and thread management at
both client and server environments, migration of code between systems, design of software
and mobile agents.
➢ Naming: Devising easy to use and robust schemes for names, identifiers, and
addresses is essential for locating resources and processes in a transparent and scalable
manner. The remote and highly varied geographical locations make this task difficult.
➢ Synchronization: Mutual exclusion, leader election, deploying physical clocks,
global state recording are some synchronization mechanisms.
➢ Data storage and access schemes: Designing file systems for easy and efficient data
storage with implicit accessing mechanisms is essential for distributed operation.
➢ Consistency and replication: The notion of distributed systems goes hand in hand
with replication of data, to provide a high degree of scalability. The replicas should be handled
with care, since data consistency is a prime issue.

➢ Fault tolerance: This requires maintenance of fail proof links, nodes, and processes.
Some of the common fault tolerant techniques are resilience, reliable communication,
distributed commit, checkpointing and recovery, agreement and consensus, failure detection,
and self-stabilization.
➢ Security: Cryptography, secure channels, access control, key management –
generation and distribution – authorization, and secure group management are some of the
security measures imposed on distributed systems.
➢ Applications Programming Interface (API) and transparency: User
friendliness and ease of use are very important if distributed services are to be used by a
wide community. Transparency, which is hiding the inner implementation policy from users, is
of the following types:

▪ Access transparency: hides differences in data representation


▪ Location transparency: hides differences in locations by providing uniform access to
data located at remote locations.
▪ Migration transparency: allows relocating resources without changing names.
▪ Replication transparency: Makes the user unaware whether he is working on
original or replicated data.
▪ Concurrency transparency: Masks the concurrent use of shared resources for the
user.
▪ Failure transparency: system being reliable and fault-tolerant.
➢ Scalability and modularity: The algorithms, data and services must be as distributed
as possible. Various techniques such as replication, caching and cache management, and
asynchronous processing help to achieve scalability.
1.7.2 Algorithmic challenges in distributed computing
➢ Designing useful execution models and frameworks
The interleaving model, partial order model, input/output automata model and the Temporal
Logic of Actions (TLA) are some examples of models that provide different degrees of
infrastructure.
➢ Dynamic distributed graph algorithms and distributed routing algorithms
• The distributed system is generally modeled as a distributed graph.
• Hence graph algorithms form the basis for a large number of higher-level
communication, data dissemination, object location, and object search functions.
• These algorithms must be able to deal with highly dynamic graph
characteristics. They are expected to function like routing algorithms.
• The performance of these algorithms has direct impact on user-perceived latency, data
traffic and load in the network.
➢ Time and global state in a distributed system

• Geographically remote resources demand synchronization based on logical
time.
• Logical time is relative and eliminates the overhead of providing physical time to
applications. Logical time can
(i) capture the logic and inter-process dependencies, and
(ii) track the relative progress at each process.
• Maintaining the global state of the system across space involves the time
dimension for consistency. This can be done with extra effort in a coordinated manner.
• Deriving appropriate measures of concurrency also involves the time dimension, as
the execution and communication speeds of threads may vary a lot.
➢ Synchronization/coordination mechanisms
• Synchronization is essential for the distributed processes to facilitate concurrent
execution without affecting other processes.

• The synchronization mechanisms also involve resource management and


concurrency management mechanisms.
• Some techniques for providing synchronization are:
✓ Physical clock synchronization: Physical clocks usually diverge in their values due
to hardware limitations. Keeping them synchronized is a fundamental challenge in maintaining
common time.
✓ Leader election: All the processes need to agree on which process will play the
role of a distinguished process, or leader. A leader is necessary even for many
distributed algorithms because there is often some asymmetry.
✓ Mutual exclusion: Access to the critical resource(s) has to be coordinated.

✓ Deadlock detection and resolution: Deadlock detection should be coordinated to
avoid duplicate work, and deadlock resolution should be coordinated to avoid unnecessary
aborts of processes.
✓ Termination detection: This requires cooperation among the processes to detect the
specific global state of quiescence.
✓ Garbage collection: Detecting garbage requires coordination among the processes.
➢ Group communication, multicast, and ordered message delivery
• A group is a collection of processes that share a common context and collaborate on a
common task within an application domain. Group management protocols are needed for
group communication wherein processes can join and leave groups dynamically, or fail.
➢ Monitoring distributed events and predicates
• Predicates defined on program variables that are local to different processes are used
for specifying conditions on the global system state.
• On-line algorithms for monitoring such predicates are hence important.
• The specification of such predicates uses physical or logical time relationships.
➢ Distributed program design and verification tools
Methodically designed and verifiably correct programs can greatly reduce the overhead of
software design, debugging, and engineering. Designing these is a big challenge.
➢ Debugging distributed programs
Debugging distributed programs is much harder because of concurrency and replication.
Adequate debugging mechanisms and tools are the need of the hour.
➢ Data replication, consistency models, and caching
• Fast access to data and other resources is important in distributed systems.
Managing replicas and their updates faces concurrency problems.
• Placement of the replicas in the systems is also a challenge because resources
usually cannot be freely replicated.
➢ World Wide Web design – caching, searching, scheduling
• WWW is a commonly known distributed system.
• The issues of object replication and caching, prefetching of objects have to be done on
WWW also.
• Object search and navigation on the web are important functions in the operation of
the web.
➢ Distributed shared memory abstraction
• The shared memory abstraction is easier to program with, since the programmer
need not manage the communication tasks.
• The communication is done by the middleware via message passing.
• The overhead of providing shared memory is dealt with by the middleware technology.
• Some of the mechanisms used for communication in shared memory
distributed systems are:
✓ Wait-free algorithms: A wait-free algorithm gives a process the ability to complete its
execution irrespective of the actions of other processes. Such algorithms control access to
shared resources in the shared memory abstraction, but they are expensive.
✓ Mutual exclusion: Concurrent access of processes to a shared resource or data is
executed in mutually exclusive manner. Only one process is allowed to execute the critical
section at any given time. In a distributed system, shared variables or a local kernel cannot
be used to implement mutual exclusion. Message passing is the sole means for implementing
distributed mutual exclusion.

✓ Register constructions: Registers must be designed in such a way that they allow
concurrent access without any restrictions on the concurrency permitted.
➢ Reliable and fault-tolerant distributed systems
The following are some of the fault tolerant strategies:
✓ Consensus algorithms: Consensus algorithms allow correctly functioning processes
to reach agreement among themselves in spite of the existence of malicious processes. The
goal of the malicious processes is to prevent the correctly functioning processes from
reaching agreement. The malicious processes operate by sending messages with misleading
information, to confuse the correctly functioning processes.
✓ Replication and replica management: The Triple Modular Redundancy (TMR)
technique is used in software and hardware implementation. TMR is a fault-tolerant form of
N-modular redundancy, in which three systems perform a process and that result is
processed by a majority-voting system to produce a single output.
✓ Voting and quorum systems: Providing redundancy in the active or passive
components in the system and then performing voting based on some quorum criterion is a
classical way of dealing with fault-tolerance. Designing efficient algorithms for this
purpose is the challenge.
✓ Distributed databases and distributed commit: The distributed databases should
also follow atomicity, consistency, isolation and durability (ACID) properties.
✓ Self-stabilizing systems: A self-stabilizing algorithm guarantees to take the system to
a good state even if a bad state arises due to some error. Self-stabilizing algorithms
require some built-in redundancy to track additional state variables and to do the extra work.
✓ Checkpointing and recovery algorithms: Checkpointing is periodically recording
the current state on secondary storage so that, in case of a failure, the entire computation is
not lost but can be recovered from one of the recently taken checkpoints. Checkpointing in a
distributed environment is difficult because, if the checkpoints at the different processes are
not coordinated, the local checkpoints may become useless since they are inconsistent with
the checkpoints at other processes.
✓ Failure detectors: Asynchronous distributed systems have no bound on message
transmission time, which makes it impossible for a receiver to know how long to wait for a
message. Failure detectors probabilistically suspect another process as having
failed and then converge on a determination of the up/down status of the suspected process.
➢ Load balancing
The objective of load balancing is to gain higher throughput and reduce user-perceived
latency. Load balancing may be necessary because of a variety of factors, such
as high network traffic or a high request rate causing the network connection to become a
bottleneck, or a high computational load. The following are some forms of load balancing:
✓ Data migration: The ability to move data around in the system, based on the access
pattern of the users
✓ Computation migration: The ability to relocate processes in order to perform
a redistribution of the workload.
✓ Distributed scheduling: This achieves a better turnaround time for the users by
using idle processing power in the system more efficiently.
➢ Real-time scheduling
Real-time scheduling becomes more challenging when a global view of the system state is
absent and on-line or dynamic changes are more frequent. The message propagation delays,
which are network-dependent, are hard to control or predict; this is a hindrance to meeting the
QoS requirements of the network.

➢ Performance
User perceived latency in distributed systems must be reduced. The common issues in
performance:
✓ Metrics: Appropriate metrics must be defined for measuring the performance of
theoretical distributed algorithms and their implementations.
✓ Measurement methods/tools: As the distributed system is a complex entity,
appropriate methodologies and tools must be developed for measuring the performance
metrics.
1.7.3 Applications of distributed computing and newer challenges
The deployment environment of distributed systems ranges from mobile systems to
cloud storage. All the environments have their own challenges:
➢ Mobile systems
o Mobile systems which use wireless communication in shared broadcast
medium have issues related to physical layer such as transmission range,
power, battery power consumption, interfacing with wired internet, signal
processing and interference.
o The issues pertaining to other higher layers include routing, location
management, channel allocation, localization and position estimation, and
mobility management.
o Apart from the above mentioned common challenges, the architectural
differences of the mobile network demands varied treatment. The two
architectures are:
✓ Base-station approach (cellular approach): The geographical region is divided into
hexagonal physical locations called cells. The powerful base station transmits signals to all
other nodes in its range

✓ Ad-hoc network approach: This is an infrastructure-less approach which does not
have any base station to transmit signals; instead, all the responsibility is distributed among
the mobile nodes.
✓ The two approaches work in different environments with different
principles of communication. Designing a distributed system to cater to these varied needs is a
great challenge.

➢ Sensor networks
o A sensor is a processor with an electro-mechanical interface that is capable of
sensing physical parameters.
o They are low cost equipment with limited computational power and battery
life. They are designed to handle streaming data and route it to external
computer network and processes.
o They are susceptible to faults and have to reconfigure themselves.
o These features introduce a whole new set of challenges, such as position
estimation and time estimation, when designing a distributed system.
➢ Ubiquitous or pervasive computing
o In Ubiquitous systems the processors are embedded in the environment to
perform application functions in the background.
o Examples: Intelligent devices, smart homes etc.
o They are distributed systems with recent advancements operating in wireless
environments through actuator mechanisms.
o They can be self-organizing and network-centric with limited resources.
➢ Peer-to-peer computing
o Peer-to-peer (P2P) computing is computing over an application-layer
network where all interactions among the processors are at the same level.
o This is a form of symmetric computation, in contrast to the client–server paradigm.
o P2P networks are self-organizing, with or without a regular structure to the network.
o Some of the key challenges include: object storage mechanisms; efficient object lookup
and retrieval in a scalable manner; dynamic reconfiguration with nodes as well as objects
joining and leaving the network randomly; replication strategies to expedite object search;
trade-offs between object size latency and table sizes; and anonymity, privacy, and security.
➢ Publish-subscribe, content distribution, and multimedia
o Present-day users require only the information of interest to them.
o In a dynamic environment where the information constantly fluctuates, there
is great demand for:
o Publish: an efficient mechanism for distributing this information;
o Subscribe: an efficient mechanism to allow end users to indicate interest in
receiving specific kinds of information;
o an efficient mechanism for aggregating large volumes of published
information and filtering it as per the user’s subscription filter.
o Content distribution refers to a mechanism that categorizes the information
based on parameters.
o Publish–subscribe and content distribution overlap with each other.
o Multimedia data introduces special issues because of its large size.
➢ Distributed agents
o Agents are software processes, or sometimes robots, that move around the
system to perform the specific tasks for which they are programmed.
o Agents collect and process information and can exchange such
information with other agents.
o Challenges in distributed agent systems include coordination mechanisms
among the agents, controlling the mobility of the agents, and their software design
and interfaces.
➢ Distributed data mining
o Data mining algorithms process large amounts of data to detect patterns and
trends in the data, to mine or extract useful information.
o The mining can be done by applying database and artificial intelligence
techniques to a data repository.
➢ Grid computing
• Grid computing is deployed to manage resources; for instance, idle CPU
cycles of machines connected to the network are made available to others.
• The challenges include: scheduling jobs, a framework for implementing quality
of service, real-time guarantees, and security.
➢ Security in distributed systems
The challenges of security in a distributed setting include confidentiality,
authentication, and availability. These must be addressed with efficient and scalable solutions.

1.8 A MODEL OF DISTRIBUTED COMPUTATIONS: DISTRIBUTED PROGRAM


• A distributed program is composed of a set of asynchronous processes that
communicate by message passing over the communication network. Each process
may run on a different processor.
• The processes do not share a global memory and communicate solely by passing
messages. These processes do not share a global clock that is instantaneously
accessible to these processes.
• Process execution and message transfer are asynchronous – a process may execute an
action spontaneously and a process sending a message does not wait for the delivery
of the message to be complete.
• The global state of a distributed computation is composed of the states of the
processes and the communication channels. The state of a process is characterized by
the state of its local memory and depends upon the context.
• The state of a channel is characterized by the set of messages in transit in the channel.
A MODEL OF DISTRIBUTED EXECUTIONS

• The execution of a process consists of a sequential execution of its actions.


• The actions are atomic and the actions of a process are modeled as three types of
events: internal events, message send events, and message receive events.
• An internal event changes the state of the process at which it occurs.
• A send event changes the state of the process that sends the message and the state of
the channel on which the message is sent.
• The execution of process pi produces a sequence of events ei1, ei2, ei3, …, and is
denoted by Hi = (hi, →i), where hi is the set of events produced by pi and →i is the
binary relation defining the causal dependencies among the events of pi.
• →msg denotes the dependency that exists due to message passing between two events.

Fig Space time distribution of distributed systems

• An internal event changes the state of the process at which it occurs. A send event
changes the state of the process that sends the message and the state of the channel
on which the message is sent.
• A receive event changes the state of the process that receives the message and the
state of the channel on which the message is received.
Causal Precedence Relations
Causal message ordering is a partial ordering of messages in a distributed computing
environment: messages are delivered to a process in the causal order in which they were
transmitted to that process.

It places a restriction on communication between processes by requiring that if the
transmission of message mi to process pk necessarily preceded the transmission of message
mj to the same process, then the delivery of these messages to that process must be ordered
such that mi is delivered before mj.
Happened-Before Relation
The partial ordering obtained by generalizing the relationship between two processes is called
the happened-before relation, or causal ordering, or potential causal ordering. This term
was coined by Lamport. Happened-before defines a partial order of events in a distributed
system; some events cannot be placed in the order. We write A → B if A happens before B.
A → B is defined using the following rules:
✓ Local ordering: A and B occur on the same process and A occurs before B.
✓ Messages: send(m) → receive(m) for any message m.
✓ Transitivity: e → e’’ if e → e’ and e’ → e’’.
• Ordering can be based on two situations:
1. If two events occur in same process then they occurred in the order observed.
2. During message passing, the event of sending message occurred before the event of
receiving it.

Lamport’s ordering is the happened-before relation, denoted by →:

• a→b, if a and b are events in the same process and a occurred before b.
• a→b, if a is the event of sending a message m in a process and b is the event of the
same message m being received by another process.
• If a→b and b→c, then a→c; that is, the relation → is transitive.

When one of the above conditions is satisfied, a and b are causally related (a→b). Consider
two events c and d: if both c→d and d→c are false, then c and d are not causally related and
are said to be concurrent events, denoted c||d.

Fig Communication between processes


The figure shows the communication of messages m1 and m2 between three processes p1, p2
and p3, with events a, b, c, d, e and f. It can be inferred from the diagram that a→b; c→d;
e→f; b→c; d→f; a→d; a→f; b→d; b→f. Also a||e and c||e.

Logical vs physical concurrency


Physical and logical concurrency are two notions that often cause confusion in
distributed systems.
Logical concurrency: Several units of the same program appear to execute simultaneously,
but the actual execution is interleaved on a single processor.
Physical concurrency: Several program units of the same program actually execute at the
same time on multiple processors.

Differences between logical and physical concurrency

Logical concurrency: Several units of the same program execute simultaneously on the same
processor, giving the programmer the illusion that they are executing on multiple processors;
it is implemented through interleaving.
Physical concurrency: Several program units of the same program execute at the same time
on different processors; it is implemented on machines with I/O channels, multiple CPUs, or a
network of uni- or multi-CPU machines.
MODELS OF COMMUNICATION NETWORK
The three main types of communication models in distributed systems are:
FIFO (first-in, first-out): each channel acts as a FIFO message queue.
Non-FIFO (N-FIFO): a channel acts like a set into which the sender process adds messages and
from which the receiver removes messages in random order.
Causal Ordering (CO): message delivery respects Lamport’s happened-before ordering of the
corresponding send events.
o The relation between the three models is given by CO ⊂ FIFO ⊂ Non-FIFO.

A system that supports the causal ordering model satisfies the following property: for any two
messages mij and mkj sent to the same process pj, if send(mij) → send(mkj), then
rec(mij) → rec(mkj).

GLOBAL STATE

A distributed snapshot represents a state in which the distributed system might have been. A
snapshot of the system is a single configuration of the system.

• The global state of a distributed system is a collection of the local states of its components,
namely the processes and the communication channels.
• The state of a process at any time is defined by the contents of processor registers, stacks,
local memory, etc., and depends on the local context of the distributed application.
• The state of a channel is given by the set of messages in transit in the channel.
UNIT II

LOGICAL TIME & GLOBAL STATE

Logical clocks are based on capturing chronological and causal relationships of processes and
ordering events based on these relationships.

Three types of logical clock are maintained in distributed systems:


• Scalar clock
• Vector clock
• Matrix clock

In a system of logical clocks, every process has a logical clock that is advanced using a set
of rules. Every event is assigned a timestamp and the causality relation between events can
be generally inferred from their timestamps.
The timestamps assigned to events obey the fundamental monotonicity property; that is, if
an event a causally affects an event b, then the timestamp of a is smaller than the timestamp
of b.
A Framework for a system of logical clocks

A system of logical clocks consists of a time domain T and a logical clock C. Elements of T form a
partially ordered set over a relation <. This relation is usually called the happened-before or
causal precedence relation.

The logical clock C is a function that maps an event e in a distributed system to an element
in the time domain T, denoted C(e), such that
for any two events ei and ej: ei → ej ⟹ C(ei) < C(ej).
This monotonicity property is called the clock consistency condition. When T and C
satisfy the stronger condition
ei → ej ⟺ C(ei) < C(ej),

then the system of clocks is said to be strongly consistent.

Implementing logical clocks


The two major issues in implementing logical clocks are:
Data structures: representation of each process
Protocols: rules for updating the data structures to ensure consistent conditions.

Data structures:
Each process pi maintains data structures that give it the following capabilities:
• A local logical clock (lci), that helps process pi measure its own progress.
• A logical global clock (gci), that is a representation of process pi’s local view of the
logical global time. It allows this process to assign consistent timestamps to its local events.
Protocol:
The protocol ensures that a process’s logical clock, and thus its view of the global time,
is managed consistently. It consists of the following two rules:
Rule 1: Decides the updates of the logical clock by a process. It controls send, receive and
other operations.
Rule 2: Decides how a process updates its global logical clock to update its view of the
global time and global progress. It dictates what information about the logical time is
piggybacked in a message and how this information is used by the receiving process to
update its view of the global time.

2.1.1 SCALAR TIME


Scalar time, designed by Lamport, is used to synchronize and order all the events in a
distributed system. A Lamport logical clock is an incrementing counter maintained in each
process. When a process receives a message, it resynchronizes its logical clock with the
sender’s, maintaining the causal relationship.
Lamport timestamps are governed by the following rules:
• All the process counters start with value 0.
• A process increments its counter for each event (internal event, message send,
message receive) in that process.
• When a process sends a message, it includes its (incremented) counter value with the
message.
• On receiving a message, the counter of the recipient is updated to the greater of its
current counter and the timestamp in the received message, and then incremented by
one.

• If Ci is the local clock for process Pi then,


• if a and b are two successive events in Pi, then Ci(b) = Ci(a) + d1, where d1 > 0
• if a is the sending of message m by Pi, then m is assigned timestamp tm = Ci(a)
• if b is the receipt of m by Pj, then Cj(b) = max{Cj(b), tm + d2}, where d2 > 0

Rules of Lamport’s clock


Rule 1: Ci(b) = Ci(a) + d1, where d1 > 0
Rule 2: The following actions are implemented when pi receives a message m with timestamp Cm:
a) Ci= max(Ci, Cm)
b) Execute Rule 1
c) deliver the message
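A minimal sketch of these rules in Python, assuming the increments d1 = d2 = 1 (the class and method names are illustrative):

    class LamportClock:
        def __init__(self):
            self.c = 0

        def internal_event(self):
            self.c += 1                    # Rule 1: increment at each event
            return self.c

        def send_event(self):
            self.c += 1                    # Rule 1
            return self.c                  # timestamp piggybacked on the message

        def receive_event(self, msg_ts):
            self.c = max(self.c, msg_ts)   # Rule 2(a): take the maximum
            self.c += 1                    # Rule 2(b): then execute Rule 1
            return self.c

    p1, p2 = LamportClock(), LamportClock()
    t = p1.send_event()                    # p1 sends a message with timestamp 1
    print(p2.receive_event(t))             # prints 2: p2's clock jumps past the timestamp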

Fig 1.20: Evolution of scalar time


Basic properties of scalar time:
1. Consistency property: The scalar clock satisfies the monotonicity (clock consistency)
condition: a monotonic clock only increments its timestamp and never jumps backwards.
Hence it is consistent.

2. Total ordering: Scalar clocks can be used to totally order the events in a distributed
system. However, two or more events at different processes may have an identical timestamp,
so a tie-breaking mechanism is needed to order such events. The tie is broken as follows:
• Process identifiers are linearly ordered.
• The process with the lower identifier value is given higher priority.

The term (t, i) denotes the timestamp of an event, where t is its time of occurrence and i is the
identity of the process at which it occurred.
The total order relation ≺ over two events x and y with timestamps (h, i) and (k, j), respectively,
is defined as: x ≺ y ⟺ (h < k) or (h = k and i < j).

A total order is generally used to ensure liveness properties in distributed algorithms.

3. Event Counting
If an event e has a timestamp h, then h − 1 represents the minimum logical duration,
counted in units of events, required before producing the event e; this is called the height of
the event e. In other words, h − 1 events have been produced sequentially before the event e,
regardless of the processes that produced these events.

4. No strong consistency
Scalar clocks are not strongly consistent because the logical local clock and the logical
global clock of a process are squashed into one, resulting in the loss of causal dependency
information among events at different processes.

2.1.2 VECTOR TIME


The ordering from Lamport's clocks is not enough to guarantee that if two events
precede one another in the ordering relation they are also causally related. Vector Clocks use
a vector counter instead of an integer counter. The vector clock of a system with N processes
is a vector of N counters, one counter per process. Vector counters have to follow the
following update rules:
• Initially, all counters are zero.
• Each time a process experiences an event, it increments its own counter in the vector
by one.
• Each time a process sends a message, it includes a copy of its own (incremented)
vector in the message.
• Each time a process receives a message, it increments its own counter in the vector by
one and updates each element in its vector by taking the maximum of the value in its
own vector counter and the value in the vector in the received message.

The time domain is represented by a set of n-dimensional non-negative integer vectors in vector
time.

Rules of Vector Time


Rule 1: Before executing an event, process pi updates its local logical time
as follows:
vti[i] := vti[i] + d     (d > 0)

Rule 2: Each message m is piggybacked with the vector clock vt of the sender
process at sending time. On the receipt of such a message (m, vt), process
pi executes the following sequence of actions:
1. update its global logical time: vti[k] := max(vti[k], vt[k]) for 1 ≤ k ≤ n
2. execute Rule 1
3. deliver the message m
A small sketch of these rules, and of the causality test they enable, is given below.
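This is a minimal sketch for n processes, assuming d = 1 (the class and function names are illustrative):

    class VectorClock:
        def __init__(self, pid, n):
            self.pid, self.vt = pid, [0] * n

        def tick(self):                          # Rule 1: increment own component
            self.vt[self.pid] += 1

        def send(self):
            self.tick()
            return list(self.vt)                 # Rule 2: piggyback a copy of vt on the message

        def receive(self, msg_vt):
            # Rule 2: component-wise maximum, then execute Rule 1
            self.vt = [max(a, b) for a, b in zip(self.vt, msg_vt)]
            self.tick()

    def happened_before(vh, vk):                 # x -> y iff vh < vk component-wise
        return all(a <= b for a, b in zip(vh, vk)) and vh != vk

    p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
    m = p0.send()
    p1.receive(m)
    print(happened_before(m, p1.vt))             # True: the send causally precedes the receive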

Fig 1.21: Evolution of vector time


Basic properties of vector time
1. Isomorphism:
• “→” induces a partial order on the set of events that are produced by a distributed
execution.
• If events x and y are timestamped vh and vk, respectively, then
x → y ⟺ vh < vk, and x || y ⟺ ¬(vh < vk) and ¬(vk < vh).
• If the processes at which the events occurred are known, the test to compare two
timestamps can be simplified: if x and y occurred at processes pi and pj, respectively, then
x → y ⟺ vh[i] ≤ vk[i], and x || y ⟺ vh[i] > vk[i] and vk[j] > vh[j].
2. Strong consistency
The system of vector clocks is strongly consistent; thus, by examining the vector timestamp
of two events, we can determine if the events are causally related.
3. Event counting
If an event e has timestamp vh, then vh[j] denotes the number of events executed by process
pj that causally precede e.

2.2 PHYSICAL CLOCK SYNCHRONIZATION: NETWORK TIME PROTOCOL (NTP)

Centralized systems do not need clock synchronization, as they work under a common
clock. But distributed systems do not have a common clock: each system functions based
on its own internal clock and its own notion of time. The time in distributed systems is
measured in the following contexts:
• The time of the day at which an event happened on a specific machine in the network.
• The time interval between two events that happened on different machines in the
network.
• The relative ordering of events that happened on different machines in the network.

Clock synchronization is the process of ensuring that physically distributed processors have a
common notion of time.

Due to differing clock rates, the clocks at various sites may diverge with time, and
periodically a clock synchronization must be performed to correct this clock skew in
distributed systems. Clocks are synchronized to an accurate real-time standard like UTC
(Universal Coordinated Time). Clocks that must not only be synchronized with each other
but also have to adhere to physical time are termed physical clocks. This degree of
synchronization additionally enables the coordination and scheduling of actions between
multiple computers connected to a common network.

Basic terminologies:
If Ca and Cb are two different clocks, then:
• Time: The time of a clock in a machine p is given by the function Cp(t), where
Cp(t) = t for a perfect clock.
• Frequency: Frequency is the rate at which a clock progresses. The frequency at time t
of clock Ca is Ca’(t).
• Offset: Clock offset is the difference between the time reported by a clock and the
real time. The offset of clock Ca is given by Ca(t) − t. The offset of clock Ca
relative to Cb at time t ≥ 0 is given by Ca(t) − Cb(t).
• Skew: The skew of a clock is the difference in the frequencies of the clock and
the perfect clock. The skew of clock Ca relative to clock Cb at time t is
Ca’(t) − Cb’(t).
• Drift (rate): The drift of clock Ca is the second derivative of the clock value with
respect to time, i.e., Ca’’(t). The drift of Ca relative to clock Cb at time t is
Ca’’(t) − Cb’’(t).
Clocking Inaccuracies
Physical clocks are synchronized to an accurate real-time standard like UTC
(Universal Coordinated Time). Due to the clock inaccuracies discussed above, a timer (clock)
is said to be working within its specification if
1 − ρ ≤ dC/dt ≤ 1 + ρ,
where the constant ρ is the maximum skew rate specified by the manufacturer.

1. Offset delay estimation

The Network Time Protocol (NTP), widely used to synchronize clocks on the Internet, is a
time service that synchronizes clients to UTC; it achieves reliability through redundant
servers and paths, is scalable, and authenticates time sources. The design of NTP
involves a hierarchical tree of time servers, with the primary server at the root synchronizing
with UTC. The next level contains secondary servers, which act as a backup to the primary
server. At the lowest level is the synchronization subnet, which contains the clients.

2. Clock offset and delay estimation


A source node cannot accurately estimate the local time on the target node due to
varying message or network delays between the nodes. This protocol employs the
common practice of performing several trials and choosing the trial with the minimum
delay.
Fig : Behavior of clocks

Fig a) Offset and delay estimation between processes from the same server;
Fig b) Offset and delay estimation between processes from different servers

Let T1, T2, T3, T4 be the values of the four most recent timestamps, and assume that the
clocks A and B are stable and running at the same speed. Let a = T1 − T3 and b = T2 − T4.
If the network delay difference from A to B and from B to A, called the differential delay, is
small, the clock offset θ and roundtrip delay δ of B relative to A at time T4 are
approximately given by:

θ = (a + b)/2,   δ = a − b

Each NTP message includes the latest three timestamps T1, T2, and T3, while
T4 is determined upon arrival.
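A small numeric sketch of this computation (the timestamp values below are made up; only the formulas θ = (a + b)/2 and δ = a − b from above are assumed):

    def ntp_estimate(t1, t2, t3, t4):
        a = t1 - t3
        b = t2 - t4
        offset = (a + b) / 2      # estimated offset of B's clock relative to A
        delay = a - b             # estimated round-trip delay
        return offset, delay

    # Illustrative values: offset of about 1 ms and round-trip delay of about 6 ms.
    print(ntp_estimate(10.004, 10.008, 10.000, 10.010))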

2.3 MESSAGE ORDERING AND GROUP COMMUNICATION


As distributed systems are a network of systems at various physical locations, the
coordination between them should always be preserved. Message ordering refers to the
order in which messages are delivered to their intended recipients. The common message
ordering schemes are first-in first-out (FIFO), non-FIFO, causal order and synchronous order.
In the case of group communication with multicasting, causal and total ordering schemes are
followed. It is also essential to define the behaviour of the system in case of failures. The
following notation is widely used in this chapter:
• A distributed system is denoted by a graph (N, L).
• The set of events is represented by the event set (E, ≺), where ≺ is the causality relation.
• A message is denoted mi; its send and receive events are denoted si and ri, respectively.
• send(M) and receive(M) denote the send and receive events of a message M.
• a ∼ b denotes that events a and b occur at the same process.
• The set of send–receive pairs is T = {(s, r) ∈ Ei × Ej | s corresponds to r}.
2.3.1 MESSAGE ORDERING PARADIGMS
The message orderings are

(i) non-FIFO
(ii) FIFO
(iii) causal order
(iv) synchronous order

There is always a trade-off between concurrency and ease of use and implementation.

Asynchronous Executions
An asynchronous execution (or A-execution) is an execution (E, ≺) for which the causality relation
is a partial order.
• There need not be any ordering constraint on message delivery in an asynchronous execution.
• The messages can be delivered in any order (non-FIFO).
• Even though a physical link typically delivers the messages sent on it in FIFO order due
to the physical properties of the medium, a logical link may be formed as a composite of
physical links, and multiple paths may exist between the two end points of the logical
link.
Fig 2.1: a) FIFO executions b) non FIFO executions

FIFO executions

A FIFO execution is an A-execution in which, for all (s, r) and (s’, r’) ∈ T,
(s ∼ s’ and r ∼ r’ and s ≺ s’) ⟹ r ≺ r’.

• On any logical link, messages are delivered in the order in which they are sent, even if
the underlying physical channel is non-FIFO.
• A FIFO logical channel can be created over a non-FIFO channel by using a
separate numbering scheme to sequence the messages on each logical channel.
• The sender assigns and appends a <sequence_num, connection_id> tuple to each
message.
• The receiver uses a buffer to order the incoming messages as per the sender’s
sequence numbers, and accepts only the “next” message in sequence. A sketch of this
scheme is given below.
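A minimal sketch of this sequencing scheme (illustrative names; a single logical channel shown):

    class FifoSender:
        def __init__(self, connection_id):
            self.connection_id, self.seq = connection_id, 0

        def send(self, payload):
            self.seq += 1
            # <sequence_num, connection_id> tuple appended to the message
            return (self.seq, self.connection_id, payload)

    class FifoReceiver:
        def __init__(self):
            self.expected, self.buffer = 1, {}

        def on_message(self, msg):
            seq, _conn, payload = msg
            self.buffer[seq] = payload           # hold out-of-order arrivals
            delivered = []
            while self.expected in self.buffer:  # accept only the "next" message in sequence
                delivered.append(self.buffer.pop(self.expected))
                self.expected += 1
            return delivered

    s, r = FifoSender("c1"), FifoReceiver()
    m1, m2, m3 = s.send("a"), s.send("b"), s.send("c")
    print(r.on_message(m2))    # []          (arrived early, buffered)
    print(r.on_message(m1))    # ['a', 'b']  (m1 unblocks the buffered m2)
    print(r.on_message(m3))    # ['c']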

Causally Ordered (CO) executions

A CO execution is an A-execution in which, for all (s, r) and (s′, r′) ∈ T,
(r ∼ r′ and s ≺ s′) ⇒ r ≺ r′.

Fig: CO Execution
• If two send events s and s′ are related by causality (not merely physical time
ordering), then a causally ordered execution requires that their corresponding receive
events r and r′ occur in the same order at all common destinations.
Applications of causal order:
Applications that require updates to shared data, such as implementing distributed shared
memory, and applications that require fair resource allocation, such as distributed mutual exclusion.

Causal Order(CO) for Implementations:

If send(m1) ≺ send(m2), then for each common destination d of messages m1 and m2,
deliverd(m1) ≺ deliverd(m2) must be satisfied.
Other properties of causal ordering
1. Message Order (MO): An MO execution is an A-execution in which, for all
(s, r) and (s′, r′) ∈ T, s ≺ s′ ⇒ ¬(r′ ≺ r).
Fig: Not a CO execution

Empty Interval Execution: An execution (E, ≺) is an empty-interval (EI) execution if,
for each pair of events (s, r) ∈ T, the open interval set {x ∈ E | s ≺ x ≺ r}
in the partial order is empty.


An execution (E, ≺) is CO if and only if, for each pair of events (s, r) ∈ T and each event e ∈ E,
• weak common past: e ≺ r ⇒ ¬(s ≺ e);
• weak common future: s ≺ e ⇒ ¬(e ≺ r).
Synchronous Execution

• When all the communication between pairs of processes uses synchronous send and
receives primitives, the resulting order is the synchronous order.
• Since synchronous communication always involves a handshake between the receiver
and the sender, the handshake events may appear to be occurring instantaneously and
atomically.

Causality in a synchronous execution


The synchronous causality relation << on E is the smallest transitive relation that satisfies the following:
S1: If x occurs before y at the same process, then x << y.
S2: If (s, r) ∈ T, then for all x ∈ E, [(x << s ⇐⇒ x << r) and (s << x ⇐⇒ r << x)].
S3: If x << y and y << z, then x << z.
Synchronous Execution:
A synchronous execution (or S-execution) is an execution (E, <<) for which the causality relation << is
a partial order.
Fig) Execution in an asynchronous system Fig) Execution in a synchronous system

Timestamping a synchronous execution

An execution (E, ≺) is synchronous if and only if there exists a mapping from E to T (scalar timestamps)
such that
• for any message M, T(s(M)) = T(r(M));
• for each process Pi, if ei ≺ ei′ then T(ei) < T(ei′).

2.4 ASYNCHRONOUS EXECUTION WITH SYNCHRONOUS COMMUNICATION


When all the communication between pairs of processes uses synchronous send
and receive primitives, the resulting order is the synchronous order. If a program is written for an
asynchronous system, say a FIFO system, will it still execute correctly if the communication
is done by synchronous primitives instead? There is a possibility that the program may deadlock,
as shown in the figure below.

Fig) A communication program for an asynchronous system that deadlocks when using
synchronous primitives

Fig) Illustration of an asynchronous execution and of crowns: crown of size 2 and crown of size 3
Realizable Synchronous Communication (RSC)
An A-execution that can be realized under synchronous communication is called a realizable with
synchronous communication (RSC) execution.
An execution can be modeled to give a total order that extends the partial order (E, ≺).
In an A-execution, the messages can be made to appear instantaneous if there exists a linear extension of
the execution such that each send event is immediately followed by its corresponding receive event in
this linear extension.
A non-separated linear extension of (E, ≺) is a linear extension of (E, ≺) such that, for
each pair (s, r) ∈ T, the interval {x ∈ E | s ≺ x ≺ r} is empty.
A-execution (E, ≺) is an RSC execution if and only if there exists a non-separated linear extension of
the partial order (E, ≺).
In the non-separated linear extension, if the adjacent send event and its corresponding receive event are
viewed atomically, then that pair of events shares a common past and a common future with each other.
Crown
Let E be an execution. A crown of size k in E is a sequence <(si, ri), i ∈ {0,…, k−1}> of pairs of
corresponding send and receive events such that: s0 ≺ r1, s1 ≺ r2, …, sk−2 ≺ rk−1, sk−1 ≺ r0.
In the figure above, the crown is <(s1, r1), (s2, r2)> as we have s1 ≺ r2 and s2 ≺ r1. Cyclic dependencies may exist in a
crown. The crown criterion states that an A-computation is RSC, i.e., it can be realized on a system
with synchronous communication, if and only if it contains no crown.
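The crown criterion can be checked mechanically: treat each (send, receive) pair as a node and draw an edge from pair i to pair j whenever si ≺ rj; a crown is then exactly a cycle in this graph. The sketch below assumes a user-supplied precedes(a, b) predicate (for instance derived from vector timestamps); it is an illustration, not library code.

# Sketch of the crown criterion: the execution is RSC iff the "pair graph"
# built from the causality relation is acyclic.

def is_rsc(pairs, precedes):
    """pairs: list of (send_event, receive_event) tuples, one per message.
    precedes(a, b): True if event a causally precedes event b (assumed given).
    Returns True iff the execution has no crown, i.e., it is RSC."""
    n = len(pairs)
    # edge i -> j whenever send_i causally precedes receive_j (i != j)
    adj = [[j for j in range(n) if i != j and precedes(pairs[i][0], pairs[j][1])]
           for i in range(n)]

    WHITE, GREY, BLACK = 0, 1, 2           # standard DFS cycle detection
    colour = [WHITE] * n

    def has_cycle(u):
        colour[u] = GREY
        for v in adj[u]:
            if colour[v] == GREY:          # back edge: a crown exists
                return True
            if colour[v] == WHITE and has_cycle(v):
                return True
        colour[u] = BLACK
        return False

    return not any(colour[i] == WHITE and has_cycle(i) for i in range(n))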
Timestamp criterion for RSC execution
An execution (E, ≺) is RSC if and only if there exists a mapping from E to T (scalar timestamps)
such that
• for any message M, T(s(M)) = T(r(M));
• for each (a, b) in (E × E) \ T, a ≺ b ⇒ T(a) < T(b).
Synchronous programs on asynchronous systems


− A (valid) S-execution can be trivially realized on an asynchronous system by
scheduling the messages in the order in which they appear in the S-execution.
− The partial order of the S-execution remains unchanged but the communication
occurs on an asynchronous system that uses asynchronous communication primitives.
− Once a message send event is scheduled, the middleware layer waits for
acknowledgment; after the ack is received, the synchronous send primitive completes.

2.5 SYNCHRONOUS PROGRAM ORDER ON AN ASYNCHRONOUS SYSTEM

Non deterministic programs


In a deterministic program, the partial order of events produced by repeated runs of
the same program is always the same. But distributed programs often exhibit non-determinism:
• A receive call can receive a message from any sender who has sent a message, if the
expected sender is not specified.
• Multiple send and receive calls which are enabled at a process can be executed in an
interchangeable order.
• If i sends to j, and j sends to i concurrently using blocking synchronous calls, there
results a deadlock.
• There is no semantic dependency between the send and the immediately following
receive at each of the processes. If the receive call at one of the processes can be
scheduled before the send call, then there is no deadlock.

Rendezvous
Rendezvous systems are a form of synchronous communication among an arbitrary
number of asynchronous processes. All the processes involved meet with each other, i.e.,
communicate synchronously with each other at one time. Two types of rendezvous systems
are possible:
• Binary rendezvous: When two processes agree to synchronize.
• Multi-way rendezvous: When more than two processes agree to synchronize.

Features of binary rendezvous:


• For the receive command, the sender must be specified. However, multiple receive
commands can exist. A type check on the data is implicitly performed.

• Send and receive commands may be individually disabled or enabled. A command is


disabled if it is guarded and the guard evaluates to false. The guard would likely
contain an expression on some local variables.
• Synchronous communication is implemented by scheduling messages under the
covers using asynchronous communication.

• Scheduling involves pairing of matching send and receive commands that are both
enabled. The communication events for the control messages under the covers do not
alter the partial order of the execution.

Binary rendezvous algorithm (Bagrodia’s algorithm)


If multiple interactions are enabled, a process chooses one of them and tries to
synchronize with the partner process. The problem reduces to one of scheduling messages
satisfying the following constraints:
• Schedule on-line, atomically, and in a distributed manner.
• Schedule in a deadlock-free manner (i.e., crown-free).
• Schedule to satisfy the progress property in addition to the safety property.

Steps in Bagrodia algorithm


1. Receive commands are forever enabled from all processes.
2. A send command, once enabled, remains enabled until it completes, i.e., it is not
possible that a send command gets disabled before the send is executed.
3. To prevent deadlock, process identifiers are used to introduce asymmetry to break
potential crowns that arise.
4. Each process attempts to schedule only one send event at any time.
The message (M) types used are: M, ack(M), request(M), and permission(M). Execution
events in the synchronous execution are only the send of the message M and the receive of the
message M. The send and receive events for the other message types – ack(M), request(M),
and permission(M) – are control message events. The messages request(M), ack(M), and
permission(M) use M’s unique tag; the message M itself is not included in these messages.

(message types)

M, ack(M), request(M), permission(M)

(1) Pi wants to execute SEND(M) to a lower priority process Pj:

Pi executes send(M) and blocks until it receives ack(M) from Pj. The send event SEND(M)
now completes.

Any message M’ (from a higher priority process) and any request(M’) for
synchronization (from a lower priority process) received during the blocking period are
queued.

(2) Pi wants to execute SEND(M) to a higher priority process Pj:

(2a) Pi seeks permission from Pj by executing send(request(M)).

(2b) While Pi is waiting for permission, it remains unblocked.

(i) If a message M’ arrives from a higher priority process Pk, Pi accepts M’ by scheduling a
RECEIVE(M’) event and then executes send(ack(M’)) to Pk.

(ii) If a request(M’) arrives from a lower priority process Pk, Pi executes
send(permission(M’)) to Pk and blocks waiting for the message M’. When M’ arrives, the
RECEIVE(M’) event is executed.

(2c) When the permission(M) arrives, Pi knows partner Pj is synchronized and Pi executes
send(M). The SEND(M) now completes.

(3) request(M) arrival at Pi from a lower priority process Pj:

At the time a request(M) is processed by Pi, process Pi executes send(permission(M)) to Pj


and blocks waiting for the message M. When M arrives, the RECEIVE(M) event is executed
and the process unblocks.
(4) Message M arrival at Pi from a higher priority process Pj:

At the time a message M is processed by Pi, process Pi executes RECEIVE(M) (which is


assumed to be always enabled) and then send(ack(M)) to Pj .

(5) Processing when Pi is unblocked:

When Pi is unblocked, it dequeues the next (if any) message from the queue and processes it
as a message arrival (as per rules 3 or 4).

Fig 2.5: Bagrodia Algorithm
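The priority rule in steps (1) and (2) above boils down to a simple decision: send M directly and block for the ack when the partner has lower priority, otherwise request permission first and stay unblocked. The sketch below shows only that decision skeleton; message transport, blocking, and the queue handling of rule (5) are abstracted away, and the convention that a lower process identifier means higher priority is an assumption made purely for illustration.

# Sketch of the send-side decision in a Bagrodia-style scheme (illustrative).

def on_send_enabled(my_id, partner_id, send, block_for):
    """Decide how Pi starts SEND(M) to Pj, following rules (1) and (2) above."""
    if my_id < partner_id:
        # Rule (1): partner has lower priority - send M and block for ack(M).
        send("M", partner_id)
        block_for("ack(M)", partner_id)
    else:
        # Rule (2): partner has higher priority - ask for permission(M) first
        # and remain unblocked; M is sent only after permission(M) arrives.
        send("request(M)", partner_id)

# Tiny demonstration with print-based stand-ins for the messaging layer.
on_send_enabled(2, 5,
                send=lambda m, to: print(f"P2 sends {m} to P{to}"),
                block_for=lambda m, frm: print(f"P2 blocks for {m} from P{frm}"))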

2.6 GROUP COMMUNICATION


Group communication is done by broadcasting of messages. A message broadcast is
the sending of a message to all members in the distributed system. The communication may
be
• Multicast: A message is sent to a certain subset or a group.
• Unicasting: A point-to-point message communication.

The network layer protocol cannot provide the following functionalities:

▪ Application-specific ordering semantics on the order of delivery of messages.
▪ Adapting groups to dynamically changing membership.
▪ Sending multicasts to an arbitrary set of processes at each send event.
▪ Providing various fault-tolerance semantics.

The multicast algorithms can be of the open group or the closed group type.

2.7 CAUSAL ORDER (CO)


In the context of group communication, there are two modes of communication:
causal order and total order. Given a system with FIFO channels, causal order needs to be
explicitly enforced by a protocol. The following two criteria must be met by a causal
ordering protocol:
• Safety: In order to prevent causal order from being violated, a message M that
arrives at a process may need to be buffered until all system wide messages sent in the
causal past of the send (M) event to that same destination have already arrived. The
arrival of a message is transparent to the application process. The delivery event
corresponds to the receive event in the execution model.
• Liveness: A message that arrives at a process must eventually be delivered to the
process.

The Raynal–Schiper–Toueg algorithm


• Each message M should carry a log of all other messages sent causally before M’s
send event, and sent to the same destination dest(M) (a sketch of this log-based
scheme is given after this list).
• The Raynal–Schiper–Toueg algorithm is a canonical algorithm of this kind; several
other algorithms reduce the size of the local space and message space
overhead by various techniques.
• This log can then be examined to ensure whether it is safe to deliver a message.
• All algorithms aim to reduce this log overhead, and the space and time overhead of
maintaining the log information at the processes.
• To distribute this log information, broadcast and multicast communication is used.
• The hardware-assisted or network layer protocol assisted multicast cannot efficiently
provide features:
➢ Application-specific ordering semantics on the order of delivery of messages.
➢ Adapting groups to dynamically changing membership.
➢ Sending multicasts to an arbitrary set of processes at each send event.
➢ Providing various fault-tolerance semantics
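Returning to the log-based scheme above, the sketch below keeps at each process a matrix SENT (what the process knows about how many messages have been sent between every pair of processes) and an array DELIV (how many messages from each sender have been delivered locally); the SENT matrix is piggybacked on every message and plays the role of the log that decides whether delivery is safe. This follows the spirit of the Raynal–Schiper–Toueg approach rather than reproducing its exact pseudocode; the buffering, the in-memory transport, and all names are assumptions of the sketch.

# Sketch of matrix-based causal ordering (Raynal-Schiper-Toueg style).

import copy

class CausalProcess:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.sent = [[0] * n for _ in range(n)]   # SENT[k][l]: msgs Pk -> Pl
        self.deliv = [0] * n                      # DELIV[k]: delivered from Pk
        self.pending = []                         # buffered (sender, payload, ST)

    def send(self, dest, payload):
        msg = (self.pid, payload, copy.deepcopy(self.sent))
        self.sent[self.pid][dest] += 1            # record the send in the log
        return msg                                # caller hands msg to dest

    def receive(self, msg):
        self.pending.append(msg)
        return self._try_deliver()

    def _try_deliver(self):
        delivered, progress = [], True
        while progress:
            progress = False
            for msg in list(self.pending):
                sender, payload, st = msg
                # deliverable only if every message sent to us in the message's
                # causal past (per the piggybacked log) has been delivered
                if all(self.deliv[k] >= st[k][self.pid] for k in range(self.n)):
                    self.pending.remove(msg)
                    self.deliv[sender] += 1
                    for k in range(self.n):        # merge the piggybacked log
                        for l in range(self.n):
                            self.sent[k][l] = max(self.sent[k][l], st[k][l])
                    delivered.append(payload)
                    progress = True
        return delivered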

The Kshemkalyani–Singhal Optimal Algorithm


An optimal CO algorithm stores in local message logs, and propagates on messages,
information of the form “d is a destination of M” about a message M sent in the causal past, as
long as and only as long as:

Propagation Constraint I: it is not known that the message M is delivered to d.

Propagation Constraint II: it is not known that a message has been sent to d in the causal
future of Send(M), and hence it is not guaranteed using a reasoning based on transitivity that
the message M will be delivered to d in CO.

Fig : Conditions for causal ordering


The Propagation Constraints also imply that if either (I) or (II) is false, the information
“d ∈ M.Dests” must not be stored or propagated, even to remember that (I) or (II) has been
falsified:
▪ not in the causal future of Deliverd(Mi,a), and
▪ not in the causal future of ek,c, where d ∈ Mk,c.Dests and there is no
other message sent causally between Mi,a and Mk,c to the same
destination d.

Information about messages:


(i) not known to be delivered
(ii) not guaranteed to be delivered in CO, is explicitly tracked by the algorithm using (source,
timestamp, destination) information.
Information about messages already delivered and messages guaranteed to be delivered in
CO is implicitly tracked without storing or propagating it, and is derived from the explicit
information. The algorithm for the send and receive operations is given in Fig. 2.7 a) and b).
Procedure SND is executed atomically. Procedure RCV is executed atomically except for a
possible interruption in line 2a where a non-blocking wait is required to meet the Delivery
Condition.
Fig 2.7 a) Send algorithm by Kshemkalyani–Singhal to optimally implement causal
ordering

Fig b) Receive algorithm by Kshemkalyani–Singhal to optimally implement causal


ordering

The data structures maintained are sorted row–major and then column–major:

1. Explicit tracking:
▪ Tracking of (source, timestamp, destination) information for messages (i) not known to be
delivered and (ii) not guaranteed to be delivered in CO, is done explicitly using the
l.Dests field of entries in local logs at nodes and the o.Dests field of entries in messages.
▪ The sets li,a.Dests and oi,a.Dests contain explicit information about the destinations to which Mi,a is
not guaranteed to be delivered in CO and is not known to be delivered.
▪ The information about d ∈ Mi,a.Dests is propagated up to the earliest events on all causal
paths from (i, a) at which it is known that Mi,a is delivered to d or is guaranteed to be
delivered to d in CO.

2. Implicit tracking:
▪ Tracking of messages that are either (i) already delivered, or (ii) guaranteed to be
delivered in CO, is performed implicitly.
▪ The information about messages (i) already delivered or (ii) guaranteed to be
delivered in CO is deleted and not propagated because it is redundant as far as
enforcing CO is concerned.
▪ It is useful in determining what information that is being carried in other messages
and is being stored in logs at other nodes has become redundant and thus can be
purged.
▪ The semantics are implicitly stored and propagated. This information about messages
that are (i) already delivered or (ii) guaranteed to be delivered in CO is tracked
without explicitly storing it.
▪ The algorithm derives it from the existing explicit information about messages (i) not
known to be delivered and (ii) not guaranteed to be delivered in CO, by examining
only oi,aDests or li,aDests, which is a part of the explicit information.

Fig 2.8: Illustration of propagation constraints

Multicasts M5,1 and M4,1


Message M5,1 sent to processes P4 and P6 contains the piggybacked information M5,1.Dests = {P4, P6}.
Additionally, at the send event (5, 1), the information M5,1.Dests = {P4, P6}
is also inserted in the local log Log5. When M5,1 is delivered to P6, the (new) piggybacked
information P4 ∈ M5,1.Dests is stored in Log6 as M5,1.Dests = {P4}; the information about P6 ∈
M5,1.Dests, which was needed for routing, must not be stored in Log6 because of constraint I.
In the same way, when M5,1 is delivered to process P4
at event (4, 1), only the new piggybacked information P6 ∈ M5,1.Dests is inserted in Log4 as
M5,1.Dests = {P6}, which is later propagated during multicast M4,2.

Multicast M4,3
At event (4, 3), the information P6 ∈ M5,1.Dests in Log4 is propagated on multicast M4,3 only
to process P6 to ensure causal delivery using the Delivery Condition. The piggybacked
information on message M4,3 sent to process P3 must not contain this information because of
constraint II. As long as any future message sent to P6 is delivered in causal order w.r.t.
M4,3 sent to P6, it will also be delivered in causal order w.r.t. M5,1. And as M5,1 is already
delivered to P4, the information M5,1.Dests = ∅ is piggybacked on M4,3 sent to P3.
Similarly, the information P6 ∈ M5,1.Dests must be deleted from Log4 as it will no longer be
needed, because of constraint II. M5,1.Dests = ∅ is stored in Log4 to remember that M5,1 has
been delivered or is guaranteed to be delivered in causal order to all its destinations.
Learning implicit information at P2 and P3
When message M4,2 is received by processes P2 and P3, they insert the (new)
piggybacked information M5,1.Dests = {P6} in their local logs. They both
continue to store this in Log2 and Log3 and propagate this information on multicasts until
they learn, at events (2, 4) and (3, 2) on receipt of messages M3,3 and M4,3 respectively, that
any future message is expected to be delivered in causal order to process P6 w.r.t. M5,1 sent
to P6. Hence, by constraint II, this information must be deleted from Log2 and Log3. The
flow of events is given by:
• When M4,3 with piggybacked information M5,1.Dests = ∅ is received by P3 at (3, 2),
this is inferred to be valid current implicit information about multicast M5,1 because
the log Log3 already contains the explicit information P6 ∈ M5,1.Dests about that
multicast. Therefore, the explicit information in Log3 is inferred to be old and must be
deleted to achieve optimality. M5,1.Dests is set to ∅ in Log3.
• The logic by which P2 learns this implicit knowledge on the arrival of M3,3 is
identical.

Processing at P6
When message M5,1 is delivered to P6, only M5,1.Dests = {P4} is added to Log6. Further,
P6 propagates only M5,1.Dests = {P4} on message M6,2, and this conveys the current
implicit information that M5,1 has been delivered to P6 by its very absence in the explicit
information.
• When the information P6 ∈ M5,1.Dests arrives on M4,3, piggybacked as M5,1.Dests = {P6},
it is used only to ensure causal delivery of M4,3 using the Delivery
Condition, and is not inserted in Log6 (constraint I). Further, the presence of M5,1.Dests = {P4}
in Log6 implies the implicit information that M5,1 has already been
delivered to P6. Also, the absence of P4 in M5,1.Dests in the explicit
piggybacked information implies the implicit information that M5,1 has been
delivered or is guaranteed to be delivered in causal order to P4, and, therefore,
M5,1.Dests is set to ∅ in Log6.
• When the information P6 ∈ M5,1.Dests arrives on M5,2, piggybacked as M5,1.Dests = {P4, P6},
it is used only to ensure causal delivery of M5,2 using the Delivery
Condition, and is not inserted in Log6 because Log6 contains M5,1.Dests = ∅,
which gives the implicit information that M5,1 has been delivered or is
guaranteed to be delivered in causal order to both P4 and P6.

Processing at P1
• When M2,2 arrives carrying piggybacked information M5,1.Dests = {P6}, this (new)
information is inserted in Log1.
• When M6,2 arrives with piggybacked information M5,1.Dests = {P4}, P1 learns the
implicit information that M5,1 has been delivered to P6 by the very absence of the explicit
information P6 ∈ M5,1.Dests in the piggybacked information, and hence marks the
information P6 ∈ M5,1.Dests for deletion from Log1.
• The information P6 ∈ M5,1.Dests piggybacked on M2,3, which arrives at P1, is
inferred to be outdated using the implicit knowledge derived from M5,1.Dests = ∅
in Log1.
2.8 TOTAL ORDER

For each pair of processes Pi and Pj and for each pair of messages Mx and My that are delivered to
both the processes, Pi is delivered Mx before My if and only if Pj is delivered Mx before My.

Centralized Algorithm for total ordering

Each process sends the message it wants to broadcast to a centralized process, which
relays all the messages it receives to every other process over FIFO channels.

Complexity: Each message transmission takes two message hops and exactly n messages
in a system of n processes.

Drawbacks: A centralized algorithm has a single point of failure and congestion, and is
not an elegant solution.
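Despite these drawbacks, the scheme is easy to sketch. Below is a minimal illustration in which the channels are modelled as in-memory FIFO lists (an assumption of the sketch): every broadcast is funnelled through the sequencer, so all processes see the relayed messages in one common order.

# Sketch of centralized total ordering via a sequencer over FIFO channels.

class Sequencer:
    def __init__(self, member_queues):
        self.member_queues = member_queues        # pid -> FIFO list (channel)

    def broadcast(self, sender, payload):
        # relay in arrival order; FIFO channels preserve this order everywhere
        for q in self.member_queues.values():
            q.append((sender, payload))

queues = {pid: [] for pid in ("P1", "P2", "P3")}
seq = Sequencer(queues)
seq.broadcast("P1", "m1")
seq.broadcast("P3", "m2")
print(queues["P2"])    # [('P1', 'm1'), ('P3', 'm2')] - same order at every process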

Three phase distributed algorithm

Three phases can be seen in both sender and receiver side.

Sender

Phase 1
• In the first phase, a process multicasts the message M with a locally unique tag and
the local timestamp to the group members.

Phase 2
• The sender process awaits a reply from all the group members who respond with a
tentative proposal for a revised timestamp for that message M.
• The await call is non-blocking.

Phase 3
• The process multicasts the final timestamp to the group.
Fig) Sender side of three phase distributed algorithm

Receiver Side
Phase 1
• The receiver receives the message with a tentative timestamp. It updates the variable
priority that tracks the highest proposed timestamp, then revises the proposed
timestamp to the priority, and places the message with its tag and the revised
timestamp at the tail of the queue temp_Q. In the queue, the entry is marked as
undeliverable.

Phase 2
• The receiver sends the revised timestamp back to the sender. The receiver then waits
in a non-blocking manner for the final timestamp.

Phase 3
• The final timestamp is received from the multicaster. The corresponding
message entry in temp_Q is identified using the tag, and is marked as deliverable
after the revised timestamp is overwritten by the final timestamp.
• The queue is then resorted using the timestamp field of the entries as the key. As the
queue is already sorted except for the modified entry for the message under
consideration, that message entry has to be placed in its sorted position in the queue.
• If the message entry is at the head of the temp_Q, that entry, and all consecutive
subsequent entries that are also marked as deliverable, are dequeued from temp_Q,
and enqueued in deliver_Q.

Complexity
This algorithm uses three phases, and, to send a message to n − 1 processes, it uses 3(n – 1)
messages and incurs a delay of three message hops
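The receiver-side bookkeeping can be sketched as below. The priority update rule priority = max(priority + 1, proposed_ts) is chosen so that the sketch reproduces the worked example that follows; tie-breaking between equal timestamps (normally done with process identifiers) and the actual message transport are omitted, and all names are illustrative.

# Sketch of the receiver side of the three-phase total-ordering algorithm.

class TotalOrderReceiver:
    def __init__(self):
        self.priority = 0
        self.temp_q = []           # entries: [timestamp, tag, deliverable, msg]
        self.deliver_q = []

    def on_revise_ts(self, tag, msg, proposed_ts):
        # Phase 1: propose a timestamp above everything proposed or seen so far
        self.priority = max(self.priority + 1, proposed_ts)
        self.temp_q.append([self.priority, tag, False, msg])
        self.temp_q.sort(key=lambda e: e[0])
        return self.priority                     # Phase 2: PROPOSED_TS reply

    def on_final_ts(self, tag, final_ts):
        # Phase 3: overwrite with the final timestamp and mark deliverable
        for entry in self.temp_q:
            if entry[1] == tag:
                entry[0], entry[2] = final_ts, True
        self.temp_q.sort(key=lambda e: e[0])
        # move the deliverable prefix (head entries only) to deliver_Q
        while self.temp_q and self.temp_q[0][2]:
            self.deliver_q.append(self.temp_q.pop(0))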
Example
An example execution to illustrate the algorithm is given in the figure below. Here, A and B
multicast to a set of destinations, and C and D are the common destinations for both
multicasts.
Figure (a) The main sequence of steps is as follows:
1. A sends a REVISE_TS(7) message, having timestamp 7. B sends a REVISE_TS(9)
message, having timestamp 9.
2. C receives A’s REVISE_TS(7), enters the corresponding message in temp_Q, and marks
it as undeliverable; priority = 7. C then sends PROPOSED_TS(7) message to A
3. D receives B’s REVISE_TS(9), enters the corresponding message in temp_Q, and marks
it as undeliverable; priority = 9. D then sends PROPOSED_TS(9) message to B.
4. C receives B’s REVISE_TS(9), enters the corresponding message in temp_Q, and marks
it as undeliverable; priority = 9. C then sends PROPOSED_TS(9) message to B.
5. D receives A’s REVISE_TS(7), enters the corresponding message in temp_Q, and marks
it as undeliverable; priority = 10. D assigns a tentative timestamp value of 10, which is
greater than all of the timestamps on REVISE_TSs seen so far, and then sends a
PROPOSED_TS(10) message to A.
The state of the system is as shown in the figure

Fig) An example to illustrate the three-phase total ordering algorithm. (a) A snapshot for
PROPOSED_TS and REVISE_TS messages. The dashed lines show the further execution
after the snapshot. (b) The FINAL_TS messages in the example.
Figure (b) The continuing sequence of main steps is as follows:
6. When A receives PROPOSED_TS(7) from C and PROPOSED_TS(10) from D, it
computes the final timestamp as max(7, 10) = 10, and sends FINAL_TS(10) to C and D.
7. When B receives PROPOSED_TS(9) from C and PROPOSED_TS(9) from D, it
computes the final timestamp as max(9, 9) = 9, and sends FINAL_TS(9) to C and D.
8. C receives FINAL_TS(10) from A, updates the corresponding entry in temp_Q with the
timestamp, resorts the queue, and marks the message as deliverable. As the message is not
at the head of the queue, and some entry ahead of it is still undeliverable, the message is
not moved to delivery_Q.
9. D receives FINAL_TS(9) from B, updates the corresponding entry in temp_Q by
marking the corresponding message as deliverable, and resorts the queue. As the message
is at the head of the queue, it is moved to delivery_Q. This is the system snapshot shown in
Figure (b).
The following further steps will occur:
10. When C receives FINAL_TS(9) from B, it will update the corresponding entry in
temp_Q by marking the corresponding message as deliverable. As the message is at the
head of the queue, it is moved to the delivery_Q, and the next message (of A), which is
also deliverable, is also moved to the delivery_Q.
11. When D receives FINAL_TS(10) from A, it will update the corresponding entry in
temp_Q by marking the corresponding message as deliverable. As the message is at the
head of the queue, it is moved to the delivery_Q.

2.9 GLOBAL STATE AND SNAPSHOT RECORDING ALGORITHMS


• A distributed computing system consists of processes that do not share a common
memory and communicate asynchronously with each other by message passing.
• Each component of the system has a local state. The state of a process is its local memory and
a history of its activity.
• The state of a channel is characterized by the set of messages sent along the channel
less the messages received along the channel. The global state of a distributed system
is a collection of the local states of its components.
• If shared memory were available, an up-to-date state of the entire system would be
available to the processes sharing the memory.
• The absence of shared memory necessitates ways of getting a coherent and complete
view of the system based on the local states of individual processes.
• A meaningful global snapshot can be obtained if the components of the distributed
system record their local states at the same time.
• This would be possible if the local clocks at processes were perfectly synchronized or
if there were a global system clock that could be instantaneously read by the
processes.
• If processes read time from a single common clock, various indeterminate
transmission delays during the read operation will cause the processes to identify
various physical instants as the same time.

2.9.1 System Model


• The system consists of a collection of n processes, p1, p2, …, pn, that are
connected by channels.
• Let Cij denote the channel from process pi to process pj.
• Processes and channels have states associated with them.
• The state of a process at any time is defined by the contents of processor registers,
stacks, local memory, etc., and may be highly dependent on the local context of
the distributed application.
• The state of channel Cij, denoted by SCij, is given by the set of messages in transit
in the channel.
• The events that may happen are: internal event, send (send (mij)) and receive
(rec(mij)) events.
• The occurrences of events cause changes in the process state.
• A channel is a distributed entity and its state depends on the local states of the
processes on which it is incident.

• The transit function records the state of the channel Cij.


• In the FIFO model, each channel acts as a first-in first-out message queue and,
thus, message ordering is preserved by a channel.
• In the non-FIFO model, a channel acts like a set in which the sender process
adds messages and the receiver process removes messages from it in a random
order.

2.9.2 A consistent global state


The global state of a distributed system is a collection of the local states of the
processes and the channels. Denoting the local state of process pi by LSi and the state of
channel Cij by SCij, the global state is given by:

GS = {∪i LSi, ∪i,j SCij}

The two conditions for a consistent global state are:

C1: send(mij) ∈ LSi ⇒ mij ∈ SCij ⊕ rec(mij) ∈ LSj (where ⊕ is the exclusive-or operator).
C2: send(mij) ∉ LSi ⇒ mij ∉ SCij and rec(mij) ∉ LSj.

Condition C1 preserves the law of conservation of messages. Condition C2 states that in
the collected global state, for every effect, its cause must be present.

Law of conservation of messages: Every message mij that is recorded as sent in the local state of a
process pi must be captured in the state of the channel Cij or in the collected local state of the
receiver process pj.

➢ In a consistent global state, every message that is recorded as received is also recorded
as sent. Such a global state captures the notion of causality that a message cannot be
received if it was not sent.
➢ Consistent global states are meaningful global states and inconsistent global states are
not meaningful in the sense that a distributed system can never be in an inconsistent
state.

2.9.3 Interpretation of cuts


• Cuts in a space–time diagram provide a powerful graphical aid in representing and
reasoning about the global states of a computation. A cut is a line joining an arbitrary
point on each process line that slices the space–time diagram into a PAST and a
FUTURE.
• A consistent global state corresponds to a cut in which every message received in the
PAST of the cut has been sent in the PAST of that cut. Sucha cut is known as a
consistent cut.
• In a consistent snapshot, all the recorded local states of processes are concurrent; that
is, the recorded local state of no process causally affects the recorded local state of
any other process.

Issues in recording global state


The non-availability of a global clock in a distributed system raises the following issues:
Issue 1:
How to distinguish between the messages to be recorded in the snapshot from those
not to be recorded?
Answer:
• Any message that is sent by a process before recording its snapshot must
be recorded in the global snapshot (from C1).
• Any message that is sent by a process after recording its snapshot must not
be recorded in the global snapshot (from C2).

Issue 2:
How to determine the instant when a process takes its snapshot?
Answer:
A process pj must record its snapshot before processing a message mij that was sent by process pi after
recording its snapshot.

2.9.4 SNAPSHOT ALGORITHMS FOR FIFO CHANNELS


Each distributed application has number of processes running on different physical
servers. These processes communicate with each other through messaging channels.

A snapshot captures the local states of each process along with the state of each communication channel.

Snapshots are required for:


• Checkpointing
• Collecting garbage
• Detecting deadlocks
• Debugging

Chandy–Lamport algorithm
• The algorithm records a global snapshot consisting of the local state of each process and the state of each channel.
• The Chandy-Lamport algorithm uses a control message, called a marker.
• After a site has recorded its snapshot, it sends a marker along all of its outgoing
channels before sending out any more messages.
• Since channels are FIFO, a marker separates the messages in the channel into those to
be included in the snapshot from those not to be recorded in the snapshot.

• This addresses Issue 1. The role of markers in a FIFO system is to act as delimiters
for the messages in the channels so that the channel state recorded by the process
at the receiving end of the channel satisfies condition C2.

Fig 2.10: Chandy–Lamport algorithm

Initiating a snapshot
• Process Pi initiates the snapshot
• Pi records its own state and prepares a special marker message.
• Send the marker message to all other processes.
• Start recording all incoming messages from channels Cij for j not equal to i.

Propagating a snapshot
• For each process Pj, consider a marker received on channel Ckj.

• If the marker message is seen for the first time:

− Pj records its own state and marks Ckj as empty.
− Pj sends the marker message out on all of its outgoing channels.
− Pj starts recording all incoming messages arriving on channels Clj, for l not equal to j or k.
• Else (Pj has already recorded its state), the state of channel Ckj is recorded as the set of
messages received on Ckj since Pj recorded its state and before the marker arrived.

Terminating a snapshot
• Every process has received a marker and recorded its own state.
• Every process has received a marker on all of its N − 1 incoming channels and recorded the channel states.
• A central server can gather the partial state to build a global snapshot.
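A minimal sketch of the marker handling at one process is given below, assuming FIFO input channels identified by arbitrary ids. Recording the application state and sending markers on the outgoing channels are injected as callbacks, since they depend on the application; all names are illustrative.

# Sketch of per-process marker handling in a Chandy-Lamport style snapshot.

class SnapshotProcess:
    def __init__(self, in_channels, record_local_state, send_marker_on_all):
        self.in_channels = set(in_channels)
        self.record_local_state = record_local_state
        self.send_marker_on_all = send_marker_on_all
        self.recorded = False
        self.recording_from = set()            # channels still being recorded
        self.channel_state = {c: [] for c in in_channels}

    def initiate(self):
        self._record()                         # an initiator records spontaneously

    def on_marker(self, channel):
        if not self.recorded:
            self._record()                     # first marker: snapshot now,
            self.channel_state[channel] = []   # and record this channel as empty
        self.recording_from.discard(channel)   # stop recording this channel

    def on_message(self, channel, msg):
        if channel in self.recording_from:     # in-transit message belongs
            self.channel_state[channel].append(msg)   # to the channel state

    def done(self):
        return self.recorded and not self.recording_from

    def _record(self):
        self.recorded = True
        self.record_local_state()
        self.send_marker_on_all()              # before any further messages
        self.recording_from = set(self.in_channels)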

Correctness of the algorithm


• Since a process records its snapshot when it receives the first marker on any
incoming channel, no messages that follow markers on the channels incoming to it are
recorded in the process’s snapshot.
• A process stops recording the state of an incoming channel when a marker is received
on that channel.
• Due to FIFO property of channels, it follows that no message sent after the marker on that
channel is recorded in the channel state. Thus, condition C2 is satisfied.
• When a process pj receives message mij that precedes the marker on channel Cij, it acts
as follows: if process pj has not taken its snapshot yet, then it includes mij in its recorded
snapshot. Otherwise, it records mij in the state of the channel Cij. Thus, condition C1
is satisfied.

Complexity
The recording part of a single instance of the algorithm requires O(e) messages
and O(d) time, where e is the number of edges in the network and d is the diameter of
the network.

Properties of the recorded global state

Fig) Timing diagram of two possible executions of the banking examples


1. (Markers shown using dashed-and-dotted arrows.) Let site S1 initiate the algorithm just after t1.
Site S1 records its local state (account A=$550) and sends a marker to site S2. The marker is
received by site S2 after t4. When site S2 receives the marker, it records its local state
(account B=$170), the state of channel C12 as $0, and sends a marker along channel C21.
When site S1 receives this marker, it records the state of channel C21 as $80. The $800 amount
in the system is conserved in the recorded global state,
A=$550 B=$170 C12 =$0 C21 =$80

2. (Markers shown using dotted arrows.) Let site S1 initiate the algorithm just after t0 and before
sending the $50 to S2. Site S1 records its local state (account A = $600) and sends a marker to
S2. The marker is received by site S2 between t2 and t3. When site S2 receives the marker, it
records its local state (account B = $120), the state of channel C12 as $0, and sends a marker
along channel C21. When site S1 receives this marker, it records the state of channel C21 as $80.
The $800 amount in the system is conserved in the recorded global state,
A=$600 B=$120 C12 =$0 C21 =$80

The recorded global state may not correspond to any of the global states that occurred
during the computation.
This happens because a process can change its state asynchronously before the markers it
sent are received by other sites and the other sites record their states.
But the system could have passed through the recorded global states in some equivalent
executions.
The recorded global state is a valid state in an equivalent execution and if a stable property
(i.e., a property that persists) holds in the system before the snapshot algorithm begins, it holds in
the recorded global snapshot.
Therefore, a recorded global state is useful in detecting stable properties.

UNIT III

DISTRIBUTED MUTEX AND DEADLOCK



Distributed mutual exclusion algorithms: Introduction – Preliminaries – Lamport‘s algorithm
–Ricart-Agrawala algorithm – Token-Based algorithms – Suzuki Kasami‘s broadcast
algorithm; Deadlock detection in distributed systems: Introduction – System model –
Preliminaries –Models of deadlocks – Chandy-Misra-Haas Algorithms for the AND model
and OR model.

3.1 DISTRIBUTED MUTUAL EXCLUSION ALGORITHMS


• Mutual exclusion is a concurrency control property which is introduced to prevent
race conditions.
• It is the requirement that a process cannot access a shared resource while another
concurrent process is currently using or executing on the same resource.

Mutual exclusion in a distributed system states that only one process is allowed to execute the
critical section (CS) at any given time.

• Message passing is the sole means for implementing distributed mutual exclusion.
There are three basic approaches for implementing distributed mutual exclusion:
1. Token-based approach:
− A unique token (also known as the privilege message) is shared among the sites.
− A site is allowed to enter its CS if it possesses the token.
− Mutual Exclusion is ensured because the token is unique.
− Eg: Suzuki-Kasami’s Broadcast Algorithm, Raymond’s Tree-Based Algorithm,
etc.
2. Non-token-based approach:
− Two or more successive rounds of messages are exchanged among the sites to
determine which site will enter the CS next.
− Eg: Lamport's algorithm, Ricart–Agrawala algorithm

3. Quorum-based approach:

− Each site requests permission to execute the CS from a subset of sites


(called a quorum)
− Any two subsets of sites or Quorum contains a common site.
− This common site is responsible for making sure that only one request executes the
CS at any time.
− Eg: Maekawa’s Algorithm

3.1.1 Preliminaries
• The system consists of N sites, S1, S2, S3, …, SN.
• Assume that a single process is running on each site.
• The process at site Si is denoted by pi.
• All these processes communicate asynchronously over an underlying
communication network.
• A site can be in one of the following three states: requesting the CS, executing the CS,
or neither requesting nor executing the CS.
• In the requesting the CS state, the site is blocked and cannot make further requests for
the CS.
• In the idle state, the site is executing outside the CS.
• In the token-based algorithms, a site can also be in a state where a site holding the
token is executing outside the CS. Such state is referred to as the idle token state.
• At any instant, a site may have several pending requests for CS. A site queues up
these requests and serves them one at a time.
• N denotes the number of processes or sites involved in invoking the critical section, T
denotes the average message delay, and E denotes the average critical section
execution time.

3.1.2 Requirements of mutual exclusion algorithms


• Safety property:

At any instant, only one process can execute the critical section. This is an
essential property of a mutual exclusion algorithm.
• Liveness property:
This property states the absence of deadlock and starvation. Two or more sites
should not endlessly wait for messages that will never arrive. This is an
important property of a mutual exclusion algorithm.
• Fairness:
Fairness in the context of mutual exclusion means that each process gets a fair
chance to execute the CS. In mutual exclusion algorithms, the fairness property
generally means that the CS execution requests are executed in order of their arrival in
the system.

3.1.3 Performance metrics

➢ Message complexity: This is the number of messages that are required per CS
execution by a site.
➢ Synchronization delay: After a site leaves the CS, this is the time required before
the next site enters the CS.
➢ Response time: This is the time interval a request waits for its CS execution to be
over after its request messages have been sent out. Thus, response time does not
include the time a request waits at a site before its request messages have been sent
out.
➢ System throughput: This is the rate at which the system executes requests for the
CS. If SD is the synchronization delay and E is the average critical section execution
time, then the system throughput is given by 1/(SD + E).

Figure: Synchronization delay

Figure: Response Time


Low and High Load Performance:
▪ The performance of mutual exclusion algorithms is studied under two special loading
conditions, viz., “low load” and “high load”.
▪ The load is determined by the arrival rate of CS execution requests.
▪ Under low load conditions, there is seldom more than one request for the critical
section present in the system simultaneously.
▪ Under heavy load conditions, there is always a pending request for critical section at a
site.

Best and worst case performance


▪ In the best case, prevailing conditions are such that a performance metric attains the
best possible value. For example, the best value of the response time is a roundtrip
message delay plus the CS execution time, 2T +E.

▪ For example, the best and worst values of the response time are achieved when the load
is, respectively, low and high.
▪ The best and the worst message traffic is generated at low and heavy load conditions,
respectively.
3.2 LAMPORT’S ALGORITHM
• Requests for the CS are executed in the increasing order of timestamps, and time is
determined by logical clocks.
• Every site Si keeps a queue, request_queuei, which contains mutual exclusion requests
ordered by their timestamps.
• This algorithm requires communication channels to deliver messages in FIFO
order. Three types of messages are used: REQUEST, REPLY and RELEASE. These messages,
carrying timestamps, also update the logical clocks.

Fig: Lamport’s distributed mutual exclusion algorithm


To enter Critical section:
When a site Si wants to enter the critical section, it sends a request message
REQUEST(tsi, i) to all other sites and places the request on request_queuei. Here, tsi
denotes the timestamp of site Si's request.
When a site Sj receives the request message REQUEST(tsi, i) from site Si, it returns a
timestamped REPLY message to site Si and places the request of site Si on
request_queuej.

To execute the critical section:


• A site Si can enter the critical section when the following two conditions hold:
L1: Si has received a message with timestamp larger than (tsi, i) from all other sites.
L2: Si's own request is at the top of request_queuei.

To release the critical section:


When a site Si exits the critical section, it removes its own request from the top of its request
queue and sends a timestamped RELEASE message to all other sites. When a site Sj receives the
timestamped RELEASE message from site Si, it removes the request of Si from its request
queue.
Fig) S1 and S2 are making requests for the CS

Fig) Site S1 enters the CS

Fig) Site S1 exists the CS and sends RELEASE messages


Fig) Site S2 enters the CS
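The per-site bookkeeping implied by the three rules above (requesting, checking conditions L1 and L2, releasing) can be sketched as follows. Message sending and receiving is abstracted away: the caller feeds incoming timestamped messages into the object, and the returned lists only indicate which REQUEST/RELEASE messages would be sent. All names are illustrative, not a definitive implementation.

# Sketch of Lamport's mutual exclusion bookkeeping at one site.

import heapq

class LamportMutexSite:
    def __init__(self, site_id, all_sites):
        self.id = site_id
        self.others = [s for s in all_sites if s != site_id]
        self.request_queue = []                  # heap of (timestamp, site)
        self.latest_ts = {s: 0 for s in self.others}
        self.own_request = None

    def request_cs(self, ts):
        self.own_request = (ts, self.id)
        heapq.heappush(self.request_queue, self.own_request)
        return [("REQUEST", ts, self.id, dest) for dest in self.others]

    def on_request(self, ts, site):
        heapq.heappush(self.request_queue, (ts, site))   # caller also sends REPLY

    def on_message(self, ts, site):
        # any timestamped message (REPLY, RELEASE, ...) updates latest_ts
        self.latest_ts[site] = max(self.latest_ts[site], ts)

    def on_release(self, ts, site):
        self.on_message(ts, site)
        self.request_queue = [r for r in self.request_queue if r[1] != site]
        heapq.heapify(self.request_queue)

    def can_enter_cs(self):
        # L1: later-timestamped message from every other site; L2: own request at head
        return (self.own_request is not None
                and self.request_queue[0] == self.own_request
                and all(self.latest_ts[s] > self.own_request[0]
                        for s in self.others))

    def release_cs(self, ts):
        self.request_queue = [r for r in self.request_queue
                              if r != self.own_request]
        heapq.heapify(self.request_queue)
        self.own_request = None
        return [("RELEASE", ts, self.id, dest) for dest in self.others]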

Correctness
Theorem: Lamport’s algorithm achieves mutual exclusion.
Proof: Proof is by contradiction.
Suppose two sites Si and Sj are executing the CS concurrently. For this to happen
conditions L1 and L2 must hold at both the sites concurrently.
This implies that at some instant in time, say t, both Si and Sj have their own requests
at the top of their request queues and condition L1 holds at them. Without loss of
generality, assume that Si ’s request has smaller timestamp than the request of Sj .
From condition L1 and FIFO property of the communication channels, it is clear that
at instant t the request of Si must be present in request queuej when Sj was executing
its CS. This implies that Sj ’s own request is at the top of its own request queue
when a smaller timestamp request, Si ’s request, is present in the request queuej – a
contradiction!

Theorem: Lamport’s algorithm is fair.


Proof: The proof is by contradiction.
Suppose a site Si ’s request has a smaller timestamp than the request of another site Sj
and Sj is able to execute the CS before Si .
For Sj to execute the CS, it has to satisfy the conditions L1 and L2. This implies
that at some instant in time say t, Sj has its own request at the top of its queue and it
has also received a message with timestamp larger than the timestamp of its request
from all other sites.
But request queue at a site is ordered by timestamp, and according to our assumption
Si has lower timestamp. So Si ’s request must be placed ahead of the Sj ’s request in
the request queuej . This is a contradiction!

Message Complexity:
Lamport’s Algorithm requires invocation of 3(N – 1) messages per critical section execution.
These 3(N – 1) messages involve:
• (N – 1) request messages
• (N – 1) reply messages
• (N – 1) release messages
Drawbacks of Lamport’s Algorithm:
• Unreliable approach: failure of any one of the processes will halt the progress
ofentire system.
• High message complexity: Algorithm requires 3(N-1) messages per critical
sectioninvocation.

To enter Critical section:


• When a site Si wants to enter the critical section, it sends a timestamped REQUEST
message to all other sites.
• When a site Sj receives a REQUEST message from site Si, it sends a REPLY
message to site Si if and only if site Sj is neither requesting nor currently executing
the critical section,

• or, in case site Sj is requesting, the timestamp of site Si's request is smaller than its
own request's timestamp.
• Otherwise, the request is deferred by site Sj.

To execute the critical section:


Site Si enters the critical section if it has received the REPLY message from all other
sites.

To release the critical section:

Upon exiting, site Si sends a REPLY message to all the deferred requests.

Performance:
Synchronization delay is equal to the maximum message transmission time. It requires 3(N –
1) messages per CS execution. The algorithm can be optimized to 2(N – 1) messages by
omitting the REPLY message in some situations.

3.3 RICART–AGRAWALA ALGORITHM


• The Ricart–Agrawala algorithm is an algorithm for mutual exclusion in a
distributed system proposed by Glenn Ricart and Ashok Agrawala.
• This algorithm is an extension and optimization of Lamport’s distributed
mutual exclusion algorithm.
• It follows a permission-based approach to ensure mutual exclusion.
• Two types of messages (REQUEST and REPLY) are used, and communication
channels are assumed to follow FIFO order.

• A site sends a REQUEST message to all other sites to get their permission to
enter the critical section.
• A site sends a REPLY message to another site to give its permission to enter the
critical section.
• A timestamp is given to each critical section request using Lamport’s logical clock.
• The timestamp is used to determine the priority of critical section requests: a
smaller timestamp gets higher priority than a larger timestamp.
• The execution of critical section requests is always in the order of their timestamps
(a sketch of the request/reply handling is given below).
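A minimal sketch of the deferral logic at one site, under the rules above: a REQUEST is answered immediately unless this site is in the CS, or has an older (higher priority) pending request, in which case the REPLY is deferred until the CS is released. Message transport is abstracted away and all names are illustrative.

# Sketch of per-site REQUEST/REPLY handling in a Ricart-Agrawala style scheme.

class RicartAgrawalaSite:
    def __init__(self, site_id):
        self.id = site_id
        self.requesting = False
        self.in_cs = False
        self.my_ts = None
        self.deferred = []                       # sites whose REPLY we owe

    def request_cs(self, ts):
        self.requesting, self.my_ts = True, ts   # caller sends REQUEST to all

    def on_request(self, ts, site):
        """Return True if a REPLY should be sent now, else defer it."""
        i_have_priority = (self.in_cs
                           or (self.requesting
                               and (self.my_ts, self.id) < (ts, site)))
        if i_have_priority:
            self.deferred.append(site)           # defer until release_cs()
            return False
        return True

    def release_cs(self):
        self.in_cs, self.requesting, self.my_ts = False, False, None
        deferred, self.deferred = self.deferred, []
        return deferred                          # send REPLY to each of these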

Fig: Ricart–Agrawala algorithm

Fig) site S1 and S2 each make a request for the CS


Fig) site S1 enters the CS

Fig) Site S1 exists the CS and sends a reply message to S2’s deferred
request

Fig) Site S2enters the CS


Theorem: Ricart-Agrawala algorithm achieves mutual exclusion.
Proof: Proof is by contradiction.
▪ Suppose two sites Si and Sj are executing the CS concurrently and Si’s request has
higher priority than the request of Sj. Clearly, Si received Sj’s request after it had
made its own request.
▪ Thus, Sj can concurrently execute the CS with Si only if Si returns a REPLY to Sj (in
response to Sj’s request) before Si exits the CS.
▪ However, this is impossible because Sj’s request has lower priority.
Therefore, the Ricart–Agrawala algorithm achieves mutual exclusion.

Message Complexity:
Ricart–Agrawala algorithm requires invocation of 2(N – 1) messages per critical section
execution. These 2(N – 1) messages involve:
• (N – 1) request messages
• (N – 1) reply messages

Drawbacks of Ricart–Agrawala algorithm:


• Unreliable approach: failure of any one node in the system can halt the progress
of the system. In this situation, the process will starve forever. The problem of
node failure can be solved by detecting the failure after some timeout.

Performance:
Synchronization delay is equal to the maximum message transmission time. It requires
2(N – 1) messages per critical section execution.

3.4 SUZUKI–KASAMI‘s BROADCAST ALGORITHM


• The Suzuki–Kasami algorithm is a token-based algorithm for achieving mutual
exclusion in distributed systems.
• This is a modification of the Ricart–Agrawala algorithm, a permission-based (non-
token-based) algorithm which uses REQUEST and REPLY messages to ensure
mutual exclusion.
• In token-based algorithms, a site is allowed to enter its critical section if it
possesses the unique token.
• Non-token-based algorithms use timestamps to order requests for the critical
section, whereas a sequence number is used in token-based algorithms.
• Each request for the critical section contains a sequence number. This sequence
number is used to distinguish old and current requests.
To enter the critical section:
• When a site Si wants to enter the critical section and it does not have the token, it
increments its sequence number RNi[i] and sends a request message REQUEST(i,
sn) to all other sites in order to request the token.
• Here sn is the updated value of RNi[i].
• When a site Sj receives the request message REQUEST(i, sn) from site Si, it
sets RNj[i] to the maximum of RNj[i] and sn, i.e., RNj[i] = max(RNj[i], sn).
After updating RNj[i], site Sj sends the token to site Si if it has the idle token and RNj[i]
= LN[i] + 1.

Fig: Suzuki–Kasami‘s broadcast algorithm

To execute the critical section:


• Site Si executes the critical section if it has acquired the token.

To release the critical section:


After finishing the execution, site Si exits the critical section and does the following:
• It sets LN[i] = RNi[i] to indicate that its critical section request RNi[i] has been executed.
• For every site Sj whose ID is not present in the token queue Q, it appends Sj's ID to Q
if RNi[j] = LN[j] + 1, to indicate that site Sj has an outstanding request.
• After the above update, if the queue Q is non-empty, it pops a site ID from Q
and sends the token to the site indicated by the popped ID.
• If the queue Q is empty, it keeps the token (a sketch of this bookkeeping is given below).
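A minimal sketch of the RN/LN/Q bookkeeping described above is shown below. One site is assumed to start out holding the token; the sending of REQUEST and token messages, and the check that the held token is currently idle, are left to the caller. All names are illustrative.

# Sketch of Suzuki-Kasami style data structures at one site.

class SuzukiKasamiSite:
    def __init__(self, site_id, n):
        self.id, self.n = site_id, n
        self.rn = [0] * n                        # RNi: highest request numbers seen
        self.token = None                        # e.g. {'LN': [0]*n, 'Q': []} at one site

    def request_cs(self):
        self.rn[self.id] += 1
        return self.rn[self.id]                  # broadcast REQUEST(id, sn) with this sn

    def on_request(self, site, sn):
        self.rn[site] = max(self.rn[site], sn)
        # surrender the token only if held (and, in the full algorithm, idle)
        if self.token and self.rn[site] == self.token['LN'][site] + 1:
            token, self.token = self.token, None
            return token                         # caller sends the token to `site`
        return None

    def release_cs(self):
        tok = self.token
        tok['LN'][self.id] = self.rn[self.id]    # this request is now served
        for j in range(self.n):                  # enqueue outstanding requests
            if j != self.id and j not in tok['Q'] and self.rn[j] == tok['LN'][j] + 1:
                tok['Q'].append(j)
        if tok['Q']:
            next_site = tok['Q'].pop(0)
            self.token = None
            return next_site, tok                # caller sends token to next_site
        return None                              # keep the idle token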

Correctness
Mutual exclusion is guaranteed because there is only one token in the system and a site holds
the token during the CS execution.
Theorem: A requesting site enters the CS in finite time.
Proof: Token request messages of a site Si reach other sites in finite time.
Since one of these sites will have token in finite time, site Si ’s request will be placed in the
token queue in finite time.
Since there can be at most N − 1 requests in front of this request in the token queue, site Si
will get the token and execute the CS in finite time.

Message Complexity:
The algorithm requires no messages if the site already holds the idle token at the
time of its critical section request, and a maximum of N messages per critical section execution
otherwise. These N messages involve:
• (N – 1) request messages
• 1 reply message

Drawbacks of Suzuki–Kasami Algorithm:


• Non-symmetric algorithm: A site retains the token even if it has not
requested the critical section.

Performance:
Synchronization delay is 0 and no message is needed if the site holds the idle token at the
time of its request. If the site does not hold the idle token, the maximum
synchronization delay is equal to the maximum message transmission time, and a maximum of
N messages are required per critical section invocation.

3.5 DEADLOCK DETECTION IN DISTRIBUTED SYSTEMS


Deadlock can neither be prevented nor avoided in distributed system as the system is
so vast that it is impossible to do so. Therefore, only deadlock detection can be
implemented. The techniques of deadlock detection in the distributed system require the
following:
• Progress: The method should be able to detect all the deadlocks in the system.
• Safety: The method should not detect false or phantom deadlocks.

There are three approaches to detect deadlocks in distributed systems.


Centralized approach:
• Here a single designated node is responsible for detecting deadlocks.
• The advantage of this approach is that it is simple and easy to implement, while the
drawbacks include excessive workload at one node and a single point of failure, which in
turn makes the system less reliable.

Distributed approach:
• In the distributed approach, different nodes work together to detect deadlocks.
There is no single point of failure, as the workload is equally divided among all nodes.
• The speed of deadlock detection also increases.
Hierarchical approach:
• This approach is the most advantageous approach.
• It is the combination of both centralized and distributed approaches of
deadlockdetection in a distributed system.
• In this approach, some selected nodes or clusters of nodes are responsible for
deadlock detection, and these selected nodes are controlled by a single node.

System Model

• A distributed program is composed of a set of n asynchronous processes p1, p2, …,
pi, …, pn that communicate by message passing over the communication
network.
• Without loss of generality we assume that each process is running on a different
processor.
• The processors do not share a common global memory and communicate solely
by passing messages over the communication network.
• There is no physical global clock in the system to which processes have
instantaneous access.
• The communication medium may deliver messages out of order; messages may
be lost, garbled, or duplicated due to timeout and retransmission; processors may
fail; and communication links may go down.
We make the following assumptions:
• The systems have only reusable resources.
• Processes are allowed to make only exclusive access to resources.
• There is only one copy of each resource.
• A process can be in two states: running or blocked.
• In the running state (also called active state), a process has all the needed
resources and is either executing or is ready for execution.
• In the blocked state, a process is waiting to acquire some resource.
Wait-for graph (WFG)
This is used for deadlock detection. A graph is drawn based on the requests and
acquisition of resources: a node represents a process, and a directed edge from one process
to another indicates that the first is waiting for a resource held by the second. If the graph
has a closed loop or cycle, then there is a deadlock (a sketch of cycle detection on such a
graph is given below).
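The following small sketch detects a deadlock on such a wait-for graph, represented here as a plain adjacency dictionary (an assumption of the sketch): processes that are not blocked are repeatedly removed, and whatever cannot be removed lies on a cycle, i.e., is deadlocked. This is only an illustration of WFG-based detection, not any particular distributed algorithm.

# Sketch: detect a cycle in a wait-for graph by iterative elimination.

def has_deadlock(wfg):
    """wfg: dict mapping process -> list of processes it is waiting for."""
    # keep only wait edges to processes that themselves appear in the graph
    remaining = {p: set(q for q in qs if q in wfg) for p, qs in wfg.items()}
    changed = True
    while changed:
        changed = False
        # remove every process that is not waiting on anything still remaining
        for p in [p for p, qs in remaining.items() if not qs]:
            del remaining[p]
            for qs in remaining.values():
                qs.discard(p)
            changed = True
    return bool(remaining)          # anything left is on a cycle -> deadlock

# P1 waits for P2, P2 waits for P3, P3 waits for P1: a cycle, hence deadlock.
print(has_deadlock({"P1": ["P2"], "P2": ["P3"], "P3": ["P1"]}))   # True
print(has_deadlock({"P1": ["P2"], "P2": ["P3"], "P3": []}))       # False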
Preliminaries

3.6.1 Deadlock Handling Strategies


Handling of deadlock becomes highly complicated in distributed systems because
no site has accurate knowledge of the current state of the system and because every inter-
site communication involves a finite and unpredictable delay. There are three strategies for
handling deadlocks:
• Deadlock prevention:
− This is achieved either by having a process acquire all the needed resources
simultaneously before it begins executing, or by preempting a process
which holds the needed resource.
− This approach is highly inefficient and impractical in distributed systems.
• Deadlock avoidance:
− A resource is granted to a process if the resulting global system state is
safe. This is impractical in distributed systems.
• Deadlock detection:
− This requires examination of the status of process–resource interactions
for the presence of a cyclic wait.
− Deadlock detection in distributed systems seems to be the best approach
to handle deadlocks in distributed systems.

3.6.2 Issues in deadlock Detection


Deadlock handling faces two major issues:
1. Detection of existing deadlocks
2. Resolution of detected deadlocks

Deadlock Detection

− Detection of deadlocks involves addressing two issues, namely maintenance of
the WFG and searching the WFG for the presence of cycles or knots.
− In distributed systems, a cycle or knot may involve several sites, so the search for
cycles greatly depends upon how the WFG of the system is represented across the
system.
− Depending upon the way WFG information is maintained and the search for cycles is
carried out, there are centralized, distributed, and hierarchical algorithms for
deadlock detection in distributed systems.

Correctness criteria
A deadlock detection algorithm must satisfy the following two conditions:
1. Progress-No undetected deadlocks:
The algorithm must detect all existing deadlocks in finite time. In other words, after
all wait-for dependencies for a deadlock have formed, the algorithm should not wait for any
more events to occur to detect the deadlock.
2. Safety – No false deadlocks:
The algorithm should not report deadlocks that do not exist. These are also called
phantom or false deadlocks.

Resolution of a Detected Deadlock


• Deadlock resolution involves breaking existing wait-for dependencies between
the processes to resolve the deadlock.
• It involves rolling back one or more deadlocked processes and assigning
their resources to blocked processes so that they can resume execution.
• The deadlock detection algorithms propagate information regarding wait-for
dependencies along the edges of the wait-for graph.
• When a wait-for dependency is broken, the corresponding information should
be immediately cleaned from the system.
• If this information is not cleaned in a timely manner, it may result in detection
of phantom deadlocks.

3.7 MODELS OF DEADLOCKS


The models of deadlocks are explained based on their hierarchy. The diagrams illustrate
the working of the deadlock models. Pa, Pb, Pc, Pd are passive processes that have already
acquired resources. Pe is an active process that is requesting a resource.

3.7.1 Single Resource Model


• A process can have at most one outstanding request for only one unit of a resource.
• Since the maximum out-degree of a node in the WFG for the single resource model is 1,
the presence of a cycle in the WFG indicates that there is a deadlock.

Fig: Deadlock in single resource model

3.7.2 AND Model


• In the AND model, a passive process becomes active (i.e., its activation condition is
fulfilled) only after a message from each process in its dependent set has arrived.
• In the AND model, a process can request more than one resource simultaneously and
the request is satisfied only after all the requested resources are granted to the process.
• The requested resources may exist at different locations.
• The out-degree of a node in the WFG for the AND model can be more than 1.
• The presence of a cycle in the WFG indicates a deadlock in the AND model.
• Each node of the WFG in such a model is called an AND node.
• In the AND model, if a cycle is detected in the WFG, it implies a deadlock, but not
vice versa. That is, even if a process is not part of a cycle, it can still be deadlocked
(for example, when it is waiting on a process that is part of a cycle).

Fig: Deadlock in AND model


3.7.3 OR Model

• A process can make a request for numerous resources simultaneously and the
request is satisfied if any one of the requested resources is granted.
• Presence of a cycle in the WFG of an OR model does not imply a
deadlock in the OR model.
• In the OR model, the presence of a knot indicates a deadlock.

Deadlock in OR model: a process Pi is blocked if it has a pending OR request to be satisfied.

• With every blocked process, there is an associated set of processes called the
dependent set.
• A process shall move from an idle to an active state on receiving a grant
message from any of the processes in its dependent set.
• A process is permanently blocked if it never receives a grant message from any of
the processes in its dependent set.
• A set of processes S is deadlocked if all the processes in S are permanently blocked.
• In short, a process is deadlocked or permanently blocked if the following
conditions are met:
1. Each of the processes in the set S is blocked.
2. The dependent set for each process in S is a subset of S.
3. No grant message is in transit between any two processes in set S.
• A blocked process P in the set S becomes active only after receiving a grant
message from a process in its dependent set, which is a subset of S.
Fig: OR Model

3.7.4 The p-out-of-q Model

• This is a variation of the AND-OR model.
• It allows a request to obtain any k available resources from a pool of n
resources. Both models have the same expressive power.
• This favours a more compact formulation of a request.
• Every request in this model can be expressed in the AND-OR model and vice versa.
• Note that an AND request for p resources can be stated as a p-out-of-p request, and an
OR request for p resources can be stated as a 1-out-of-p request.

Fig: p out of q Model

3.7.5 Unrestricted model


• No assumptions are made regarding the underlying structure of resource requests.
• In this model, only one assumption is made, namely that the deadlock is stable, and hence
it is the most general model.
• This model helps separate concerns: concerns about properties of the problem
(stability and deadlock) are separated from concerns about the underlying distributed
system computation (e.g., message passing versus synchronous communication).
3.8 CHANDY–MISRA–HAAS ALGORITHM FOR THE AND MODEL
This is considered an edge-chasing, probe-based algorithm.
It is also considered one of the best deadlock detection algorithms for distributed
systems.
If a process makes a request for a resource which fails or times out, the process generates a
probe message and sends it to each of the processes holding one or more of its requested
resources.
This algorithm uses a special message called a probe, which is a triplet (i, j, k), denoting that it
belongs to a deadlock detection initiated for process Pi and it is being sent by the home
site of process Pj to the home site of process Pk.
Each probe message contains the following information:
➢ the id of the process that is blocked (the one that initiates the probe message);
➢ the id of the process that is sending this particular version of the probe message;
➢ the id of the process that should receive this probe message.
A probe message travels along the edges of the global WFG, and a deadlock is
detected when a probe message returns to the process that initiated it.
A process Pj is said to be dependent on another process Pk if there exists a sequence of
processes Pj, Pi1, Pi2, ..., Pim, Pk such that each process except Pk in the sequence is
blocked and each process, except Pj, holds a resource for which the previous process in
the sequence is waiting.
Process Pj is said to be locally dependent upon process Pk if Pj is dependent upon Pk
and both the processes are on the same site.
When a process receives a probe message, it checks to see if it is also waiting for
resources.
If not, it is currently using the needed resource and will eventually finish and release the
resource.
If it is waiting for resources, it passes on the probe message to all processes it knows to be
holding resources it has itself requested.
The process first modifies the probe message, changing the sender and receiver ids.
If a process receives a probe message that it recognizes as having initiated, it
knows there is a cycle in the system and thus a deadlock.

Data structures
Each process Pi maintains a boolean array dependent_i, where dependent_i(j) is true only if
Pi knows that Pj is dependent on it. Initially, dependent_i(j) is false for all i and j.

Fig : Chandy–Misra–Haas algorithm for the AND model
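The following is a simplified, single-address-space Python sketch of the probe propagation described above. It only illustrates the edge-chasing idea: the structures blocked and holds and the in-memory queue are assumptions for the example, whereas the real algorithm exchanges probe messages between sites.

# Sketch of Chandy-Misra-Haas edge chasing for the AND model (single address space).
# blocked[p] is True if process p is blocked; holds[p] is the set of processes
# holding resources that p is waiting for (p's outgoing WFG edges).
def detect_deadlock(initiator, blocked, holds):
    dependent = set()                 # processes known to be dependent on the initiator
    queue = [(initiator, initiator, k) for k in holds[initiator]]   # probe(i, j, k)
    while queue:
        i, j, k = queue.pop(0)
        if not blocked.get(k, False):
            continue                  # an active process discards the probe
        if k == i:
            return True               # the probe returned to the initiator: deadlock
        if k not in dependent:
            dependent.add(k)
            # k forwards the probe to every process holding a resource it waits for
            queue.extend((i, k, m) for m in holds.get(k, ()))
    return False

blocked = {"P1": True, "P2": True, "P3": True}
holds = {"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}}
print(detect_deadlock("P1", blocked, holds))    # True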

Performance analysis
In the algorithm, one probe message is sent on every edge of the WFG which
connects processes on two sites.
The algorithm exchanges at most m(n − 1)/2 messages to detect a deadlock that
involves m processes and spans over n sites.
The size of messages is fixed and is very small (only three integer words).
The delay in detecting a deadlock is O(n).

Advantages:
• It is easy to implement.
• Each probe message is of fixed length.
• There is very little computation.
• There is very little overhead.
• There is no need to construct a graph, nor to pass graph information to other sites.
• This algorithm does not find false (phantom) deadlocks.
• There is no need for special data structures.

3.9 CHANDY MISRA HAAS ALGORITHM FOR THE OR MODEL


A blocked process determines if it is deadlocked by initiating a diffusion computation.
Two types of messages are used in a diffusion computation:
➢ query(i, j, k)
➢ reply(i, j, k)
denoting that they belong to a diffusion computation initiated by a process Pi and are
being sent from process Pj to process Pk.
A blocked process initiates deadlock detection by sending query messages to all
processes in its dependent set.
If an active process receives a query or reply message, it discards it. When a blocked
process Pk receives a query(i, j, k) message, it takes the following actions:
1. If this is the first query message received by Pk for the deadlock detection
initiated by Pi (called the engaging query), then it propagates the query to all the
processes in its dependent set and sets a local variable numk(i) to the number of
query messages sent.
2. If this is not the engaging query, then Pk returns a reply message to it
immediately, provided Pk has been continuously blocked since it received
the corresponding engaging query. Otherwise, it discards the query.
• Process Pk maintains a boolean variable waitk(i) that denotes the fact that it
has been continuously blocked since it received the last engaging query
from process Pi.
• When a blocked process Pk receives a reply(i, j, k) message, it
decrements numk(i) only if waitk(i) holds.
• A process sends a reply message in response to an engaging query only after
it has received a reply to every query message it has sent out for this engaging
query.
• The initiator process detects a deadlock when it has received reply messages
to all the query messages it has sent out.

Fig: Chandy–Misra–Haas algorithm for the OR model
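Below is a rough Python sketch of the diffusion computation described above, collapsed into a single address space with recursion standing in for the query/reply messages. The structures blocked and dep are assumptions made for illustration; message delays and the wait/num bookkeeping of the real algorithm are abstracted away.

# Sketch of the OR-model diffusion computation in a single address space.
# blocked[p]: True if p is blocked; dep[p]: p's dependent set (processes it sent
# OR requests to).  All names are illustrative.
def or_model_deadlocked(initiator, blocked, dep):
    engaged = {initiator}                 # processes that already received an engaging query

    def query(k):
        # An active process discards the query, so no reply ever comes back.
        if not blocked.get(k, False):
            return False
        if k in engaged:
            return True                   # non-engaging query: reply immediately
        engaged.add(k)
        # k replies to its engaging query only after replies to all queries it sent
        return all(query(m) for m in dep.get(k, ()))

    # the initiator declares deadlock when replies arrive for all its queries
    return blocked.get(initiator, False) and all(query(m) for m in dep.get(initiator, ()))

blocked = {"P1": True, "P2": True, "P3": True}
dep = {"P1": {"P2", "P3"}, "P2": {"P3"}, "P3": {"P1"}}
print(or_model_deadlocked("P1", blocked, dep))   # True: the processes form a knot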

Performance analysis
For every deadlock detection, the algorithm exchanges e query messages and e reply
messages, where e = n(n – 1) is the number of edges.

UNIT IV CONSENSUS AND RECOVERY

Consensus and Agreement Algorithms: Problem Definition – Overview of Results – Agreement in a


Failure-Free System(Synchronous and Asynchronous) – Agreement in Synchronous Systems with
Failures; Checkpointing and Rollback Recovery: Introduction – Background and Definitions – Issues
in Failure Recovery – Checkpoint-based Recovery – Coordinated Checkpointing Algorithm –
– Algorithm for Asynchronous Checkpointing and Recovery

Problem definition
Agreement among the processes in a distributed system is a fundamental requirement for a
wide range of applications. Many forms of coordination require the processes to exchange
information to negotiate with one another and eventually reach a common understanding or
agreement, before taking application-specific actions. A classical example is that of the
commit decision in database systems, wherein the processes collectively decide whether to
commit or abort a transaction that they participate in.
We first state some assumptions underlying our study of agreement algorithms:
• Failure models Among the n processes in the system, at most f processes can be faulty. A
faulty process can behave in any manner allowed by the failure model assumed. The various
failure models include fail-stop, send omission and receive omission, and Byzantine failures.
• Synchronous/asynchronous communication If a failure-prone process chooses to send a
message to process Pi but fails, then Pi cannot detect the non-arrival of the message in an
asynchronous system. In a synchronous system, however, the scenario in which a message
has not been sent can be recognized by the intended recipient, at the end of the round.
• Network connectivity The system has full logical connectivity, i.e., each process can
communicate with any other by direct message passing.
• Sender identification A process that receives a message always knows the identity of the
sender process.
• Channel reliability The channels are reliable, and only the processes may fail (under one of
various failure models).
• Authenticated vs. non-authenticated messages With unauthenticated messages, when a
faulty process relays a message to other processes, (i) it can forge the message and claim that
it was received from another process, and (ii) it can also tamper with the contents of a
received message before relaying it. When a process receives a message, it has no way to
verify its authenticity. An unauthenticated message is also called an oral message or an
unsigned message. Using authentication via techniques such as digital signatures, it is easier
to solve the agreement problem because, if some process forges a message or tampers with
the contents of a received message before relaying it, the recipient can detect the forgery or
tampering. Thus, faulty processes can inflict less damage.
• Agreement variable The agreement variable may be boolean or multivalued, and need not
be an integer.

The Byzantine agreement


The Byzantine agreement problem requires a designated process, called the source process,
with an initial value, to reach agreement with the other processes about its initial value,
subject to the following conditions:
• Agreement All non-faulty processes must agree on the same value.
• Validity If the source process is non-faulty, then the agreed upon value by all the non-faulty
processes must be the same as the initial value of the source.
• Termination Each non-faulty process must eventually decide on a value. The validity
condition rules out trivial solutions, such as one in which the agreed upon value is a constant.
The consensus problem
The consensus problem differs from the Byzantine agreement problem in that each process
has an initial value and all the correct processes must agree on a single value
• Agreement All non-faulty processes must agree on the same (single) value.
• Validity If all the non-faulty processes have the same initial value, then the agreed upon
value by all the non-faulty processes must be that same value.
• Termination Each non-faulty process must eventually decide on a value.
The interactive consistency problem
The interactive consistency problem differs from the Byzantine agreement problem in that
each process has an initial value, and all the correct processes must agree upon a set of
values, with one value for each process
• Agreement All non-faulty processes must agree on the same array of values A[v1…vn]
• Validity If process i is non-faulty and its initial value is vi, then all nonfaulty processes
agree on vi as the ith element of the array A. If process j is faulty, then the non-faulty
processes can agree on any value for A[j].
• Termination Each non-faulty process must eventually decide on the array A
Overview of results:
The table below summarizes whether agreement is attainable, for message-passing and shared
memory systems.

Failure mode: No failure
  Synchronous system: agreement attainable; common knowledge attainable
  Asynchronous system: agreement attainable; concurrent common knowledge attainable
Failure mode: Crash failure
  Synchronous system: agreement attainable; up to f < n processes may crash
  Asynchronous system: agreement not attainable
Failure mode: Byzantine failure
  Synchronous system: agreement attainable; up to f ≤ ⌊(n − 1)/3⌋ Byzantine processes
  Asynchronous system: agreement not attainable

AGREEMENT IN A FAILURE-FREE SYSTEM (SYNCHRONOUS OR


ASYNCHRONOUS)
In a failure-free system, consensus can be reached by collecting information from the
different processes, arriving at a "decision," and distributing this decision in the system.
A distributed mechanism would have each process broadcast its values to others, and each
process computes the same function on the values received.
The decision can be reached by using an application-specific function, some simple examples
being the majority, max, and min functions. Algorithms to collect the initial values and then
distribute the decision may be based on token circulation on a logical ring, or the three-phase
tree-based broadcast–convergecast–broadcast, or direct communication with all nodes.
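A tiny Python sketch of this failure-free scheme, assuming an all-to-all exchange and min as the common decision function (the process names and values are illustrative):

# Sketch: failure-free consensus by all-to-all exchange and a common decision function.
initial = {"P1": 7, "P2": 3, "P3": 9}

# each process "broadcasts" its value; with no failures every process
# ends up with the same multiset of values
received = {p: list(initial.values()) for p in initial}

# every process applies the same application-specific function, e.g. min
decision = {p: min(vals) for p, vals in received.items()}
print(decision)    # {'P1': 3, 'P2': 3, 'P3': 3}: all processes agree on 3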
AGREEMENT IN (MESSAGE-PASSING) SYNCHRONOUS SYSTEMS WITH
FAILURES
CONSENSUS ALGORITHM FOR CRASH FAILURES (SYNCHRONOUS SYSTEM)
• The consensus algorithm is for n processes, of which up to f (f < n) may fail in the
fail-stop failure model.
• Here the consensus variable x is an integer value; each process has an initial value xi. If up to f
failures are to be tolerated, then the algorithm has f + 1 rounds. In each round, a process i sends
the value of its variable xi to all other processes if that value has not been sent before.
• Of all the values received within a round and its own value xi at the start of the round,
the process takes the minimum and updates xi. After f + 1 rounds, the local value xi is guaranteed
to be the consensus value.
• If one process among three processes is faulty, then f = 1, so agreement requires f + 1,
that is, two rounds.
• Suppose the faulty process sends 0 to one process and 1 to another in the first round. In the
second round, the process that received 0 broadcasts 0 and the process that received 1
broadcasts 1, so at the end of the second round every non-faulty process has seen both values
and takes the same minimum.

• The agreement condition is satisfied because in the f + 1 rounds, there must be at least one round in which
no process failed.
• In this round, say round r, all the processes that have not failed so far succeed in broadcasting their
values, and all these processes take the minimum of the values broadcast and received in that round.
• Thus, the local values at the end of the round are the same, say x_i^r, for all non-failed processes.
• In further rounds, only this value may be sent by each process at most once, and no process i will update
its value x_i^r.
• The validity condition is satisfied because processes do not send fictitious values in this failure model.
• For all i, if the initial value is identical, then the only value sent by any process is the value that has been
agreed upon as per the agreement condition.
• The termination condition is seen to be satisfied.
Complexity: The algorithm requires f + 1 rounds, where f < n, and the number of messages is O(n^2) in
each round; each message carries one integer. Hence the total number of messages is O((f + 1) · n^2),
since (f + 1) is the total number of rounds and in each round n^2 messages are required.
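A small Python simulation of the (f + 1)-round minimum-value algorithm, under the simplifying assumption that a crashing process stops sending for the whole round (in a real crash it may reach only a subset of processes in its failing round); all names and the crash schedule are illustrative:

# Sketch: simulation of (f+1)-round min-value consensus under crash faults.
# values[p] is the initial integer of process p; crash_round[p] (if given) is the
# round in which p crashes and stops sending.
def crash_consensus(values, f, crash_round=None):
    procs = list(values)
    x = dict(values)                            # current estimate x_i at each process
    already_sent = {p: set() for p in procs}
    for rnd in range(1, f + 2):                 # f + 1 rounds
        sent = {}
        for p in procs:
            if crash_round and rnd >= crash_round.get(p, f + 2):
                continue                        # p has crashed and no longer sends
            if x[p] not in already_sent[p]:     # send x_i only if not sent before
                sent[p] = x[p]
                already_sent[p].add(x[p])
        for p in procs:                         # take the minimum of own and received values
            x[p] = min([x[p]] + list(sent.values()))
    return x

print(crash_consensus({"P1": 5, "P2": 2, "P3": 8}, f=1))
# all processes end with 2 after f + 1 = 2 rounds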

Consensus algorithms for Byzantine failures (synchronous system)


STEPS FOR BYZANTINE GENERALS (ITERATIVE FORMULATION),
SYNCHRONOUS, MESSAGE-PASSING:
STEPS FOR BYZANTINE GENERALS (RECURSIVE FORMULATION),
SYNCHRONOUS, MESSAGE-PASSING:

CODE FOR THE PHASE KING ALGORITHM:

Each phase has a unique "phase king" derived, say, from PID.
Each phase has two rounds:
1. In the first round, each process sends its estimate to all other processes.
2. In the second round, the "phase king" process arrives at an estimate based on the
values it received in the first round, and broadcasts its new estimate to all others.

Fig. Message pattern for the phase-king algorithm.

PHASE KING ALGORITHM CODE:

The algorithm has (f + 1) phases, uses (f + 1)[(n − 1)(n + 1)] messages, and can tolerate up to
f < ⌈n/4⌉ malicious processes.

Correctness Argument

1. Among the f + 1 phases, there is at least one phase k in which the phase king is non-malicious.
2. In phase k, all non-malicious processes Pi and Pj will have the same estimate of the
consensus value as Pk does. There are three cases:
• Pi and Pj both use their own majority values (Pi's mult > n/2 + f).
• Pi uses its majority value; Pj uses the phase king's tie-breaker value (Pi's mult >
n/2 + f, and Pj's mult > n/2 for the same value).
• Pi and Pj both use the phase king's tie-breaker value (in the phase in which Pk
is non-malicious, it sends the same value to Pi and Pj).

In all three cases, Pi and Pj end up with the same value as their estimate.
If all non-malicious processes have the value x at the start of a phase, they
will continue to have x as the consensus value at the end of the phase.
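The sketch below simulates the two-round phase structure in Python. It is only a demonstration of the control flow, with the simplifying assumption that a Byzantine process reports the same (wrong) value to everyone, which is weaker than the general Byzantine behaviour the algorithm actually tolerates; all names are illustrative.

# Sketch of the phase-king control flow (single address space, illustrative only).
from collections import Counter

def phase_king(values, f, byz=frozenset()):
    procs = sorted(values)                   # phase k's king is procs[k]
    n = len(procs)
    est = dict(values)
    for k in range(f + 1):                   # f + 1 phases, two rounds each
        # round 1: everyone broadcasts its estimate (Byzantine processes report 0 here)
        reported = {p: (0 if p in byz else est[p]) for p in procs}
        majority, mult = {}, {}
        for p in procs:
            counts = Counter(reported.values())
            majority[p], mult[p] = counts.most_common(1)[0]
        # round 2: the phase king broadcasts its majority value as the tie-breaker
        king = procs[k]
        tiebreak = 0 if king in byz else majority[king]
        for p in procs:
            if p not in byz:
                est[p] = majority[p] if mult[p] > n / 2 + f else tiebreak
    return {p: est[p] for p in procs if p not in byz}

# 5 processes, 1 Byzantine process (n > 4f holds)
print(phase_king({"P1": 1, "P2": 1, "P3": 0, "P4": 1, "P5": 0}, f=1, byz={"P5"}))
# all non-faulty processes decide on the same value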
Checkpointing and rollback recovery: Introduction
• Rollback recovery protocols restore the system to a consistent state after a failure.
• They achieve fault tolerance by periodically saving the state of a process during failure-free
execution.
• They treat a distributed system application as a collection of processes that communicate
over a network.
Checkpoints
The saved state is called a checkpoint, and the procedure of restarting from a previously check
pointed state is called rollback recovery. A checkpoint can be saved on either the stable storage
or the volatile storage
Why is rollback recovery of distributed systems complicated?
Messages induce inter-process dependencies during failure-free operation
Rollback propagation
The dependencies among messages may force some of the processes that did not fail to roll back.
This phenomenon of cascaded rollback is called the domino effect.
Uncoordinated check pointing
If each process takes its checkpoints independently, then the system cannot avoid the domino
effect – this scheme is called independent or uncoordinated check pointing
Techniques that avoid domino effect
1. Coordinated check pointing rollback recovery - Processes coordinate their checkpoints to
form a system-wide consistent state
2. Communication-induced check pointing rollback recovery - Forces each process to take
checkpoints based on information piggybacked on the application.
3. Log-based rollback recovery - Combines check pointing with logging of non-
deterministic events
• relies on piecewise deterministic (PWD) assumption.
Background and definitions
System model
• A distributed system consists of a fixed number of processes, P1, P2, ..., PN, which
communicate only through messages.
• Processes cooperate to execute a distributed application and interact with the outside
world by receiving and sending input and output messages, respectively.
• Rollback-recovery protocols generally make assumptions about the reliability of the
inter-process communication.
• Some protocols assume that the communication uses first-in-first-out (FIFO) order, while
other protocols assume that the communication subsystem can lose, duplicate, or reorder
messages.
• Rollback-recovery protocols therefore must maintain information about the internal
interactions among processes and also the external interactions with the outside world.

An example of a distributed system with three processes.

A local checkpoint
• All processes save their local states at certain instants of time
• A local check point is a snapshot of the state of the process at a given instance
• Assumption
– A process stores all local checkpoints on the stable storage
– A process is able to roll back to any of its existing local checkpoints

• 𝐶𝑖,𝑘 – The kth local checkpoint at process 𝑃𝑖


• 𝐶𝑖,0 – A process 𝑃𝑖 takes a checkpoint 𝐶𝑖,0 before it starts execution
Consistent states
• A global state of a distributed system is a collection of the individual states of all
participating processes and the states of the communication channels
• Consistent global state
– a global state that may occur during a failure-free execution of the distributed
computation
– if a process's state reflects a message receipt, then the state of the
corresponding sender must reflect the sending of the message
• A global checkpoint is a set of local checkpoints, one from each process
• A consistent global checkpoint is a global checkpoint such that no message is sent by a
process after taking its local point that is received by another process before taking its
checkpoint.
• For instance, Figure shows two examples of global states.
• The state in fig (a) is consistent and the state in Figure (b) is inconsistent.
• Note that the consistent state in Figure (a) shows message m1 to have been sent but not
yet received, but that is alright.
• The state in Figure (a) is consistent because it represents a situation in which, for every
message that has been received, there is a corresponding message send event.
• The state in Figure (b) is inconsistent because process P2 is shown to have received m2
but the state of process P1 does not reflect having sent it.
• Such a state is impossible in any failure-free, correct computation. Inconsistent states
occur because of failures.
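The consistency condition above can be phrased as a simple check: every recorded receive must have a matching recorded send. A minimal Python sketch follows; the per-process sets of recorded sends and receives are assumed data structures introduced only for this illustration.

# Sketch: checking whether a global checkpoint is consistent.
# sent_recorded[p] / recv_recorded[p] are the message ids whose send / receive
# events are recorded in p's local checkpoint.
def is_consistent(sent_recorded, recv_recorded):
    all_sent = set().union(*sent_recorded.values())
    all_received = set().union(*recv_recorded.values())
    # consistent iff every recorded receive has a corresponding recorded send
    return all_received <= all_sent

# state (a): m1 sent but not yet received -> consistent
print(is_consistent({"P1": {"m1"}, "P2": set()}, {"P1": set(), "P2": set()}))   # True
# state (b): P2 recorded the receipt of m2 but P1 has not recorded sending it
print(is_consistent({"P1": set(), "P2": set()}, {"P1": set(), "P2": {"m2"}}))   # False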
Interactions with outside world
A distributed system often interacts with the outside world to receive input data or deliver the
outcome of a computation. If a failure occurs, the outside world cannot be expected to roll back.
For example, a printer cannot roll back the effects of printing a character
Outside World Process (OWP)
• It is a special process that interacts with the rest of the system through message passing.
• It is therefore necessary that the outside world see a consistent behavior of the system
despite failures.
• Thus, before sending output to the OWP, the system must ensure that the state from
which the output is sent will be recovered despite any future failure.
A common approach is to save each input message on the stable storage before allowing the
application program to process it.
An interaction with the outside world to deliver the outcome of a computation is shown on the
process-line by the symbol “||”.
Different types of Messages

1. In-transit messages
• messages that have been sent but not yet received
2. Lost messages
• messages whose send is done but whose receive is undone due to rollback
3. Delayed messages
• messages whose receive is not recorded because the receiving process was
either down or the message arrived after rollback
4. Orphan messages
• messages with receive recorded but send not recorded
• do not arise if processes roll back to a consistent global state
5. Duplicate messages
• arise due to message logging and replaying during process recovery

In-transit messages
In Figure , the global state {C1,8 , C2, 9 , C3,8, C4,8} shows that message m1 has been sent but
not yet received. We call such a message an in-transit message. Message m2 is also an in-transit
message.
Delayed messages
Messages whose receive is not recorded because the receiving process was either down or the
message arrived after the rollback of the receiving process, are called delayed messages. For
example, messages m2 and m5 in Figure are delayed messages.

Lost messages
Messages whose send is not undone but whose receive is undone due to rollback are called lost
messages. This type of message occurs when the process rolls back to a checkpoint prior to
reception of the message while the sender does not roll back beyond the send operation of the
message. In the figure, message m1 is a lost message.
Duplicate messages
• Duplicate messages arise due to message logging and replaying during process
recovery. For example, in Figure, message m4 was sent and received before the
rollback. However, due to the rollback of process P4 to C4,8 and process P3 to C3,8,
both send and receipt of message m4 are undone.
• When process P3 restarts from C3,8, it will resend message m4.
• Therefore, P4 should not replay message m4 from its log.
• If P4 replays message m4, then message m4 is called a duplicate message.
Issues in failure recovery
In a failure recovery, we must not only restore the system to a consistent state, but also
appropriately handle messages that are left in an abnormal state due to the failure and recovery

• The computation comprises three processes Pi, Pj, and Pk, connected through a
communication network. The processes communicate solely by exchanging messages over fault-
free, FIFO communication channels.
• Processes Pi, Pj , and Pk, have taken checkpoints {Ci,0, Ci,1}, {Cj,0, Cj,1, Cj,2}, and {Ck,0,
Ck,1}, respectively, and these processes have exchanged messages A to J
Suppose process Pi fails at the instance indicated in the figure. All the contents of the volatile
memory of Pi are lost and, after Pi has recovered from the failure, the system needs to be
restored to a consistent global state from where the processes can resume their execution.
• Process Pi’s state is restored to a valid state by rolling it back to its most recent checkpoint
Ci,1. To restore the system to a consistent state, the process Pj rolls back to checkpoint Cj,1
because the rollback of process Pi to checkpoint Ci,1 created an orphan message H (the receive
event of H is recorded at process Pj while the send event of H has been undone at process Pi).
• Pj does not roll back to checkpoint Cj,2 but to checkpoint Cj,1. An orphan message I is created
due to the roll back of process Pj to checkpoint Cj,1. To eliminate this orphan message, process
Pk rolls back to checkpoint Ck,1.
• Messages C, D, E, and F are potentially problematic. Message C is in transit during the failure
and it is a delayed message. The delayed message C has several possibilities: C might arrive at
process Pi before it recovers, it might arrive while Pi is recovering, or it might arrive after Pi has
completed recovery. Each of these cases must be dealt with correctly.
• Message D is a lost message since the send event for D is recorded in the restored state for
process Pj , but the receive event has been undone at process Pi. Process Pj will not resend D
without an additional mechanism.
• Messages E and F are delayed orphan messages and pose perhaps the most serious problem of
all the messages. When messages E and F arrive at their respective destinations, they must be
discarded since their send events have been undone. Processes, after resuming execution from
their checkpoints, will generate both of these messages.
• Lost messages like D can be handled by having processes keep a message log of all the sent
messages. So when a process restores to a checkpoint, it replays the messages from its log to
handle the lost message problem.
• Overlapping failures further complicate the recovery process. If overlapping failures are to be
tolerated, a mechanism must be introduced to deal with amnesia and the resulting
inconsistencies.
Checkpoint-based recovery
Checkpoint-based rollback-recovery techniques can be classified into three categories:
1. Uncoordinated checkpointing
2. Coordinated checkpointing
3. Communication-induced checkpointing

1. Uncoordinated Checkpointing
• Each process has autonomy in deciding when to take checkpoints
• Advantages
The lower runtime overhead during normal execution
• Disadvantages
1. Domino effect during a recovery
2. Recovery from a failure is slow because processes need to iterate to find a
consistent set of checkpoints
3. Each process maintains multiple checkpoints and periodically invokes a
garbage collection algorithm
4. Not suitable for applications with frequent output commits
• The processes record the dependencies among their checkpoints caused by message
exchange during failure-free operation

• The following direct dependency tracking technique is commonly used in uncoordinated


checkpointing.
Direct dependency tracking technique
• Assume each process Pi starts its execution with an initial checkpoint Ci,0
• Ii,x : checkpoint interval, the interval between Ci,x−1 and Ci,x
• When Pj receives a message m during Ij,y that was sent by Pi during Ii,x, it records the
dependency from Ii,x to Ij,y, which is later saved onto stable storage when Pj takes Cj,y

• When a failure occurs, the recovering process initiates rollback by broadcasting a


dependency request message to collect all the dependency information maintained by
each process.
• When a process receives this message, it stops its execution and replies with the
dependency information saved on the stable storage as well as with the dependency
information, if any, which is associated with its current state.
• The initiator then calculates the recovery line based on the global dependency
information and broadcasts a rollback request message containing the recovery line.
• Upon receiving this message, a process whose current state belongs to the recovery line
simply resumes execution; otherwise, it rolls back to an earlier checkpoint as indicated by
the recovery line.
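A rough Python sketch of how the initiator could compute a recovery line from the collected direct dependencies; the dependency representation and the interval/checkpoint indexing below are assumptions made only for this example.

# Sketch: computing a recovery line from direct dependencies (illustrative).
# deps is a set of ((i, x), (j, y)) pairs meaning: a message sent in interval I_{i,x}
# was received in interval I_{j,y}.  latest[p] is p's most recent checkpoint index.
def recovery_line(latest, deps, failed, restored):
    line = dict(latest)
    line[failed] = restored                    # the failed process rolls back first
    changed = True
    while changed:                             # iterate until no orphan receive remains
        changed = False
        for (i, x), (j, y) in deps:
            # the receive in I_{j,y} is an orphan if the send in I_{i,x} was undone
            if line[i] < x and line[j] >= y:
                line[j] = y - 1                # roll P_j back past the orphan receive
                changed = True
    return line

latest = {"Pi": 2, "Pj": 2, "Pk": 2}
deps = {(("Pi", 2), ("Pj", 2)), (("Pj", 2), ("Pk", 2))}
print(recovery_line(latest, deps, failed="Pi", restored=1))
# {'Pi': 1, 'Pj': 1, 'Pk': 1}: the rollback cascades along the dependencies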
2. Coordinated Checkpointing
In coordinated checkpointing, processes orchestrate their checkpointing activities so that all
local checkpoints form a consistent global state
Types
1. Blocking Checkpointing: After a process takes a local checkpoint, to prevent orphan
messages, it remains blocked until the entire checkpointing activity is complete
Disadvantages: The computation is blocked during the checkpointing
2. Non-blocking Checkpointing: The processes need not stop their execution while taking
checkpoints. A fundamental problem in coordinated checkpointing is to prevent a process
from receiving application messages that could make the checkpoint inconsistent.
Example (a) : Checkpoint inconsistency
• Message m is sent by 𝑃0 after receiving a checkpoint request from the checkpoint
coordinator
• Assume m reaches 𝑃1 before the checkpoint request
• This situation results in an inconsistent checkpoint since checkpoint 𝐶1,𝑥 shows the
receipt of message m from 𝑃0, while checkpoint 𝐶0,𝑥 does not show m being sent from
𝑃0
Example (b) : A solution with FIFO channels
• If channels are FIFO, this problem can be avoided by preceding the first post-checkpoint
message on each channel by a checkpoint request, forcing each process to take a
checkpoint before receiving the first post-checkpoint message

Impossibility of min-process non-blocking checkpointing


• A min-process, non-blocking checkpointing algorithm is one that forces only a minimum
number of processes to take a new checkpoint, and at the same time it does not force any
process to suspend its computation.

Algorithm
• The algorithm consists of two phases. During the first phase, the checkpoint initiator
identifies all processes with which it has communicated since the last checkpoint and
sends them a request.
• Upon receiving the request, each process in turn identifies all processes it has
communicated with since the last checkpoint and sends them a request, and so on, until
no more processes can be identified.
• During the second phase, all processes identified in the first phase take a checkpoint. The
result is a consistent checkpoint that involves only the participating processes.
• In this protocol, after a process takes a checkpoint, it cannot send any message until the
second phase terminates successfully, although receiving a message after the checkpoint
has been taken is allowable.
3. Communication-induced Checkpointing
Communication-induced checkpointing is another way to avoid the domino effect, while
allowing processes to take some of their checkpoints independently. Processes may be forced to
take additional checkpoints
Two types of checkpoints
1. Autonomous checkpoints
2. Forced checkpoints
The checkpoints that a process takes independently are called local checkpoints, while those that
a process is forced to take are called forced checkpoints.
• Communication-induced checkpointing piggybacks protocol-related information on
each application message
• The receiver of each application message uses the piggybacked information to determine
if it has to take a forced checkpoint to advance the global recovery line
• The forced checkpoint must be taken before the application may process the contents of
the message
• In contrast with coordinated check pointing, no special coordination messages are
exchanged
Two types of communication-induced checkpointing
1. Model-based checkpointing
2. Index-based checkpointing.
Model-based checkpointing
• Model-based checkpointing prevents patterns of communications and checkpoints
that could result in inconsistent states among the existing checkpoints.
• No control messages are exchanged among the processes during normal operation.
All information necessary to execute the protocol is piggybacked on application
messages
• There are several domino-effect-free checkpoint and communication model.
• The MRS (mark, send, and receive) model of Russell avoids the domino effect by
ensuring that within every checkpoint interval all message receiving events precede
all message-sending events.
Index-based checkpointing.
• Index-based communication-induced checkpointing assigns monotonically increasing
indexes to checkpoints, such that the checkpoints having the same index at different
processes form a consistent state.

KOO AND TOUEG COORDINATED CHECKPOINTING AND RECOVERY


TECHNIQUE:
• The Koo and Toueg coordinated checkpointing and recovery technique takes a consistent set
of checkpoints and avoids the domino effect and livelock problems during recovery.
• It includes two parts: the checkpointing algorithm and the recovery algorithm.

A. The Checkpointing Algorithm


The checkpoint algorithm makes the following assumptions about the distributed system:
• Processes communicate by exchanging messages through communication channels.
• Communication channels are FIFO.
• Assume that end-to-end protocols (such as the sliding window protocol) exist to cope with
message loss due to rollback recovery and communication failure.
• Communication failures do not partition the network.
The checkpoint algorithm takes two kinds of checkpoints on the stable storage: Permanent and
Tentative.
A permanent checkpoint is a local checkpoint at a process and is a part of a consistent global
checkpoint.
A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on the
successful termination of the checkpoint algorithm.
The algorithm consists of two phases.
First Phase
1. An initiating process Pi takes a tentative checkpoint and requests all other processes to
take tentative checkpoints. Each process informs Pi whether it succeeded in taking a
tentative checkpoint.
2. A process says “no” to a request if it fails to take a tentative checkpoint
3. If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides
that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the
tentative checkpoints should be thrown-away.
Second Phase
1. Pi informs all the processes of the decision it reached at the end of the first phase.
2. A process, on receiving the message from Pi will act accordingly.
3. Either all or none of the processes advance the checkpoint by taking permanent
checkpoints.
4. The algorithm requires that after a process has taken a tentative checkpoint, it cannot
send messages related to the basic computation until it is informed of Pi’s decision.
Correctness: for two reasons
i. Either all or none of the processes take permanent checkpoint
ii. No process sends message after taking permanent checkpoint
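A minimal Python sketch of the two-phase structure described above, with the request/decision messages collapsed into direct method calls; the class and variable names are illustrative, not part of the original protocol description.

# Sketch of the two-phase Koo-Toueg checkpointing structure (illustrative only).
class Process:
    def __init__(self, name):
        self.name, self.permanent, self.tentative = name, None, None
        self.willing = True                   # may be False if busy with another protocol

    def take_tentative(self, state):
        if not self.willing:
            return False                      # replies "no" to the checkpoint request
        self.tentative = state                # application sends are blocked until phase 2
        return True

    def decide(self, commit):
        if commit and self.tentative is not None:
            self.permanent = self.tentative   # tentative checkpoint becomes permanent
        self.tentative = None                 # otherwise it is discarded

def coordinated_checkpoint(initiator, others, states):
    # phase 1: initiator and all relevant processes take tentative checkpoints
    ok = all([p.take_tentative(states[p.name]) for p in [initiator] + others])
    # phase 2: initiator broadcasts the commit/abort decision
    for p in [initiator] + others:
        p.decide(ok)
    return ok

pi, pj, pk = Process("Pi"), Process("Pj"), Process("Pk")
print(coordinated_checkpoint(pi, [pj, pk], {"Pi": "s1", "Pj": "s2", "Pk": "s3"}))   # True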
An Optimization
The above protocol may cause a process to take a checkpoint even when it is not necessary for
consistency. Since taking a checkpoint is an expensive operation, we avoid taking checkpoints.

B. The Rollback Recovery Algorithm


The rollback recovery algorithm restores the system state to a consistent state after a failure. The
rollback recovery algorithm assumes that a single process invokes the algorithm. It assumes that
the checkpoint and the rollback recovery algorithms are not invoked concurrently. The rollback
recovery algorithm has two phases.
First Phase
1. An initiating process Pi sends a message to all other processes to check if they all are
willing to restart from their previous checkpoints.
2. A process may reply “no” to a restart request due to any reason (e.g., it is already
participating in a check pointing or a recovery process initiated by some other process).
3. If Pi learns that all processes are willing to restart from their previous checkpoints, Pi
decides that all processes should roll back to their previous checkpoints. Otherwise,
4. Pi aborts the roll back attempt and it may attempt a recovery at a later time.
Second Phase
1. Pi propagates its decision to all the processes.
2. On receiving Pi’s decision, a process acts accordingly.
3. During the execution of the recovery algorithm, a process cannot send messages related
to the underlying computation while it is waiting for Pi’s decision.
Correctness: Resume from a consistent state
Optimization: It may not be necessary to roll back all the processes, since some of them did not
interact with the failed process after their last checkpoints.

In the event of failure of process X, the above protocol will require
processes X, Y, and Z to restart from checkpoints x2, y2, and z2, respectively.
Process Z need not roll back because there has been no interaction between process Z and the
other two processes since the last checkpoint at Z.

ALGORITHM FOR ASYNCHRONOUS CHECKPOINTING AND RECOVERY:


This is the algorithm of Juang and Venkatesan for recovery in a system that uses asynchronous
checkpointing.
A. System Model and Assumptions
The algorithm makes the following assumptions about the underlying system:
• The communication channels are reliable, deliver the messages in FIFO order and have
infinite buffers.
• The message transmission delay is arbitrary, but finite.
• The underlying computation/application is event-driven: a process P is at state s, receives
message m, processes the message, moves to state s', and sends messages out. So the
triplet (s, m, msgs_sent) represents the state of P.
Two types of log storage are maintained:
– Volatile log: short time to access, but lost if the processor crashes. Moved to the stable
log periodically.
– Stable log: longer time to access, but retained even if the processor crashes.
B. Asynchronous Checkpointing
– After executing an event, the triplet is recorded without any synchronization with
other processes.
– A local checkpoint consists of a set of such records; records are first stored in the
volatile log and then moved to the stable log.
C. The Recovery Algorithm
Notations and data structure
The following notations and data structure are used by the algorithm:
• RCVDi←j(CkPti) represents the number of messages received by processor pi from processor
pj , from the beginning of the computation till the checkpoint CkPti.

• SENTi→j(CkPti) represents the number of messages sent by processor pi to processor pj , from


the beginning of the computation till the checkpoint CkPti.
Basic idea
• Since the algorithm is based on asynchronous check pointing, the main issue in the
recovery is to find a consistent set of checkpoints to which the system can be restored.
• The recovery algorithm achieves this by making each processor keep track of both the
number of messages it has sent to other processors as well as the number of messages it
has received from other processors.
• Whenever a processor rolls back, it is necessary for all other processors to find out if any
message has become an orphan message. Orphan messages are discovered by comparing
the number of messages sent to and received from neighboring processors.
For example, if RCVDi←j(CkPti) > SENTj→i(CkPtj) (that is, the number of messages received
by processor pi from processor pj is greater than the number of messages sent by processor pj to
processor pi, according to the current states of the processors), then one or more messages at
processor pj are orphan messages.

The Algorithm
When a processor restarts after a failure, it broadcasts a ROLLBACK message announcing that it had failed.
Procedure RollBack_Recovery
processor pi executes the following:
STEP (a)
if processor pi is recovering after a failure then
CkPti := latest event logged in the stable storage
else
CkPti := latest event that took place in pi {The latest event at pi can be either in stable or in
volatile storage.}
end if
STEP (b)
for k = 1 to N {N is the number of processors in the system} do
for each neighboring processor pj do
compute SENTi→j(CkPti)

send a ROLLBACK(i, SENTi→j(CkPti)) message to pj


end for
for every ROLLBACK(j, c) message received from a neighbor j do
if RCVDi←j(CkPti) > c {Implies the presence of orphan messages} then
find the latest event e such that RCVDi←j(e) = c {Such an event e may be in the volatile storage
or stable storage.}
CkPti := e
end if
end for
end for{for k}
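A small Python sketch of the rollback step a single processor performs on receiving ROLLBACK messages, using the SENT/RCVD comparison above. The log representation is an assumption made for this illustration; here the processor rolls back to the latest event whose receive count does not exceed the announced send count.

# Sketch of one Juang-Venkatesan rollback step for one processor (illustrative).
# log_i[e]["rcvd"][j] is RCVD_i<-j(e), the messages received from p_j up to event e.
# rollback_msgs = {j: c}, where c = SENT_j->i(CkPt_j) announced in ROLLBACK(j, c).
def rollback_step(log_i, ckpt_i, rollback_msgs):
    for j, c in rollback_msgs.items():
        # orphan messages from p_j exist while RCVD_i<-j(CkPt_i) > SENT_j->i(CkPt_j)
        while ckpt_i > 0 and log_i[ckpt_i]["rcvd"].get(j, 0) > c:
            ckpt_i -= 1                 # roll back to the latest event with RCVD <= c
    return ckpt_i

# processor X from the example: events ex0..ex3 with counts of messages from Y
log_X = [{"rcvd": {"Y": 0}}, {"rcvd": {"Y": 1}}, {"rcvd": {"Y": 1}}, {"rcvd": {"Y": 3}}]
print(rollback_step(log_X, ckpt_i=3, rollback_msgs={"Y": 2}))   # rolls back to event index 2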
D. An Example
Consider an example shown in Figure 2 consisting of three processors. Suppose processor Y
fails and restarts. If event ey2 is the latest checkpointed event at Y, then Y will restart from the
state corresponding to ey2.

Figure 2: An example of the Juang-Venkatesan algorithm.


• Because of the broadcast nature of ROLLBACK messages, the recovery algorithm is
initiated at processors X and Z.
• Initially, X, Y, and Z set CkPtX ← ex3, CkPtY ← ey2 and CkPtZ ← ez2, respectively,
and X, Y, and Z send the following messages during the first iteration:
• Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y,1) to Z;

• X sends ROLLBACK(X,2) to Y and ROLLBACK(X,0) to Z;


• Z sends ROLLBACK(Z,0) to X and ROLLBACK(Z,1) to Y.
Since RCVDX←Y(CkPtX) = 3 > 2 (2 is the value received in the ROLLBACK(Y,2) message
from Y), X will set CkPtX to ex2, satisfying RCVDX←Y(ex2) = 1 ≤ 2.
Since RCVDZ←Y(CkPtZ) = 2 > 1, Z will set CkPtZ to ez1, satisfying RCVDZ←Y(ez1) = 1 ≤ 1.
At Y, RCVDY←X(CkPtY) = 1 < 2 and RCVDY←Z(CkPtY) = 1 = SENTZ→Y(CkPtZ).
Y need not roll back further.
In the second iteration, Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y,1) to Z;

Z sends ROLLBACK(Z,1) to Y and ROLLBACK(Z,0) to X;


X sends ROLLBACK(X,0) to Z and ROLLBACK(X, 1) to Y.
If Y rolls back beyond ey3 and loses the message from X that caused ey3, X can resend this
message to Y because ex2 is logged at X and this message is available in the log. The second and
third iterations will progress in the same manner. The set of recovery points chosen at the end of
the first iteration, {ex2, ey2, ez1}, is consistent, and no further rollback occurs.
CLOUD COMPUTING
Definition of Cloud Computing – Characteristics of Cloud – Cloud Deployment
Models – Cloud Service Models – Driving Factors and Challenges of Cloud –
Virtualization – Load Balancing –Scalability and Elasticity – Replication –
Monitoring – Cloud Services and Platforms: Compute Services – Storage Services
– Application Services
Definition of cloud computing
Cloud computing is a virtualization-based technology that allows us to create, configure,
and customize applications via an internet connection. The cloud technology includes a
development platform, hard disk, software application, and database.

What is Cloud Computing

The term cloud refers to a network or the internet. It is a technology that uses remote
servers on the internet to store, manage, and access data online rather than local drives.
The data can be anything such as files, images, documents, audio, video, and more.

Characteristics of Cloud Computing

The characteristics of cloud computing are given below:

1) Agility

The cloud works in a distributed computing environment. It shares resources among


users and works very fast.

2) High availability and reliability

The availability of servers is high and more reliable because the chances of
infrastructure failure are minimum.

3) High Scalability

Cloud offers "on-demand" provisioning of resources on a large scale, without having
to engineer for peak loads.

4) Multi-Sharing

With the help of cloud computing, multiple users and applications can work more
efficiently with cost reductions by sharing common infrastructure.

5) Device and Location Independence


Cloud computing enables the users to access systems using a web browser regardless of
their location or what device they use e.g. PC, mobile phone, etc. As infrastructure is
off-site (typically provided by a third-party) and accessed via the Internet, users can
connect from anywhere.

6) Maintenance

Maintenance of cloud computing applications is easier, since they do not need to be


installed on each user's computer and can be accessed from different places. So, it
reduces the cost also.

7) Low Cost

By using cloud computing, the cost will be reduced because, to take the services of cloud
computing, an IT company need not set up its own infrastructure and pays only as per the
usage of resources.

8) Services in the pay-per-use mode

Application Programming Interfaces (APIs) are provided to the users so that they can
access services on the cloud by using these APIs and pay the charges as per the usage
of services.

*Cloud Deployment Model*

Today, organizations have many exciting opportunities to reimagine, repurpose and


reinvent their businesses with the cloud. The last decade has seen even more businesses
rely on it for quicker time to market, better efficiency, and scalability. It helps them
achieve long-term digital goals as part of their digital strategy.

The answer to which cloud model is an ideal fit for a business depends on your
organization's computing and business needs. Choosing the right one from the various
types of cloud service deployment models is essential. It would ensure your business is
equipped with the performance, scalability, privacy, security, compliance & cost-
effectiveness it requires. It is important to learn and explore what different deployment
types can offer and what particular problems they can solve.

Read on as we cover the various cloud computing deployment and service models to help
discover the best choice for your business.
What Is A Cloud Deployment Model?

It works as your virtual computing environment with a choice of deployment model


depending on how much data you want to store and who has access to the Infrastructure.

Different Types Of Cloud Computing Deployment Models

Most cloud hubs have tens of thousands of servers and storage devices to enable fast
loading. It is often possible to choose a geographic area to put the data "closer" to users.
Thus, deployment models for cloud computing are categorized based on their location.
To know which model would best fit the requirements of your organization, let us first
learn about the various types.

Public Cloud

The name says it all. It is accessible to the public. Public deployment models in the cloud
are perfect for organizations with growing and fluctuating demands. It also makes a great
choice for companies with low-security concerns. Thus, you pay a cloud service provider
for networking services, compute virtualization & storage available on the public internet.
It is also a great delivery model for the teams with development and testing. Its
configuration and deployment are quick and easy, making it an ideal choice for test
environments.
Benefits of Public Cloud

o Minimal Investment - As a pay-per-use service, there is no large upfront cost and


is ideal for businesses who need quick access to resources
o No Hardware Setup - The cloud service providers fully fund the entire
Infrastructure
o No Infrastructure Management - This does not require an in-house team to utilize
the public cloud.

Limitations of Public Cloud

o Data Security and Privacy Concerns - Since it is accessible to all, it does not fully
protect against cyber-attacks and could lead to vulnerabilities.
o Reliability Issues - Since the same server network is open to a wide range of users,
it can lead to malfunction and outages
o Service/License Limitation - While there are many resources you can exchange
with tenants, there is a usage cap.
Private Cloud

Now that you understand what the public cloud could offer you, of course, you are keen
to know what a private cloud can do. Companies that look for cost efficiency and greater
control over data & resources will find the private cloud a more suitable choice.

It means that it will be integrated with your data center and managed by your IT team.
Alternatively, you can also choose to host it externally. The private cloud offers bigger
opportunities that help meet specific organizations' requirements when it comes to
customization. It's also a wise choice for mission-critical processes that may have
frequently changing requirements.

Benefits of Private Cloud

o Data Privacy - It is ideal for storing corporate data where only authorized
personnel gets access
o Security - Segmentation of resources within the same Infrastructure can help with
better access and higher levels of security.
o Supports Legacy Systems - This model supports legacy systems that cannot access
the public cloud.

Limitations of Private Cloud


o Higher Cost - With the benefits you get, the investment will also be larger than the
public cloud. Here, you will pay for software, hardware, and resources for staff
and training.
o Fixed Scalability - The hardware you choose will accordingly help you scale in a
certain direction
o High Maintenance - Since it is managed in-house, the maintenance costs also
increase.

Community Cloud

The community cloud operates in a way that is similar to the public cloud. There's just
one difference - it allows access to only a specific set of users who share common
objectives and use cases. This type of deployment model of cloud computing is managed
and hosted internally or by a third-party vendor. However, you can also choose a
combination of all three.

Benefits of Community Cloud

o Smaller Investment - A community cloud is much cheaper than the private &
public cloud and provides great performance
o Setup Benefits - The protocols and configuration of a community cloud must align
with industry standards, allowing customers to work much more efficiently.

Limitations of Community Cloud


o Shared Resources - Due to restricted bandwidth and storage capacity, community
resources often pose challenges.
o Not as Popular - Since this is a recently introduced model, it is not that popular or
available across industries

Hybrid Cloud

As the name suggests, a hybrid cloud is a combination of two or more cloud architectures.
While each model in the hybrid cloud functions differently, it is all part of the same
architecture. Further, as part of this deployment of the cloud computing model, the
internal or external providers can offer resources.

Let's understand the hybrid model better. A company will prefer storing critical data
on a private cloud, while less sensitive data can be stored on a public cloud. The hybrid
cloud is also frequently used for 'cloud bursting'. This means that if an organization runs
an application on-premises and it experiences a heavy load, it can burst into the public cloud.

Benefits of Hybrid Cloud

o Cost-Effectiveness - The overall cost of a hybrid solution decreases since it


majorly uses the public cloud to store data.
o Security - Since data is properly segmented, the chances of data theft from
attackers are significantly reduced.
o Flexibility - With higher levels of flexibility, businesses can create custom
solutions that fit their exact requirements

Limitations of Hybrid Cloud

o Complexity - It is complex setting up a hybrid cloud since it needs to integrate two


or more cloud architectures
o Specific Use Case - This model makes more sense for organizations that have
multiple use cases or need to separate critical and sensitive data.

*Cloud Service Models*

There are the following three types of cloud service models -

1. Infrastructure as a Service (IaaS)


2. Platform as a Service (PaaS)
3. Software as a Service (SaaS)

Infrastructure as a Service (IaaS)

IaaS is also known as Hardware as a Service (HaaS). It is a computing infrastructure


managed over the internet. The main advantage of using IaaS is that it helps users to avoid
the cost and complexity of purchasing and managing the physical servers.

Characteristics of IaaS

There are the following characteristics of IaaS -


o Resources are available as a service
o Services are highly scalable
o Dynamic and flexible
o GUI and API-based access
o Automated administrative tasks

Example: DigitalOcean, Linode, Amazon Web Services (AWS), Microsoft Azure,


Google Compute Engine (GCE), Rackspace, and Cisco Metacloud.

Platform as a Service (PaaS)

PaaS cloud computing platform is created for the programmer to develop, test, run, and
manage the applications.

Characteristics of PaaS

There are the following characteristics of PaaS -

o Accessible to various users via the same development application.


o Integrates with web services and databases.
o Builds on virtualization technology, so resources can easily be scaled up or down
as per the organization's need.
o Support multiple languages and frameworks.
o Provides an ability to "Auto-scale".

Example: AWS Elastic Beanstalk, Windows Azure, Heroku, Force.com, Google App
Engine, Apache Stratos, Magento Commerce Cloud, and OpenShift.

Software as a Service (SaaS)

SaaS is also known as "on-demand software". It is software in which the applications
are hosted by a cloud service provider. Users can access these applications with the help
of an internet connection and a web browser.

Characteristics of SaaS

There are the following characteristics of SaaS -

o Managed from a central location
o Hosted on a remote server
o Accessible over the internet
o Users are not responsible for hardware and software updates. Updates are applied
automatically.
o The services are purchased on the pay-as-per-use basis

Example: BigCommerce, Google Apps, Salesforce, Dropbox, ZenDesk, Cisco WebEx,
Slack, and GoToMeeting.

Difference between IaaS, PaaS, and SaaS

IaaS: It provides a virtual data center to store information and create platforms for app
development, testing, and deployment. It provides access to resources such as virtual
machines, virtual storage, etc. It is used by network architects. IaaS provides only
Infrastructure.

PaaS: It provides virtual platforms and tools to create, test, and deploy apps. It provides
runtime environments and deployment tools for applications. It is used by developers.
PaaS provides Infrastructure + Platform.

SaaS: It provides web software and apps to complete business tasks. It provides software
as a service to the end-users. It is used by end users. SaaS provides Infrastructure +
Platform + Software.

Infrastructure as a Service | IaaS

IaaS is also known as Hardware as a Service (HaaS). It is one of the layers of the cloud
computing platform. It allows customers to outsource their IT infrastructure such as
servers, networking, processing, storage, virtual machines, and other resources.
Customers access these resources on the Internet using a pay-as-per-use model.
In traditional hosting services, IT infrastructure was rented out for a specific period of
time, with pre-determined hardware configuration. The client paid for the configuration
and time, regardless of the actual use. With the help of the IaaS cloud computing platform
layer, clients can dynamically scale the configuration to meet changing requirements and
are billed only for the services actually used.

IaaS cloud computing platform layer eliminates the need for every organization to
maintain the IT infrastructure.

IaaS is offered in three models: public, private, and hybrid cloud. The private cloud
implies that the infrastructure resides on the customer's premises. In the case of the public
cloud, it is located at the cloud computing platform vendor's data center, and the hybrid
cloud is a combination of the two in which the customer selects the best of both public
and private clouds.

IaaS provider provides the following services -

1. Compute: Computing as a Service includes virtual central processing units and
virtual main memory for the VMs that are provisioned to the end-users.
2. Storage: The IaaS provider provides back-end storage for storing files.
3. Network: Network as a Service (NaaS) provides networking components such as
routers, switches, and bridges for the VMs.
4. Load balancers: Load balancing capability is provided at the infrastructure layer.

Advantages of IaaS cloud computing layer

There are the following advantages of IaaS computing layer -


1. Shared infrastructure

IaaS allows multiple users to share the same physical infrastructure.

2. Web access to the resources

IaaS allows IT users to access resources over the internet.

3. Pay-as-per-use model

IaaS providers offer services on a pay-as-per-use basis; users pay only for what they have
actually used.

4. Focus on the core business

IaaS allows organizations to focus on their core business rather than on IT infrastructure.

5. On-demand scalability

On-demand scalability is one of the biggest advantages of IaaS. Using IaaS, users do not
need to worry about upgrading software or troubleshooting issues related to hardware
components.

Disadvantages of IaaS cloud computing layer

1. Security

Security is one of the biggest issues in IaaS. Most of the IaaS providers are not able to
provide 100% security.

2. Maintenance & Upgrade

Although IaaS service providers maintain the software, they do not upgrade the
software for some organizations.

3. Interoperability issues

It is difficult to migrate a VM from one IaaS provider to another, so customers might
face problems related to vendor lock-in.

Platform as a Service | PaaS

Platform as a Service (PaaS) provides a runtime environment. It allows programmers to
easily create, test, run, and deploy web applications. You can purchase these applications
from a cloud service provider on a pay-as-per-use basis and access them over an Internet
connection. In PaaS, back-end scalability is managed by the cloud service provider, so
end-users do not need to worry about managing the infrastructure.

PaaS includes infrastructure (servers, storage, and networking) and platform


(middleware, development tools, database management systems, business intelligence,
and more) to support the web application life cycle.

Example: Google App Engine, Force.com, Joyent, Azure.

PaaS providers provide programming languages, application frameworks, databases, and
other tools:

1. Programming languages

PaaS providers provide various programming languages for the developers to develop the
applications. Some popular programming languages provided by PaaS providers are Java,
PHP, Ruby, Perl, and Go.

2. Application frameworks

PaaS providers provide application frameworks that simplify application development.
Some popular application frameworks provided by PaaS providers are Node.js, Drupal,
Joomla, WordPress, Spring, Play, Rack, and Zend.

3. Databases

PaaS providers provide various databases such as ClearDB, PostgreSQL, MongoDB, and
Redis to communicate with the applications.
4. Other tools

PaaS providers provide various other tools that are required to develop, test, and deploy
the applications.

Advantages of PaaS

There are the following advantages of PaaS -

1) Simplified Development

PaaS allows developers to focus on development and innovation without worrying about
infrastructure management.

2) Lower risk

No need for up-front investment in hardware and software. Developers only need a PC
and an internet connection to start building applications.

3) Prebuilt business functionality

Some PaaS vendors also provide pre-defined business functionality so that users can
avoid building everything from scratch and can start their projects directly.

4) Instant community

PaaS vendors frequently provide online communities where developers can get ideas,
share experiences, and seek advice from others.

5) Scalability

Applications deployed can scale from one to thousands of users without any changes to
the applications.

Disadvantages of PaaS cloud computing layer

1) Vendor lock-in

One has to write the applications according to the platform provided by the PaaS vendor,
so the migration of an application to another PaaS vendor would be a problem.

2) Data Privacy
Corporate data, whether critical or not, is private; if it is not located within the walls of
the company, there can be a risk to data privacy.

3) Integration with the rest of the systems applications

Some applications may be local while others are in the cloud, so there is a chance of
increased complexity when data in the cloud must be used together with local data.

Software as a Service | SaaS

SaaS is also known as "On-Demand Software". It is a software distribution model in


which services are hosted by a cloud service provider. These services are available to
end-users over the internet so, the end-users do not need to install any software on their
devices to access these services.

There are the following services provided by SaaS providers -

Business Services - SaaS providers provide various business services to start up a
business. SaaS business services include ERP (Enterprise Resource Planning), CRM
(Customer Relationship Management), billing, and sales.

Document Management - SaaS document management is a software application offered


by a third party (SaaS providers) to create, manage, and track electronic documents.

Example: Slack, Samepage, Box, and Zoho Forms.

Social Networks - As we all know, social networking sites are used by the general public,
so social networking service providers use SaaS to conveniently handle the general
public's information.

Mail Services - To handle an unpredictable number of users and the load on e-mail
services, many e-mail providers offer their services using SaaS.
Advantages of SaaS cloud computing layer

1) SaaS is easy to buy

SaaS pricing is based on a monthly or annual subscription fee, so it allows
organizations to access business functionality at a low cost, which is less than that of
licensed applications.

Unlike traditional software, which is sold under a license with an up-front cost (and
often an optional ongoing support fee), SaaS providers generally price their
applications using a subscription fee, most commonly a monthly or annual fee.

2. One to Many

SaaS services are offered as a one-to-many model means a single instance of the
application is shared by multiple users.

3. Less hardware required for SaaS

The software is hosted remotely, so organizations do not need to invest in additional


hardware.

4. Low maintenance required for SaaS

Software as a Service removes the need for installation, set-up, and daily maintenance for
organizations. The initial set-up cost for SaaS is typically lower than for enterprise
software. SaaS vendors price their applications based on usage parameters, such as the
number of users using the application, which makes usage easy to monitor, and updates
are applied automatically.

5. No special software or hardware versions required


All users have the same version of the software and typically access it through a web
browser. SaaS reduces IT support costs by outsourcing hardware and software
maintenance and support to the SaaS provider.

6. Multidevice support

SaaS services can be accessed from any device such as desktops, laptops, tablets, phones,
and thin clients.

7. API Integration

SaaS services easily integrate with other software or services through standard APIs.

8. No client-side installation

SaaS services are accessed directly from the service provider over an internet connection,
so no client-side software installation is required.

Disadvantages of SaaS cloud computing layer

1) Security

Actually, data is stored in the cloud, so security may be an issue for some users. However,
cloud computing is not more secure than in-house deployment.

2) Latency issue

Since data and applications are stored in the cloud at a variable distance from the end-
user, there is a possibility of greater latency when interacting with the application
compared to local deployment. Therefore, the SaaS model is not suitable for applications
that demand response times in milliseconds.

3) Total Dependency on Internet

Without an internet connection, most SaaS applications are not usable.

4) Switching between SaaS vendors is difficult

Switching SaaS vendors involves the difficult and slow task of transferring very large
data files over the internet and then converting and importing them into the new SaaS
application.
*Cloud Computing Challenges*

1. Data Security and Privacy

Data security is a major concern when switching to cloud computing. User or


organizational data stored in the cloud is critical and private. Even if the cloud service
provider assures data integrity, it is your responsibility to carry out user authentication
and authorization, identity management, data encryption, and access control. Security
issues on the cloud include identity theft, data breaches, malware infections, and a lot
more which eventually decrease the trust amongst the users of your applications. This
can in turn lead to potential loss in revenue alongside reputation and stature. Also,
dealing with cloud computing requires sending and receiving huge amounts of data at
high speed, and therefore is susceptible to data leaks.

2. Cost Management

Even though almost all cloud service providers have a "Pay As You Go" model, which
reduces the overall cost of the resources being used, there are times when the enterprise
incurs huge costs from cloud computing. Under-optimization of resources, for example
servers that are not used to their full potential, adds to the hidden costs. Degraded
application performance and sudden spikes or overages in usage also add to the overall
cost. Unused resources are another main reason why costs go up: if you turn on a service
or a cloud instance and forget to turn it off over the weekend or when it is not in use, it
increases the cost without the resources even being used.

3. Multi-Cloud Environments

Due to an increase in the options available to companies, enterprises no longer rely on a
single cloud but depend on multiple cloud service providers. Most of these companies
use hybrid cloud tactics, and close to 84% are dependent on multiple clouds. Such
environments often become difficult for the infrastructure team to manage, and the
process frequently ends up being highly complex for the IT team due to the differences
between multiple cloud providers.

4. Performance Challenges

Performance is an important factor while considering cloud-based solutions. If the


performance of the cloud is not satisfactory, it can drive away users and decrease profits.
Even a little latency while loading an app or a web page can result in a huge drop in the
percentage of users. This latency can be a product of inefficient load balancing, which
means that the server cannot efficiently split the incoming traffic so as to provide the
best user experience. Challenges also arise in the case of fault tolerance, which means
the operations continue as required even when one or more of the components fail.

5. Interoperability and Flexibility

When an organization uses a specific cloud service provider and wants to switch to
another cloud-based solution, it often turns out to be a tedious procedure, since
applications written for one cloud and its application stack must be re-written for the
other cloud. There is a lack of flexibility in switching from one cloud to another due to
the complexities involved. Handling data movement and setting up security and
networking from scratch also add to the issues encountered when changing cloud
solutions, thereby reducing flexibility.

6. High Dependence on Network

Since cloud computing deals with provisioning resources in real time, it involves
enormous amounts of data transfer to and from the servers, which is only possible with
the availability of a high-speed network. Because these data and resources are exchanged
over the network, the system is highly vulnerable when bandwidth is limited or when
there is a sudden outage. Even when enterprises can cut their hardware costs, they need
to ensure that internet bandwidth is high and that there are no network outages, or else it
can result in a potential business loss. This is therefore a major challenge for smaller
enterprises, which must maintain network bandwidth that comes at a high cost.

7. Lack of Knowledge and Expertise

Due to its complex nature and the high demand for research, working with the cloud
often ends up being a highly tedious task. It requires immense knowledge and wide
expertise in the subject. Although there are a lot of professionals in the field, they need
to constantly update themselves. Cloud computing is a highly paid field due to the
extensive gap between demand and supply: there are a lot of vacancies but very few
talented cloud engineers, developers, and professionals. Therefore, there is a need for
upskilling so these professionals can actively understand, manage, and develop cloud-
based applications with minimum issues and maximum reliability.
*Virtualization*

Virtualization is the "creation of a virtual (rather than actual) version of something, such
as a server, a desktop, a storage device, an operating system or network resources".
In other words, virtualization is a technique that allows a single physical instance of a
resource or an application to be shared among multiple customers and organizations. It
does this by assigning a logical name to a physical resource and providing a pointer to
that physical resource when it is demanded.

What is the concept behind the Virtualization?

Creation of a virtual machine over existing operating system and hardware is known as
Hardware Virtualization. A Virtual machine provides an environment that is logically
separated from the underlying hardware.

The machine on which the virtual machine is created is known as the Host Machine, and
that virtual machine is referred to as the Guest Machine.

Types of Virtualization:

1. Hardware Virtualization.
2. Operating system Virtualization.
3. Server Virtualization.
4. Storage Virtualization.

1) Hardware Virtualization:

When the virtual machine software or virtual machine manager (VMM) is installed
directly on the hardware system, it is known as hardware virtualization.

The main job of the hypervisor is to control and monitor the processor, memory, and other
hardware resources.

After virtualization of the hardware system, we can install different operating systems on
it and run different applications on those OSes.

Usage:

Hardware virtualization is mainly done for the server platforms, because controlling
virtual machines is much easier than controlling a physical server.

2) Operating System Virtualization:

When the virtual machine software or virtual machine manager (VMM) is installed on the
host operating system instead of directly on the hardware system, it is known as operating
system virtualization.
Usage:

Operating System Virtualization is mainly used for testing the applications on different
platforms of OS.

3) Server Virtualization:

When the virtual machine software or virtual machine manager (VMM) is installed
directly on the server system, it is known as server virtualization.

Usage:

Server virtualization is done because a single physical server can be divided into multiple
servers on an on-demand basis and for balancing the load.

4) Storage Virtualization:

Storage virtualization is the process of grouping the physical storage from multiple
network storage devices so that it looks like a single storage device.

Storage virtualization is also implemented by using software applications.

Usage:

Storage virtualization is mainly done for back-up and recovery purposes.

How does virtualization work in cloud computing?

Virtualization plays a very important role in cloud computing technology. Normally, in
cloud computing, users share the data present in the cloud, such as applications, but with
the help of virtualization users actually share the infrastructure.

The main use of virtualization technology is to provide standard versions of applications
to cloud users. Suppose the next version of an application is released; the cloud provider
then has to supply the latest version to all of its cloud users, which is difficult and
expensive to do directly.

To overcome this problem, virtualization technology is used: all the servers and software
applications required by cloud providers are maintained by third parties, and the cloud
providers pay for them on a monthly or annual basis.

Virtualization mainly means running multiple operating systems on a single machine
while sharing all the hardware resources. It helps provide a pool of IT resources that can
be shared in order to gain benefits in the business.

*Load Balancing*

Load balancing is a method that balances the amount of work being done across different
devices or pieces of hardware. Typically, the load is balanced between different servers
or between the CPU and hard drives in a single cloud server.

Load balancing was introduced for various reasons. One of them is to improve the speed
and performance of each single device, and the other is to protect individual devices from
hitting their limits, which would degrade their performance.

Cloud load balancing is defined as dividing workload and computing properties in cloud
computing. It enables enterprises to manage workload demands or application demands
by distributing resources among multiple computers, networks or servers. Cloud load
balancing involves managing the movement of workload traffic and demands over the
Internet.

Traffic on the Internet is growing rapidly, increasing by almost 100% annually. Therefore,
the workload on servers is increasing rapidly, leading to server overloading, especially
for popular web servers. There are two primary solutions to overcome the problem of
server overloading:
o First is a single-server solution in which the server is upgraded to a higher-
performance server. However, the new server may also be overloaded soon,
demanding another upgrade. Moreover, the upgrading process is arduous and
expensive.
o The second is a multiple-server solution in which a scalable service system is built on
a cluster of servers. For this reason, it is more cost-effective and more scalable to
build a server cluster system for network services.

Cloud-based servers can achieve more precise scalability and availability by using server
farm load balancing. Load balancing is beneficial with almost any type of service, such
as HTTP, SMTP, DNS, FTP, and POP/IMAP.

It also increases reliability through redundancy. A dedicated hardware device or program


provides the balancing service.

Different Types of Load Balancing Algorithms in Cloud Computing:

1. Static Algorithm

Static algorithms are built for systems with very little variation in load. The entire traffic
is divided equally between the servers in the static algorithm. This algorithm requires in-
depth knowledge of server resources for better performance of the processor, which is
determined at the beginning of the implementation.

However, the decision to shift load does not depend on the current state of the system.
One of the major drawbacks of the static load balancing algorithm is that load balancing
tasks take effect only after they have been created; the load cannot be shifted to other
devices during execution.

2. Dynamic Algorithm

The dynamic algorithm first finds the lightest server in the entire network and gives it
priority for load balancing. This requires real-time communication across the network,
which can add to the system's traffic. Here, the current state of the system is used
to control the load.

The characteristic of dynamic algorithms is to make load transfer decisions in the current
system state. In this system, processes can move from a highly used machine to an
underutilized machine in real time.
3. Round Robin Algorithm

As the name suggests, round robin load balancing algorithm uses round-robin method to
assign jobs. First, it randomly selects the first node and assigns tasks to other nodes in a
round-robin manner. This is one of the easiest methods of load balancing.

Processors assign each process circularly without defining any priority. It gives fast
response in case of uniform workload distribution among the processes. All processes
have different loading times. Therefore, some nodes may be heavily loaded, while others
may remain under-utilised.
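
The circular assignment described above can be sketched in a few lines of Python. This
is an illustrative sketch only; the server names and the dispatch() helper below are
hypothetical placeholders, not part of any particular load balancer product.

# Minimal round-robin dispatch sketch (illustrative only).
from itertools import cycle

servers = ["server-1", "server-2", "server-3"]   # assumed pool of nodes
rotation = cycle(servers)                        # circular iterator over the pool

def dispatch(request_id):
    # Assign the request to the next server in circular order, with no priority.
    target = next(rotation)
    print(f"request {request_id} -> {target}")
    return target

for req_id in range(7):                          # example stream of requests
    dispatch(req_id)

With a uniform workload, each server receives roughly the same number of requests;
with tasks of very different sizes, some nodes can still end up more heavily loaded than
others, as noted above.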

4. Weighted Round Robin Load Balancing Algorithm

Weighted Round Robin load balancing algorithms were developed to address the most
challenging issues of Round Robin algorithms. In this algorithm, each server is assigned
a weight, and tasks are distributed according to the weight values.

Servers with a higher capacity are given a higher weight, so the servers with the highest
weights receive more tasks. When the full load level is reached, the servers receive stable
traffic.
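
A hedged sketch of the weighted variant is given below. The weights and server names
are assumptions made for illustration; production load balancers usually spread the
weighted assignments more smoothly, but the proportion of requests per server follows
the same idea.

# Minimal weighted round-robin sketch (weights are illustrative assumptions).
from itertools import cycle

weights = {"big-server": 3, "medium-server": 2, "small-server": 1}

# Expand the pool so each server appears as many times as its weight,
# then rotate over the expanded pool.
weighted_pool = [name for name, w in weights.items() for _ in range(w)]
rotation = cycle(weighted_pool)

for req_id in range(12):
    print(f"request {req_id} -> {next(rotation)}")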

5. Opportunistic Load Balancing Algorithm

The opportunistic load balancing algorithm allows each node to be busy. It never
considers the current workload of each system. Regardless of the current workload on
each node, OLB distributes all unfinished tasks to these nodes.

Processing tasks may be executed slowly under OLB because it does not take the
execution time of each node into account, which causes bottlenecks even when some
nodes are free.

6. Minimum To Minimum Load Balancing Algorithm

Under the minimum-to-minimum (min-min) load balancing algorithm, the expected
completion time of each pending task is computed first, and the task with the minimum
completion time among all tasks is selected. The work is scheduled on the corresponding
machine according to that minimum time.

The loads on that machine are then updated, and the scheduled task is removed from the
pending list. This process continues until the final assignment is made. This algorithm
works best where many small tasks outweigh large tasks.
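
The select-and-update loop described above can be illustrated with a small Python sketch.
The task costs, machine names, and the simple model "completion time = machine's
current load + task cost" are assumptions made purely for illustration.

# Rough min-min scheduling sketch under simplifying assumptions.
tasks = {"t1": 4, "t2": 2, "t3": 6, "t4": 1}   # assumed task execution times
machines = {"m1": 0, "m2": 0}                  # current load on each machine

while tasks:
    # For every unscheduled task, find its minimum completion time over all machines.
    best = {
        t: min((machines[m] + cost, m) for m in machines)
        for t, cost in tasks.items()
    }
    # Pick the task whose minimum completion time is smallest overall.
    task = min(best, key=lambda t: best[t][0])
    finish, machine = best[task]
    machines[machine] = finish                 # schedule it and update that machine's load
    print(f"{task} -> {machine} (finishes at {finish})")
    del tasks[task]                            # remove the scheduled task from the list
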
Load balancing solutions can be categorized into two types -
o Software-based load balancers: Software-based load balancers run on standard
hardware (desktop, PC) and standard operating systems.
o Hardware-based load balancers: Hardware-based load balancers are dedicated
boxes that contain application-specific integrated circuits (ASICs) optimized for a
particular use. ASICs allow network traffic to be forwarded at high speeds and are
often used for transport-level load balancing because hardware-based load
balancing is faster than a software solution.

Major Examples of Load Balancers -

o Direct Routing Request Despatch Technique: This method of request dispatch


is similar to that implemented in IBM's NetDispatcher. A real server and load
balancer share a virtual IP address. The load balancer takes an interface built with
a virtual IP address that accepts request packets and routes the packets directly to
the selected server.
o Dispatcher-Based Load Balancing Cluster: A dispatcher performs smart load
balancing using server availability, workload, capacity and other user-defined
parameters to regulate where TCP/IP requests are sent. The dispatcher module of
a load balancer can split HTTP requests among different nodes in a cluster. The
dispatcher divides the load among multiple servers in a cluster, so services from
different nodes act like a virtual service on a single IP address; consumers interact
with the cluster as if it were a single server, without knowledge of the back-end
infrastructure.
o Linux Virtual Load Balancer: This is an open-source, enhanced load balancing
solution used to build highly scalable and highly available network services such
as HTTP, POP3, FTP, SMTP, media and caching, and Voice over Internet
Protocol (VoIP). It is a simple and powerful product designed for load
balancing and fail-over. The load balancer itself is the primary entry point to the
server cluster system. It can execute Internet Protocol Virtual Server (IPVS),
which implements transport-layer load balancing in the Linux kernel, also known
as layer-4 switching.

Types of Load Balancing


You will need to understand the different types of load balancing for your network. Server
load balancing is for relational databases, global server load balancing is for
troubleshooting in different geographic locations, and DNS load balancing ensures
domain name functionality. Load balancing can also be based on cloud-based balancers.

Network Load Balancing

Network load balancing takes advantage of network-layer information and uses it to
decide where network traffic should be sent. This is accomplished through Layer 4 load
balancing, which handles TCP/UDP traffic. It is the fastest load balancing solution, but
it cannot take application-level content into account when distributing traffic across
servers.

HTTP(S) load balancing

HTTP(S) load balancing is the oldest type of load balancing, and it relies on Layer 7,
which means that load balancing operates at the application layer. It is the most flexible
type of load balancing because it lets you make delivery decisions based on information
carried in HTTP requests.

Internal Load Balancing

It is very similar to network load balancing, but is leveraged to balance the infrastructure
internally.

Load balancers can be further divided into hardware, software and virtual load balancers.

Hardware Load Balancer

It relies on dedicated physical hardware to distribute network and application traffic.
Such devices can handle large traffic volumes, but they come with a hefty price tag and
have limited flexibility.

Software Load Balancer

It can be an open source or commercial form and must be installed before it can be used.
These are more economical than hardware solutions.

Virtual Load Balancer

It differs from a software load balancer in that it deploys the software of a hardware
load-balancing device on a virtual machine.

WHY CLOUD LOAD BALANCING IS IMPORTANT IN CLOUD COMPUTING?


Here are some of the importance of load balancing in cloud computing.

Offers better performance

The technology of load balancing is less expensive and also easy to implement. This
allows companies to work on client applications much faster and deliver better results at
a lower cost.

Helps Maintain Website Traffic

Cloud load balancing can provide scalability to control website traffic. By using effective
load balancers, it is possible to manage high-end traffic, which is achieved using network
equipment and servers. E-commerce companies that need to deal with multiple visitors
every second use cloud load balancing to manage and distribute workloads.

Can Handle Sudden Bursts in Traffic

Load balancers can handle sudden bursts in traffic. For example, when university results
are published, a website may go down because of too many requests. With a load balancer
in place, there is no need to worry about the traffic surge: whatever the size of the traffic,
load balancers divide the entire load of the website equally across different servers and
deliver maximum results in minimum response time.

Greater Flexibility

The main reason for using a load balancer is to protect the website from sudden crashes.
When the workload is distributed among different network servers or units, if a single
node fails, the load is transferred to another node. It offers flexibility, scalability and the
ability to handle traffic better.

Because of these characteristics, load balancers are beneficial in cloud environments.


This is to avoid heavy workload on a single server.

Conclusion

Thousands of people may access a website at a particular time, which makes it
challenging for the application to manage the load coming from all these requests at the
same time and can sometimes lead to system failure. Load balancing addresses this by
distributing the incoming requests across multiple servers.

*Scalability and Elasticity*


Scalability and elasticity are important characteristics of load balancing in distributed
computing systems. Let's explore each of them:
Scalability: Scalability refers to the ability of a system to handle an increasing amount of
work or accommodate a growing number of users or resources. In the context of load
balancing, scalability means that the system can efficiently distribute incoming requests
across multiple nodes or servers as the workload grows.

To achieve scalability in load balancing, several techniques are commonly employed:

1. Horizontal Scaling: This involves adding more nodes or servers to the system to handle
increased load. Load balancers distribute incoming requests across these additional
resources, allowing the system to handle a larger volume of traffic.
2. Load Balancer Redundancy: To ensure high availability and avoid single points of failure,
load balancers themselves can be scaled by implementing redundancy. Multiple load
balancers can be deployed in parallel, distributing the load across them and providing
fault tolerance. If one load balancer fails, others can take over seamlessly.
3. Dynamic Configuration: Scalable load balancing systems often have dynamic
configurations that allow for automatic adjustment of resources based on demand. This
includes dynamically adding or removing nodes from the load balancing pool based on
factors like CPU utilization, network traffic, or predefined thresholds.

Elasticity: Elasticity is closely related to scalability but emphasizes the ability of a system
to dynamically adapt its resource allocation in response to workload changes. In load
balancing, elasticity refers to the ability to scale resources up or down based on real-time
demand.

Elastic load balancing can be achieved through:

1. Auto Scaling: Auto scaling allows the system to automatically adjust the number of nodes
or servers based on predefined metrics or policies. When the workload increases, new
nodes can be provisioned to handle the additional load, and when the demand decreases,
unnecessary resources can be removed.
2. Load Balancer Health Monitoring: Elastic load balancing systems continuously monitor
the health and performance of individual nodes or servers. If a node becomes overloaded
or unresponsive, the load balancer can dynamically redirect traffic to healthier nodes,
ensuring efficient resource utilization.
3. Dynamic Load Distribution: Elastic load balancers can intelligently distribute incoming
requests based on real-time conditions. For example, they can route requests to nodes
with lower resource utilization or closer proximity to minimize latency.

By combining scalability and elasticity, load balancing systems can efficiently distribute
workload across distributed resources, ensuring optimal performance, responsiveness,
and resource utilization. These characteristics are particularly important in cloud
computing environments, where workloads can vary significantly over time.
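
As an illustration of the elasticity described above, here is a minimal Python sketch of a
threshold-based auto-scaling decision loop. The thresholds, node limits, and utilization
readings are hypothetical, and real cloud auto-scalers apply further policies (cooldown
periods, scheduled scaling, multiple metrics) on top of this basic idea.

# Hedged auto-scaling sketch; thresholds and readings are illustrative assumptions.
SCALE_UP_CPU = 0.75      # assumed upper utilization threshold
SCALE_DOWN_CPU = 0.25    # assumed lower utilization threshold
MIN_NODES, MAX_NODES = 2, 10

def autoscale(current_nodes, average_cpu):
    # Return the new node count for one monitoring interval.
    if average_cpu > SCALE_UP_CPU and current_nodes < MAX_NODES:
        return current_nodes + 1         # scale out under heavy load
    if average_cpu < SCALE_DOWN_CPU and current_nodes > MIN_NODES:
        return current_nodes - 1         # scale in when demand drops
    return current_nodes                 # otherwise keep the pool unchanged

nodes = 2
for cpu in [0.40, 0.80, 0.90, 0.85, 0.30, 0.10]:   # example utilization readings
    nodes = autoscale(nodes, cpu)
    print(f"cpu={cpu:.2f} -> {nodes} node(s)")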
*Cloud services and platforms: Compute services*
Compute services are a fundamental component of cloud computing platforms. They
provide the necessary computing resources to run applications, process data, and perform
various computational tasks. Here are some prominent compute services offered by cloud
providers:

1. Amazon EC2 (Elastic Compute Cloud): EC2 is a web service provided by Amazon Web
Services (AWS) that offers resizable virtual servers in the cloud. It allows users to rent
virtual machines (EC2 instances) and provides flexibility in terms of instance types,
operating systems, and configurations. EC2 instances can be rapidly scaled up or down
based on demand, offering a highly scalable compute infrastructure.
2. Microsoft Azure Virtual Machines: Azure Virtual Machines provide users with on-
demand, scalable computing resources in the Microsoft Azure cloud. Users can deploy
virtual machines with various operating systems and configurations, choosing from a
wide range of instance types to meet their specific requirements.
3. Google Compute Engine: Compute Engine is the Infrastructure as a Service (IaaS)
offering of Google Cloud Platform (GCP). It allows users to create and manage virtual
machines with customizable configurations, including options for various CPU and
memory sizes. Compute Engine provides scalable and flexible compute resources in the
Google Cloud environment.
4. IBM Virtual Servers: IBM Cloud offers Virtual Servers, which are scalable and
customizable compute resources. Users can choose from a variety of instance types,
including bare metal servers, virtual machines, and GPU-enabled instances. IBM Virtual
Servers provide the flexibility to customize network and storage configurations according
to specific workload needs.
5. Oracle Compute: Oracle Cloud Infrastructure (OCI) provides compute services through
Oracle Compute, allowing users to provision and manage virtual machines in the Oracle
Cloud. It offers a range of compute shapes, including general-purpose instances,
memory-optimized instances, and GPU instances, enabling users to optimize their
compute resources for different workloads.

These compute services provide the necessary infrastructure to deploy and manage
applications, whether they require simple virtual machines or more specialized instances.
They offer scalability, flexibility, and on-demand provisioning, allowing users to scale
their compute resources up or down based on workload demands. Additionally, these
services often integrate with other cloud services like storage, networking, and databases,
enabling users to build comprehensive cloud-based solutions
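
As a concrete illustration of on-demand provisioning with one of the compute services
above, the sketch below uses the AWS SDK for Python (boto3) to launch a single EC2
instance. The AMI ID is a placeholder, and configured AWS credentials, a region, and
the necessary permissions are assumed; this is a minimal sketch, not a production
deployment script.

# Hedged sketch: launch one EC2 instance with boto3 (placeholder AMI ID).
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # assumed region

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="t3.micro",           # small general-purpose instance type
    MinCount=1,
    MaxCount=1,
)

instance_id = response["Instances"][0]["InstanceId"]
print("Launched EC2 instance:", instance_id)

Azure Virtual Machines, Google Compute Engine, IBM Virtual Servers, and Oracle
Compute expose the same kind of operation through their own SDKs and APIs.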

*Storage services*
1. Amazon S3 (Simple Storage Service): Amazon S3 is a highly scalable object storage
service provided by AWS. It allows users to store and retrieve any amount of data from
anywhere on the web. S3 provides high durability, availability, and low latency access to
data. It is commonly used for backup and restore, data archiving, content distribution,
and hosting static websites.
2. Azure Blob Storage: Azure Blob Storage is a scalable object storage service in Microsoft
Azure. It offers high availability, durability, and global accessibility for storing large
amounts of unstructured data, such as documents, images, videos, and log files. Blob
Storage provides various storage tiers to optimize costs based on data access patterns.
3. Google Cloud Storage: Google Cloud Storage is a scalable and secure object storage
service in Google Cloud Platform (GCP). It provides a simple and cost-effective solution
for storing and retrieving unstructured data. Google Cloud Storage offers multiple storage
classes, including multi-regional, regional, and nearline, to meet different performance
and cost requirements.
4. IBM Cloud Object Storage: IBM Cloud Object Storage is a scalable and secure storage
service offered by IBM Cloud. It provides durable and highly available storage for storing
large volumes of unstructured data. IBM Cloud Object Storage supports different storage
tiers, data encryption, and integration with other IBM Cloud services.

*Application Services*
1. AWS Lambda: AWS Lambda is a serverless compute service provided by AWS. It allows
developers to run code without provisioning or managing servers. Lambda functions can
be triggered by various events, such as changes in data, API calls, or scheduled events. It
is commonly used for building event-driven architectures, data processing, and executing
small, self-contained tasks.
2. Azure Functions: Azure Functions is a serverless compute service in Microsoft Azure. It
enables developers to run event-triggered code in a serverless environment. Azure
Functions supports multiple programming languages and integrates with various Azure
services, making it suitable for building event-driven applications, data processing
pipelines, and microservices.
3. Google Cloud Functions: Google Cloud Functions is a serverless compute service in
GCP. It allows developers to write and deploy event-driven functions that automatically
scale based on demand. Cloud Functions can be triggered by various events from Google
Cloud services, HTTP requests, or Pub/Sub messages.
4. IBM Cloud Functions: IBM Cloud Functions is a serverless compute service offered by
IBM Cloud. It allows developers to run event-driven functions in a serverless
environment. IBM Cloud Functions supports multiple programming languages and
integrates with other IBM Cloud services, making it suitable for building serverless
applications and event-driven architectures.

These storage services and application services provided by cloud computing platforms
offer scalable, reliable, and cost-effective solutions for data storage, processing, and
application development. They enable organizations to leverage the benefits of cloud
computing while reducing the burden of managing infrastructure and focusing more on
their core business goals.
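
To illustrate how the application services and storage services above fit together, here is
a hedged sketch of an AWS Lambda-style handler that stores its triggering event in S3.
The bucket name and event shape are assumed placeholders, and the function is meant to
be deployed to (and invoked by) the Lambda runtime rather than run directly; Azure
Functions and Google Cloud Functions use similar but not identical handler conventions.

# Hedged serverless sketch: an event-driven handler that writes its payload to S3.
import json
import boto3

s3 = boto3.client("s3")                       # client reused across invocations

def handler(event, context):
    # Triggered by an event (e.g., an API call); stores the payload in S3.
    key = f"events/{context.aws_request_id}.json"
    s3.put_object(
        Bucket="example-bucket",              # placeholder bucket name
        Key=key,
        Body=json.dumps(event).encode("utf-8"),
    )
    return {"statusCode": 200, "body": f"stored {key}"}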
