Unit-V: Coordination and Agreement

The document discusses coordination and agreement in distributed systems. It covers topics such as the need for agreement among distributed processes, coping with failures through unreliable failure detectors and simple failure detection algorithms, requirements for distributed mutual exclusion algorithms like safety and liveness, and examples of mutual exclusion algorithms including the centralized server algorithm, ring-based algorithm, and Lamport's algorithm which uses logical clocks and message ordering.

UNIT-V

COORDINATION AND AGREEMENT

Introduction:
• It is generally important that the processes within a distributed system reach some
sort of agreement
• Agreement may be as simple as the goal of the distributed system:
• Has the general task been aborted?
• Should the main aim be changed?
• This is more complicated than it sounds, since all the processes must not only
agree, but be confident that their peers agree
No Fixed Master
• We will also look at dynamic agreement on a master or leader process, i.e. an
election, generally held after the current master has failed.
• We saw in the Time and Global State section that some algorithms required a
global master/nominee, but there was no requirement for that master/nominee
process to be fixed
• With a fixed master process, agreement is made much simpler
• However, it then introduces a single point of failure
• So here we generally assume no fixed master process
Synchronous vs Asynchronous
• The synchronous/asynchronous distinction is important here too
• Synchronous systems allow us to determine important bounds on message
transmission delays
• This allows us to use timeouts to detect message failure in a way that cannot be
done for asynchronous systems.

Coping with Failures

• In this part we will consider the presence of failures. Recall from our
Fundamentals part the three decreasingly benign failure models:
• Assume no failures occur
• Assume omission failures may occur: both process and message-delivery
omission failures
• Assume arbitrary failures may occur, both at a process and through message
corruption whilst in transit
Failure Detectors
Unreliable Failure Detectors
• A failure detector answers queries about whether a given process has failed
• A detector that responds with only the hint “Suspected” or “Unsuspected” is
termed an “unreliable failure detector”
A simple algorithm
• Assume that all messages are delivered within some bound, say D seconds
• Then we can implement a simple failure detector as follows:
• Every process p sends a “p is still alive” message to all failure detector processes,
periodically, once every T seconds
• If a failure detector process does not receive a message from process q within T + D
seconds of the previous one, then it marks q as “Suspected”
• If we choose our bound D too high, then a failed process will often still be marked as
“Unsuspected”
• A synchronous system has a known bound on the message delivery time and the
clock drift, and hence can implement a reliable failure detector
• An asynchronous system could give one of three answers: “Unsuspected”,
“Suspected” or “Failed”, by choosing two different values of D
• In fact we could instead respond to queries about process p with the probability that
p has failed, if we have a known distribution of message transmission times. E.g., if
you know that 90% of messages arrive within 2 seconds, and it has been two
seconds since your last expected message, you can (naively) conclude there is a 90%
chance that p has failed, since only 10% of messages take longer than 2 seconds
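The T + D timeout rule above can be sketched as a small class. This is an illustrative sketch, not a production detector: the class name, the process names, and the T and D values are all our own choices.

```python
# Illustrative sketch of the simple failure detector: a process is
# "Suspected" if no heartbeat has arrived within T + D seconds of the
# previous one. Times are passed in explicitly to keep the sketch
# deterministic; a real detector would consult a clock.
class SimpleFailureDetector:
    def __init__(self, T, D):
        self.T = T                 # heartbeat period (seconds)
        self.D = D                 # assumed message delivery bound
        self.last_heard = {}       # process id -> time of last heartbeat

    def heartbeat(self, pid, now):
        """Record a 'p is still alive' message received at time `now`."""
        self.last_heard[pid] = now

    def query(self, pid, now):
        """Return 'Suspected' if no heartbeat within T + D seconds."""
        last = self.last_heard.get(pid)
        if last is None or now - last > self.T + self.D:
            return "Suspected"
        return "Unsuspected"

fd = SimpleFailureDetector(T=5, D=2)
fd.heartbeat("p1", now=0)
print(fd.query("p1", now=3))    # within T + D = 7 -> Unsuspected
print(fd.query("p1", now=10))   # 10 > 7 -> Suspected
```

Note that choosing D small makes the detector suspect slow-but-alive processes; choosing it large delays suspicion of genuinely failed ones, exactly the trade-off described above.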
Distributed Mutual Exclusion
• As in operating systems, we sometimes want mutual exclusion: that is, we don’t
want more than one process accessing a resource at a time.
• In operating systems we solved the problem by using either semaphores (P and V
operations) or atomic actions.
• In a distributed system we want to be sure that when we access a resource,
nobody else is accessing it.
• A critical section is a piece of code where we want mutual exclusion.
Requirements for Mutex Algorithms
• Safety: At most one process may execute in the critical section at a time.
• Liveness: Requests to enter and exit the critical section eventually succeed.
The liveness condition implies freedom from both deadlock and starvation. A
deadlock would involve two or more of the processes becoming stuck indefinitely.
Starvation is indefinite postponement of entry for a process that has requested it.
• Ordering: Access is granted in the order given by the happened-before relation on
requests. This is desirable so that processes can coordinate their actions: a process
might be waiting for access while communicating with other processes.
The Central Server Algorithm
The central server algorithm can be seen as a token-based algorithm. The system
maintains one token and makes sure that only one process at a time holds that token.
A process may not enter the C.S. unless it holds the token.
• A process requests entry to the C.S. by asking the central server.
• The central server grants access to only one process at a time.
• The central server maintains a queue of requests and grants them in the order in
which they were sent.
• Main disadvantage: Single point of failure.
• Main advantage: Low communication overhead.
Example:
• Process p1 does not currently require entry to the CS
• Process p2’s request has been appended to the queue, which already contained
p4’s request
• Process p3 exits the CS
• The server removes p4’s entry and grants permission to enter to p4 by replying
to it
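The server’s behaviour (one token, FIFO queue of requests) can be sketched as follows. The class and method names are hypothetical; real implementations would exchange request/grant/release messages over the network.

```python
from collections import deque

# Sketch of the central-server mutual-exclusion algorithm: the server
# owns the single token, grants it to one requester at a time, and
# queues the rest in arrival (FIFO) order.
class CentralServer:
    def __init__(self):
        self.holder = None        # process currently holding the token
        self.queue = deque()      # pending requests, FIFO

    def request(self, pid):
        """A process asks to enter the CS; True means granted at once."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)    # must wait its turn
        return False

    def release(self, pid):
        """Holder exits the CS; token passes to the next queued process."""
        assert self.holder == pid, "only the holder may release"
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder        # who (if anyone) is granted next

server = CentralServer()
print(server.request("p4"))   # True: token was free, p4 enters
print(server.request("p2"))   # False: p2 queued behind p4
print(server.release("p4"))   # p2 is granted next
```

Entry costs one request and one grant message (the round-trip noted below); exit costs a single release message.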
Performance of the Central Server Algorithm
Entering the CS:
• It takes 2 messages: a request followed by a grant
• It delays the requesting process (client) by the time for this round-trip
Exiting the CS:
• It takes 1 release message
• Assuming asynchronous message passing, this does not delay the exiting
process
• Synchronization delay: time taken for a round-trip (a release msg to the server, followed
by a grant msg to the next process to enter the CS)
• The server may become a performance bottleneck for the system as a whole
A Ring-Based Algorithm
Logical ring: one of the simplest ways to achieve mutual exclusion between N processes
without requiring an additional process
The ring topology may be unrelated to the physical interconnections between
the underlying computers
Basic idea: exclusion is conferred by obtaining a token, in the form of a message passed
from process to process in a single direction around the ring
Algorithm
• If a process does not require entry to the CS when it receives the token, then it
immediately forwards the token to its neighbour
• A process that requires entry waits until it receives the token, then retains it while
in the CS
• To exit the CS, the process sends the token on to its neighbour
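The token-passing rule above can be sketched with a few linked objects. This is a toy single-threaded model (the "network" is a direct method call); all names are illustrative.

```python
# Toy sketch of the ring-based algorithm: the token travels in one
# direction around the ring and stops at the first process that wants
# the CS. Forwarding is modeled as a direct call to the neighbour.
class RingProcess:
    def __init__(self, pid):
        self.pid = pid
        self.wants_cs = False
        self.has_token = False
        self.neighbour = None     # next process around the ring

    def receive_token(self):
        self.has_token = True
        if not self.wants_cs:
            self.forward_token()  # not interested: pass it on at once

    def forward_token(self):
        self.has_token = False
        self.neighbour.receive_token()

# Build a 3-process ring p0 -> p1 -> p2 -> p0
procs = [RingProcess(i) for i in range(3)]
for i, p in enumerate(procs):
    p.neighbour = procs[(i + 1) % len(procs)]

procs[2].wants_cs = True      # only p2 wants the CS
procs[0].receive_token()      # inject the token at p0
print(procs[2].has_token)     # True: the token stopped at p2
```

In this toy model the token would circulate forever if no process wanted the CS, which is exactly the bandwidth cost discussed in the performance section below.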
Performance of the Ring-Based Algorithm
The algorithm continuously consumes network bandwidth, except when a process is inside
the critical section
• The processes send messages around the ring even when no process requires entry to
the CS
• The delay experienced by a process requesting entry to the CS is between 0 messages
(when it has just received the token) and N messages (when it has just passed on the
token)
• To exit the CS requires only one message
• The synchronization delay between one process’s exit from the CS and the next
process’s entry is anywhere from 1 to N message transmissions
Non-token based algorithms
Two or more successive rounds of messages are exchanged among the processes to
determine which process will enter the CS next

• A process enters the CS when an assertion, defined on its local variables, becomes true

• Mutual exclusion is enforced because the assertion becomes true only at one site at any
given time

Lamport’s Algorithm
• Requires communication channels to deliver messages in FIFO order
• Satisfies conditions ME1, ME2 and ME3 (safety, liveness and ordering)
• Based on Lamport logical clocks: timestamped requests for entering the CS
• Every process pi keeps a queue, request_queuei, which contains mutual exclusion
requests ordered by their timestamps
• IDEA: the algorithm executes CS requests in the increasing order of timestamps
• Timestamp: (clock value, id of the process)
Requesting the CS
Process pi updates its local clock and timestamps the request (tsi)
Process pi broadcasts a REQUEST(tsi, i) to all the other processes
Process pi places the request on request_queuei
On Receiving REQUEST(tsi, i) from a process pi
Process pj places pi’s request on request_queuej
Process pj returns a timestamped REPLY msg to pi
Executing the CS
Process pi enters the CS when the following two conditions hold:
‣ L1: pi has received a msg with timestamp larger than (tsi, i) from all other processes
‣ L2: pi’s request is at the top of request_queuei
Releasing the CS
Process pi removes its request from the top of request_queuei
Process pi broadcasts a timestamped RELEASE msg to all other processes
On Receiving RELEASE from a process pi
Process pj removes pi’s request from its request queue request_queuej
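The request queue and the two entry conditions L1 and L2 can be sketched for a single process pi. This is a partial, illustrative model: message passing is abstracted into method calls, and all class and attribute names are ours.

```python
import heapq

# Partial sketch of Lamport's mutual-exclusion algorithm at one
# process: a heap of (timestamp, pid) requests ordered by timestamp,
# and the entry test combining conditions L1 and L2.
class LamportProcess:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.queue = []        # heap of (timestamp, pid) requests
        self.last_seen = {}    # pid -> timestamp of last msg from pid

    def request_cs(self):
        self.clock += 1
        self.my_request = (self.clock, self.pid)
        heapq.heappush(self.queue, self.my_request)
        # ...here pi would broadcast REQUEST(ts, pid) to all others...

    def on_message(self, ts, sender):
        """Any timestamped msg (REQUEST or REPLY) advances the clock."""
        self.clock = max(self.clock, ts) + 1
        self.last_seen[sender] = ts

    def on_request(self, ts, sender):
        heapq.heappush(self.queue, (ts, sender))
        self.on_message(ts, sender)

    def can_enter(self, others):
        """L1: a later-timestamped msg received from every other
        process; L2: pi's own request is at the top of the queue."""
        l1 = all((self.last_seen.get(o, -1), o) > self.my_request
                 for o in others)
        l2 = bool(self.queue) and self.queue[0] == self.my_request
        return l1 and l2

p1 = LamportProcess(1)
p1.request_cs()             # p1's request is timestamped (1, 1)
p1.on_request(2, 2)         # p2's later request joins the queue
print(p1.can_enter([2]))    # True: (2, 2) > (1, 1), (1, 1) heads the queue
```

Ties in timestamps are broken by process id, which is why requests are compared as (timestamp, pid) pairs.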
Entering a CS
• p1 and p2 send out REQUEST messages for the CS to the other processes
• Both p1 and p2 have received timestamped REPLY msgs from all processes
Exiting a CS
• p1 exits and sends RELEASE msgs to all other processes
• p2 then enters the CS

Ricart and Agrawala’s Idea


Basic idea:
• processes that require entry to a CS multicast a request message
• processes can enter the CS only when all the other processes have replied to this
message
• node pj does not need to send a REPLY to node pi if pj has a request with timestamp
lower than the request of pi (since pi cannot enter before pj anyway in this case)
• Does NOT require communication channels to be FIFO
• Each process pi keeps a Lamport clock, updated according to LC1 and LC2
• Messages requesting entry are of the form <T, pi>, where T is the sender’s timestamp
and pi is the sender’s identifier
• Every process records its state of being outside the CS (RELEASED), wanting entry
(WANTED) or being in the CS (HELD) in a variable state
Algorithm

Example:

• p3 not interested in entering the CS


• p1 and p2 request it concurrently
• The timestamp of p1’s request is 41, that of p2 is 34.
• When p3 receives their requests, it replies immediately
• When p2 receives p1’s request, it finds its own request has the lower timestamp (34 <
41), and so does not reply, holding p1 off
• However, p1 finds that p2’s request has a lower timestamp than that of its own request
(34 < 41) and so replies immediately
• On receiving the 2nd reply, p2 can enter the CS
• When p2 exits the CS, it will reply to p1’s request and so grant it entry
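The decision of whether pj defers its reply can be written as a single function. This is a sketch of the reply rule only (not the full protocol); the function name and state strings follow the RELEASED/WANTED/HELD states above.

```python
# Sketch of the Ricart–Agrawala reply rule at process pj: defer the
# REPLY while pj is in the CS, or while pj wants the CS with an
# earlier (timestamp, pid) request; otherwise reply immediately.
def should_defer(state, my_request, incoming_request):
    """my_request / incoming_request are (timestamp, pid) pairs;
    my_request may be None when state is RELEASED."""
    if state == "HELD":
        return True                # in the CS: hold everyone off
    if state == "WANTED" and my_request < incoming_request:
        return True                # my request is earlier: defer reply
    return False                   # reply at once

# The example from the text: p2's request has ts 34, p1's has ts 41.
print(should_defer("WANTED", (34, 2), (41, 1)))   # True: p2 holds p1 off
print(should_defer("WANTED", (41, 1), (34, 2)))   # False: p1 replies now
print(should_defer("RELEASED", None, (41, 1)))    # False: p3 replies at once
```

Tuple comparison gives the tie-break for equal timestamps: (34, 2) < (41, 1) because 34 < 41, and (5, 1) < (5, 2) because the smaller pid wins.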
ELECTIONS
Many algorithms used in distributed systems require a coordinator that performs
functions needed by the other processes in the system. Election algorithms are
designed to choose a coordinator.
Election algorithms choose a process from a group of processes to act as a
coordinator. If the coordinator process crashes for some reason, then a new
coordinator is elected on another processor. An election algorithm essentially
determines where a new copy of the coordinator should be restarted.
Election algorithms assume that every active process in the system has a unique
priority number. The process with the highest priority will be chosen as the new
coordinator. Hence, when a coordinator fails, the algorithm elects the active process
with the highest priority number, and this number is then sent to every active process
in the distributed system.
We have two election algorithms, for two different configurations of distributed
system.
1. The Bully Algorithm –
This algorithm applies to systems where every process can send a message to every
other process in the system.
Suppose process P sends a message to the coordinator.
1. If the coordinator does not respond within a time interval T, then it is assumed
that the coordinator has failed.
2. Process P then sends an election message to every process with a higher priority
number.
3. It waits for responses; if no one responds within time interval T, then process P
elects itself as coordinator.
4. It then sends a message to all processes with lower priority numbers announcing
that it has been elected as their new coordinator.
5. However, if an answer is received within time T from some other process Q:
(I) Process P waits for a further time interval T’ to receive a message from Q
announcing that Q has been elected as coordinator.
(II) If Q doesn’t respond within time interval T’, then Q is assumed to have failed
and the algorithm is restarted.
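The cascade of election messages to higher-numbered processes can be condensed into a toy function. All timeouts and messages are abstracted into a single `alive` set, so this only illustrates the outcome of the steps above, not the messaging itself.

```python
# Toy round of the bully algorithm: processes are identified by their
# priority numbers; `alive` is the set of processes that would answer
# an election message. The highest-numbered live process wins.
def bully_election(initiator, alive):
    higher = [p for p in alive if p > initiator]
    if not higher:
        return initiator          # nobody higher answered: elect self
    # A higher process answered; it takes over and runs its own
    # election (modeled here as a recursive call).
    return bully_election(min(higher), alive)

# Processes 1..5 exist, but 5 (the old coordinator) has crashed.
print(bully_election(2, alive={1, 2, 3, 4}))   # 4 becomes coordinator
```

The name "bully" comes from exactly this behaviour: the highest-numbered live process always bullies its way into the coordinator role.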
2. The Ring Algorithm –
This algorithm applies to systems organized as a ring (logically or physically). In this
algorithm we assume that the links between processes are unidirectional, and that
every process can send messages only to the process on its right. The data structure
this algorithm uses is the active list: a list holding the priority numbers of all active
processes in the system.
Algorithm –
1. If a process P detects a coordinator failure, it creates a new active list, which is
initially empty, adds its own priority number to the list, and sends an election
message to its neighbour on the right.
2. When a process receives an election message from its left neighbour, it responds
in one of two ways:
(I) If its own priority number is not in the active list, it appends its number and
forwards the message to its right neighbour.
(II) If its own priority number is already in the active list, the message has travelled
all the way around the ring, so the list now contains the numbers of all active
processes. The process then picks the highest priority number from the list and
elects that process as the new coordinator.
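The circulation of the active list can be sketched as a single loop over the ring. The ring contents and function name below are illustrative.

```python
# Sketch of the ring election: the election message carries an active
# list of priority numbers around the ring; when it returns to the
# initiator (its number is already in the list), the highest number
# in the list becomes the new coordinator.
def ring_election(ring, initiator_index):
    """`ring` is a list of priority numbers in ring order."""
    active = []
    i = initiator_index
    while ring[i] not in active:      # circulate until the list closes
        active.append(ring[i])
        i = (i + 1) % len(ring)       # forward to the right neighbour
    return max(active)                # highest priority wins

# Priorities 3, 1, 4, 2 arranged in a ring; the process at index 1
# (priority 1) detects the coordinator failure and starts the election.
print(ring_election([3, 1, 4, 2], initiator_index=1))   # 4
```

Note that any process may start the election; the message simply keeps accumulating numbers until it arrives back at a process whose number is already present.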
MULTICAST COMMUNICATION
Group (multicast) communication: each member of a group of processes receives a
copy of the messages sent to the group, often with delivery guarantees
· Guarantees on the set of messages that every process of the group should receive
· Guarantees on the delivery ordering across the group members
Challenges
· Efficiency concerns include minimizing overhead activities and increasing
throughput and bandwidth utilization
· Delivery guarantees ensure that operations are completed
Types of group
· Static or dynamic: whether processes may join or leave the group
· Closed or open: a group is said to be closed if only members of the group can
multicast to it
Reliable Multicast
Simple basic multicast (B-multicast) sends a message to every process that is a
member of a defined group:
• B-multicast(g, m): for each process p ∈ group g, send(p, m)
• On receive(m) at p: B-deliver(m) at p
Reliable multicast (R-multicast) requires these properties:
• Integrity: a correct process delivers a message at most once, and only if some
process in the group actually multicast it
• Validity: if a correct process multicasts a message, it will eventually deliver it
• Agreement: if a message is delivered to a correct process, all other correct
processes in the group will deliver it
Types of message ordering
Three types of message ordering:
• FIFO (first-in, first-out) ordering: if a correct process delivers a message before
another, every correct process will deliver the first message before the other
• Causal ordering: if one message happened-before another, any correct process
that delivers the second message will deliver the first message first
• Total ordering: if a correct process delivers a message before another, any other
correct process that delivers the second message will deliver the first message first
TRANSACTIONS & REPLICATIONS:
Introduction
A distributed transaction is a set of operations on data that is performed across two or
more data repositories (especially databases). It is typically coordinated across separate
nodes connected by a network, but may also span multiple databases on a single server.
There are two possible outcomes:
1) all operations successfully complete, or
2) none of the operations are performed at all due to a failure somewhere in the system.
ACID is most commonly associated with transactions on a single database server, but
distributed transactions extend that guarantee across multiple databases.
The operation known as a “two-phase commit” (2PC) is a form of a distributed transaction.
“XA transactions” are transactions using the XA protocol, which is one implementation of a
two-phase commit operation.

A transaction starts by initialising things, then reads and/or modifies objects.
At the end, one of two things happens:
• Commit: changes are saved, resources are released, and the state is consistent
• Abort: for an outsider, nothing happened
Two phase Commit
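The two phases can be condensed into a toy coordinator function: collect votes, then broadcast the decision. Workers are modeled as vote functions; all names are illustrative, and real 2PC additionally logs state for recovery (see Transaction Recovery below).

```python
# Toy sketch of two-phase commit at the coordinator. Each worker is a
# zero-argument function returning True (vote Yes) or False (vote No).
def two_phase_commit(workers):
    # Phase 1 (voting): ask every worker whether it can commit
    votes = [w() for w in workers]
    # Commit only if every worker voted Yes; otherwise abort
    decision = "commit" if all(votes) else "abort"
    # Phase 2 (completion): the decision is broadcast to every worker
    return decision

print(two_phase_commit([lambda: True, lambda: True]))    # commit
print(two_phase_commit([lambda: True, lambda: False]))   # abort
```

A single No vote aborts the whole distributed transaction, which is what gives the all-or-nothing outcome described above.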

Group Communication
Communication between two processes in a distributed system is required to exchange
various data, such as code or a file, between the processes. When one source process tries
to communicate with multiple processes at once, it is called Group Communication.
A group is a collection of interconnected processes with abstraction. This
abstraction is to hide the message passing so that the communication looks like a normal
procedure call. Group communication also helps the processes from different hosts to work
together and perform operations in a synchronized manner, therefore increases the overall
performance of the system.

Types of Group Communication in a Distributed System :


1. Broadcast Communication :
The host process communicates with every process in the distributed system at the
same time. Broadcast communication comes in handy when a common stream of
information is to be delivered to each and every process in the most efficient manner
possible. Since no per-recipient addressing or processing is required, communication is
very fast in comparison to other modes of communication. However, it does not
support a large number of processes and cannot treat a specific process individually.
A broadcast Communication: P1 process communicating with every process in the system
2. Multicast Communication :
The host process communicates with a designated group of processes in the
distributed system at the same time. This technique is mainly used to address the
problem of high workload on the host system and of redundant information being sent
to every process in the system. Multicasting can significantly decrease the time taken
for message handling.

A multicast Communication: P1 process communicating with only a group of the process in the system

3. Unicast Communication :
The host process communicates with a single process in the distributed system at a
time (though the same information may be passed to multiple processes). This works
best for two communicating processes, since only one specific process has to be
addressed. However, it incurs overhead, as the exact process must be located before
the information/data can be exchanged.
A unicast Communication: P1 process communicating with only P3 process

Concurrency Control in Distributed Transactions


Concurrency control is provided in a database to:

(i) enforce isolation among transactions.


(ii) preserve database consistency through consistency preserving execution of
transactions.
(iii) resolve read-write and write-read conflicts.
Various concurrency control techniques are:
1. Two-phase locking Protocol
2. Time stamp ordering Protocol
3. Multi version concurrency control
4. Validation concurrency control
1. Two-Phase Locking Protocol: Locking is an operation which secures permission to
read or permission to write a data item. Two-phase locking is a process used to gain
ownership of shared resources without creating the possibility of deadlock.
The 3 activities taking place in the two phase update algorithm are:

(i). Lock Acquisition


(ii). Modification of Data
(iii). Release Lock
Two phase locking prevents deadlock from occurring in distributed systems by releasing
all the resources it has acquired, if it is not possible to acquire all the resources required
without waiting for another process to finish using a lock. This means that no process is
ever in a state where it is holding some shared resources, and waiting for another process
to release a shared resource which it requires. This means that deadlock cannot occur due
to resource contention.
A transaction in the Two Phase Locking Protocol can assume one of the 2 phases:
(i) Growing Phase: In this phase a transaction can only acquire locks but cannot release
any lock. The point when a transaction acquires all the locks it needs is called the Lock
Point.
(ii) Shrinking Phase: In this phase a transaction can only release locks but cannot acquire
any.
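The growing/shrinking rule itself can be sketched as a small class that refuses any lock acquisition once the first lock has been released. This is an illustrative sketch of the two-phase rule only; names are ours, and real lock managers also distinguish read and write locks.

```python
# Sketch of the two-phase rule: a transaction may acquire locks only
# while in its growing phase; the first release moves it permanently
# into the shrinking phase.
class TwoPhaseTransaction:
    def __init__(self):
        self.locks = set()
        self.shrinking = False     # False = growing phase

    def acquire(self, item):
        if self.shrinking:
            raise RuntimeError("2PL violated: acquire after release")
        self.locks.add(item)       # growing phase

    def release(self, item):
        self.shrinking = True      # entering the shrinking phase
        self.locks.discard(item)

t = TwoPhaseTransaction()
t.acquire("x")
t.acquire("y")     # lock point: all needed locks are now held
t.release("x")
try:
    t.acquire("z")
except RuntimeError as e:
    print(e)       # 2PL violated: acquire after release
```

The moment just before the first `release` — when all needed locks are held — is the Lock Point described above.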
2. Time Stamp Ordering Protocol:
A timestamp is a tag that can be attached to any transaction or any data item, which
denotes a specific time on which the transaction or the data item had been used in any way.
A timestamp can be implemented in 2 ways. One is to directly assign the current value
of the clock to the transaction or data item. The other is to attach the value of a logical
counter that is incremented as new timestamps are required.
The timestamp of a data item can be of 2 types:
• (i) W-timestamp(X): This means the latest time when the data item X has been written
into.
• (ii) R-timestamp(X): This means the latest time when the data item X has been read
from. These 2 timestamps are updated each time a successful read/write operation is
performed on the data item X.
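The two timestamps support the basic timestamp-ordering checks: a read or write by an "old" transaction is rejected if a "younger" transaction has already conflicted with it. This is a sketch of those checks under the basic rules; the dict representation and function names are our own.

```python
# Sketch of basic timestamp-ordering checks on a data item X, which
# carries R-timestamp(X) and W-timestamp(X). `ts` is the timestamp of
# the transaction issuing the operation.
def can_read(item, ts):
    """Reject the read if a younger transaction already wrote X."""
    if ts < item["w_ts"]:
        return False               # transaction must be restarted
    item["r_ts"] = max(item["r_ts"], ts)
    return True

def can_write(item, ts):
    """Reject the write if a younger transaction already read or wrote X."""
    if ts < item["r_ts"] or ts < item["w_ts"]:
        return False
    item["w_ts"] = ts
    return True

x = {"r_ts": 0, "w_ts": 0}
print(can_write(x, ts=5))   # True: W-timestamp(X) becomes 5
print(can_read(x, ts=3))    # False: an older read arrives after a newer write
```

On rejection the transaction is typically restarted with a fresh (larger) timestamp, so it eventually succeeds.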
3. Multiversion Concurrency Control:
Multiversion schemes keep old versions of data items to increase concurrency.
Multiversion two-phase locking:
Each successful write results in the creation of a new version of the data item written.
Timestamps are used to label the versions. When a read(X) operation is issued, an
appropriate version of X is selected based on the timestamp of the transaction.
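The version-selection step can be sketched directly: a read by a transaction with timestamp ts returns the newest version written at or before ts. The list-of-pairs representation is our own simplification.

```python
import bisect

# Sketch of a multiversion read: versions of X are a list of
# (write-timestamp, value) pairs sorted by timestamp. read(X) by a
# transaction with timestamp ts sees the latest version whose write
# timestamp does not exceed ts, so readers never block writers.
def mv_read(versions, ts):
    w_ts = [w for w, _ in versions]            # just the timestamps
    i = bisect.bisect_right(w_ts, ts) - 1      # newest version <= ts
    return versions[i][1] if i >= 0 else None  # None: X did not exist yet

versions = [(1, "a"), (5, "b"), (9, "c")]
print(mv_read(versions, ts=7))   # "b": the version written at 5
print(mv_read(versions, ts=0))   # None: no version existed yet
```

A later write at timestamp 9 does not disturb a reader running at timestamp 7, which is precisely how keeping old versions increases concurrency.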
4. Validation Concurrency Control:
The optimistic approach is based on the assumption that the majority of the database
operations do not conflict. The optimistic approach requires neither locking nor time
stamping techniques. Instead, a transaction is executed without restrictions until it is
committed. Using an optimistic approach, each transaction moves through 2 or 3 phases,
referred to as read, validation and write.
• (i) During the read phase, the transaction reads the database, executes the needed
computations and makes the updates to a private copy of the database values. All
update operations of the transaction are recorded in a temporary update file, which is
not accessed by the remaining transactions.
• (ii) During the validation phase, the transaction is validated to ensure that the
changes made will not affect the integrity and consistency of the database. If the
validation test is positive, the transaction goes to the write phase. If the validation test
is negative, the transaction is restarted and the changes are discarded.
• (iii) During the write phase, the changes are permanently applied to the database.
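One common validation test checks the transaction's read set against the write sets of transactions that committed while it was running. This is a sketch of that check only (one of several validation schemes); the set-based representation is our own.

```python
# Sketch of optimistic validation: the transaction's read set must not
# overlap the write set of any transaction that committed after this
# one began, otherwise the private updates may be based on stale data.
def validate(read_set, committed_write_sets):
    """Positive iff no concurrently committed write invalidates a read."""
    for ws in committed_write_sets:
        if read_set & ws:
            return False     # conflict: restart, discard private copy
    return True              # safe: proceed to the write phase

print(validate({"x", "y"}, [{"z"}]))        # True: goes to write phase
print(validate({"x", "y"}, [{"y", "z"}]))   # False: restart the transaction
```

When most transactions touch disjoint data, validation almost always succeeds, which is why the optimistic approach pays off for low-conflict workloads.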
Distributed Dead Locks
The same conditions for deadlock in uniprocessors apply to distributed systems.
Unfortunately, as in many other aspects of distributed systems, they are harder to detect,
avoid, and prevent. Four strategies can be used to handle deadlock:
1. ignorance: ignore the problem; assume that a deadlock will never occur. This is a
surprisingly common approach.

2. detection: let a deadlock occur, detect it, and then deal with it by aborting and later
restarting a process that causes deadlock.

3. prevention: make a deadlock impossible by granting requests so that one of the


necessary conditions for deadlock does not hold.

4. avoidance: choose resource allocation carefully so that deadlock will not occur.
Resource requests can be honored as long as the system remains in a safe (non-
deadlock) state after resources are allocated.

In a distributed system deadlock can neither be prevented nor avoided as the system is so
vast that it is impossible to do so. Therefore, only deadlock detection can be implemented.
The techniques of deadlock detection in the distributed system require the following:
Progress – The method should be able to detect all the deadlocks in the system.
Safety – The method should not detect false or phantom deadlocks.
There are three approaches to detect deadlocks in distributed systems. They are as
follows:
1. Centralized approach –
In the centralized approach, a single node is responsible for detecting deadlock.
The advantage of this approach is that it is simple and easy to implement; the
drawbacks include excessive workload at that one node and a single point of failure
(the whole scheme depends on one node, so if that node fails, deadlock detection
stops), which in turn makes the system less reliable.
2. Distributed approach –
In the distributed approach, different nodes work together to detect deadlocks.
There is no single point of failure, as the workload is equally divided among all nodes,
and the speed of deadlock detection also increases.
3. Hierarchical approach –
This approach is the most advantageous. It is the combination of both centralized and
distributed approaches of deadlock detection in a distributed system. In this approach,
some selected nodes or cluster of nodes are responsible for deadlock detection and
these selected nodes are controlled by a single node.
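In all three approaches, the detection step itself boils down to finding a cycle in a wait-for graph (the centralized approach simply assembles that graph at one node). A minimal cycle check, with an illustrative graph representation:

```python
# Sketch of deadlock detection on a wait-for graph: an edge p -> q
# means process p is waiting for a resource held by q. A deadlock
# exists iff the graph contains a cycle, found here by DFS.
def has_deadlock(wait_for):
    """wait_for: dict mapping each process to the set it waits on."""
    visited, stack = set(), set()

    def dfs(p):
        if p in stack:
            return True            # back edge: a cycle was found
        if p in visited:
            return False           # already explored, no cycle via p
        visited.add(p)
        stack.add(p)
        found = any(dfs(q) for q in wait_for.get(p, ()))
        stack.discard(p)
        return found

    return any(dfs(p) for p in wait_for)

print(has_deadlock({"p1": {"p2"}, "p2": {"p3"}}))   # False: a chain, no cycle
print(has_deadlock({"p1": {"p2"}, "p2": {"p1"}}))   # True: p1 and p2 deadlocked
```

The Safety requirement above corresponds to never reporting a cycle built from stale edges (a phantom deadlock), which is the hard part in the distributed setting and is not modeled in this sketch.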
Transaction Recovery
Atomic Property of Transactions means that the effect of performing a transaction
on behalf of one client is free from interference from concurrent transactions being
performed on behalf of other clients
It requires the effects of all committed transactions reflected in data items, but none
of the effects of incomplete/aborted transactions are reflected in the data items
Two Aspects to consider
Durability - requires that data items are saved in permanent storage and will be available
indefinitely, at the servers, or the sites of storage.
Failure Atomicity - requires that the effects of the transaction are atomic even when the
server fails. These two aspects are not completely independent; both can be handled by
a so-called recovery manager, which is based on a two-phase commit protocol.
Recovery Manager
• Restores the server’s database from Recovery File (RF) after a crash, which needs to be
resilient to media failure - stable storage
• Reorganizes the RF to improve the performance of recovery
• Reclaims storage space in the RF, through the execution of the application
Recovery File (as a log) is used to deal with recovery of a server involved in a distributed
transaction.
The RF contains:
• Trans Id and the status of the transaction - prepared, committed, aborted
• Data items that are part of the transaction and their values
• Intentions List for the transaction
• RF represents a log containing the history of all the transactions performed
• Contains a Checkpoint, a point where the state of database is precisely known.
• Order of entries reflects the order in which transactions have prepared, committed
and aborted
Intentions List
Contains a list of data item names and the positions in the RF where the values of the
data items altered by that transaction reside
When a server is prepared to commit a transaction, the RM must save the intentions
list in the RF, this ensures the server is able to carry out the commitment later, even if it
crashes in the interim
When a transaction is aborted the RM uses the intentions list to delete all the
tentative versions of data items made by that transaction.
Check pointing
• The process of writing the current committed values of a server’s data items to a new
RF, together with transaction status entries and intentions lists of transactions that
have not yet been fully resolved
• Its purpose is to reduce number of transactions to be dealt with during recovery and
reclaim file space
• A checkpoint that fails partway through must itself be recoverable too
Recovery of Two- Phase Commit Protocol
• RMs use two new transaction status values done and uncertain which can be written to
the RF. Both done and uncertain are used when the RF is re-organized
• RM of coordinator uses done to indicate two- phase commit is complete
• RM of worker uses uncertain to indicate the worker has voted Yes but does not know
the outcome
• The RM at coordinator records a coordinator entry - (Trans Id, list of workers) in
coordinator’s RF
• The RM at worker records a worker entry - (Trans Id, coordinator) in worker’s RF
During Phase 1 - Voting
• When coordinator is prepared to commit, its RM writes prepared and a coordinator
entry to RF
• If worker votes Yes, its RM writes prepared, a worker entry and uncertain to the RF
• If worker votes No, its RM writes aborted to the RF
During Phase 2 - Completion
• RM of Coordinator writes either committed or aborted to the RF according to the
decision made
• RMs of Workers write committed or aborted to their RFs depending on message
received from coordinator
• RM of Coordinator writes done to RF when the coordinator has received a
“have committed” message from all its workers
Replication
In the distributed systems research area replication is mainly used to provide fault
tolerance. The entity being replicated is a process.
The basic model for managing replicated data includes the following components:
• Clients issue requests to a front end
• The front end provides transparency by hiding the fact that data is replicated.
• The front end contacts one or more replica managers to retrieve/store the data
• The replica managers interact to ensure that data is consistent

Two replication strategies have been used in distributed systems: Active and Passive
replication.

Active Replication:

• Each client request is processed by all the servers.


• This requires that the process hosted by the servers is deterministic (given the same
initial state and a request sequence, all processes will produce the same response
sequence and end up in the same final state).
• In order to make all the servers receive the same sequence of operations, an atomic
broadcast protocol must be used.
• An atomic broadcast protocol guarantees that either all servers receive a message or
none, plus that they all receive messages in the same order.
Disadvantage:
In practice most real-world servers are non-deterministic. Still, active
replication is the preferable choice when dealing with real-time systems that require
a quick response even in the presence of faults, or with systems that must handle
byzantine faults.
Sequence of Steps:

Passive Replication
• There is only one server that processes client requests.
• After processing a request, the primary server updates the state on the other servers
and sends back the response to the client.
• If the primary server fails, one of the backup servers takes its place.
• This may be used for non-deterministic processes.
• Disadvantage – In case of failure the response is delayed.
Sequence of Steps:

Data Replication
• It is the process of storing data at more than one site or node.
• Ensures availability of data.
• There can be two types of replication:
 Full Replication – A copy of the whole database is stored at every site.
 Partial Replication – Some fragments of the database are replicated at some sites.
