Distributed Systems
Iresh A. Dhotre
M.E. (Information Technology)
Ex-Faculty, Sinhgad College of Engineering,
Pune.
Abhijit D. Jadhav
Ph.D. (Pursuing) (CSE), M. Tech. (CSE)
B.E. (Computer Engineering)
Assistant Professor,
Dr. D. Y. Patil Institute of Technology
Pimpri, Pune.
TECHNICAL PUBLICATIONS®
SINCE 1993 - An Up-Thrust for Knowledge
Distributed Systems
Subject Code : 310245(C)
All publishing rights (printed and ebook version) reserved with Technical Publications. No part of this book
may be reproduced in any form (electronic, mechanical, photocopy or any information storage and
retrieval system) without prior permission in writing from Technical Publications, Pune.
Published by :
TECHNICAL PUBLICATIONS®
Amit Residency, Office No. 1, 412, Shaniwar Peth, Pune - 411030, M.S. INDIA
Ph.: +91-020-24495496/97
Email : [email protected]   Website : www.technicalpublications.org
Printer :
Yogiraj Printers & Binders
Sr. No. 10/1A,
Ghule Industrial Estate, Nanded Village Road,
Tal. - Haveli, Dist. - Pune - 411041.
ISBN 978-93-90770-70-0
Authors
Iresh A. Dhotre
Abhijit D. Jadhav
Dedicated to Students
SYLLABUS
Distributed Systems - 310245(C)
Credit : 03
Examination Scheme :
Mid - Sem (TH) : 30 Marks
End - Sem (TH) : 70 Marks
Unit I Introduction
Defining Distributed Systems, Characteristics, Middleware and Distributed Systems. Design goals :
Supporting resource sharing, Making distribution transparent, Open, Scalable, Pitfalls. Types of
Distributed Systems : High Performance Distributed Computing, Distributed Information Systems,
Pervasive Systems. Architectural styles : Layered architectures, Object based architectures,
Publish Subscribe architectures. Middleware organization : Wrappers, Interceptors, Modifiable
middleware. System architecture : Centralized, Decentralized, Hybrid, Example architectures -
Network File System, Web. (Chapter - 1)
Unit II Communication
Introduction : Layered Protocols, Types of Communication, Remote Procedural Call- Basic RPC
Operation, Parameter Passing, RPC-based application support, Variations on RPC, Example : DCE
RPC, Remote Method Invocation. Message Oriented Communication : Simple Transient
Messaging with Sockets, Advanced Transient Messaging, Message Oriented Persistent
Communication, Examples. Multicast Communication : Application Level Tree-Based
Multicasting, Flooding-Based Multicasting, Gossip-Based Data Dissemination. (Chapter - 2)
Replication and Placement, Content Distribution, Managing Replicated Objects. Consistency
Protocols : Continuous Consistency, Sequential Consistency, Cache Coherence Protocols,
Example : Caching, and Replication in the web. (Chapter - 5)
TABLE OF CONTENTS
Unit - I
Chapter 1 : Introduction 1 - 1 to 1 - 28
Unit - III
Chapter 3 : Synchronization 3 - 1 to 3 - 40
Unit - IV
Chapter 4 : Naming and Distributed File Systems 4 - 1 to 4 - 42
4.7 Case Study : Sun Network File System .................................................................... 4 - 29
4.7.1 NFS Architecture ......................................................................................... 4 - 30
4.7.2 Communication ............................................................................................ 4 - 32
4.7.3 Naming and Mounting ................................................................................. 4 - 32
4.7.4 Caching and Replication .............................................................................. 4 - 35
4.7.5 Advantages and Disadvantages of NFS ..................................................... 4 - 36
4.8 Andrew File System ................................................................................................... 4 - 37
4.9 Multiple Choice Questions with Answers .................................................................. 4 - 40
Unit - V
Chapter 5 : Consistency and Replication 5 - 1 to 5 - 22
5.5.2 Replicated - Write Protocols ........................................................................ 5 - 15
5.5.2.1 Active Replication ......................................................................... 5 - 15
5.5.2.2 Quorum based Protocols .............................................................. 5 - 17
5.6 Caching and Replication in the Web ......................................................................... 5 - 17
5.7 Multiple Choice Questions with Answers .................................................................. 5 - 20
Unit - VI
Chapter 6 : Fault Tolerance 6 - 1 to 6 - 36
UNIT - I
1 Introduction
Syllabus
Contents
1.1 Defining Distributed Systems ........ Oct. - 18, Dec. - 18 ........................ Marks 5
1.2 Design Goals .............................. Oct. - 18, Dec. - 18, May - 19 ........ Marks 5
1.1.1 Characteristics
1. Collection of autonomous computing elements
Distributed systems are often organized as an overlay network, a network built on top of
another network. There are two common overlay networks :
a) Structured overlay where each node has a well-defined set of neighbors it can
communicate with.
b) Unstructured overlay where nodes communicate with a randomly selected set of
nodes.
A well-known class of overlay networks is formed by peer-to-peer networks.
2. Single coherent system
The collection of nodes as a whole operates the same, no matter where, when, and how
interaction between a user and the system takes place.
Examples :
1. An end user cannot tell where a computation is taking place
2. Where data is exactly stored should be irrelevant to an application
3. Whether data has been replicated or not is completely hidden
To support heterogeneous computers and networks while offering a single-system view,
distributed systems are often organized by means of a layer of software that is logically
placed between a higher layer consisting of users and applications and a lower layer
consisting of operating systems.
Middleware is software that lies between an operating system and the applications
running on it. Because this layer is what turns the collection of machines into a single
coherent system, a distributed system is sometimes said to be organized as middleware.
An example of a distributed system would be the World Wide Web where there are
multiple components under the hood that help browsers display content but from a user’s
point of view, all they are doing is accessing the web via a browser.
Resource management : Middleware offers services that can also be found in most
operating systems, including :
a. Security services
b. Accounting services
c. Masking of and recovery from failures
d. Facilities for inter-application communication
Examples of middleware services :
1. Communication : A commonly used communication service is the Remote Procedure
Call (RPC).
2. Transactions : Middleware can offer to execute a group of services in an all-or-nothing
fashion, commonly referred to as an atomic transaction.
3. Service composition : Web-based middleware can help by standardizing the way web
services are accessed and providing the means to compose their functions in a specific
order.
4. Reliability : Reliability is the ability for a system to remain available over a period of
time. Reliable systems are those that can continuously perform their core functions
without service disruptions, errors, or significant reductions in performance.
The main goal of a distributed system is to connect users and resources in a transparent,
open and scalable way. The principal design goals are :
1. Supporting resource sharing
2. Making distribution transparent
3. Openness
4. Scalability
Distribution transparency itself takes several forms, for example :
1. Failure transparency : Users and applications can complete their tasks despite the
failure of hardware and software components, e.g. email.
2. Mobility transparency : Allows the movement of resources and clients within a system
without affecting the operation of users and programs, e.g. mobile phone.
3. Performance transparency : Allows the system to be reconfigured to improve
performance as loads vary.
4. Scaling transparency : Allows the system and applications to expand in scale without
change to the system structure or the application algorithms.
Sr. No.   Transparency    Description
1.        Access          Hide differences in data representation and how a resource is accessed.
2.        Location        Hide where a resource is located.
3.        Migration       Hide that a resource may move to another location.
4.        Relocation      Hide that a resource may be moved to another location while in use.
5.        Replication     Hide that a resource is replicated.
6.        Concurrency     Hide that a resource may be shared by several competitive users.
7.        Failure         Hide the failure and recovery of a resource.
8.        Persistence     Hide whether a (software) resource is in memory or on disk.
1.2.3 Open
Openness means that the system can be easily extended and modified. Openness refers to
the ability to plug and play. You can, in theory, have two equivalent services that follow the
same interface contract, and interchange one with the other.
The integration of new components means that they have to be able to communicate with
some of the components that already exist in the system. Openness and distribution are
related. Distributed system components achieve openness by communicating using well-
defined interfaces.
If the well-defined interfaces for a system are published, it is easier for developers to add
new features or replace sub-systems in the future.
Open systems can easily be extended and modified. New components can be integrated
with existing components.
Differences in data representation or interface types on different processors have to be
resolved. An open system can be constructed from heterogeneous hardware and software;
what matters is that the detailed interfaces of components are published, well defined and
well documented, so that new components can be integrated with existing ones.
The system needs to have a stable architecture so that new components can be easily
integrated while preserving previous investments.
An open distributed system offers services according to standard rules that describe the
syntax and semantics of those services.
1.2.4 Scalable
A system is said to be scalable if it can handle the addition of users and resources without
suffering a noticeable loss of performance or increase in administrative complexity.
Scalability is the ability to accommodate any future growth, whether expected or not.
Distributed system architectures achieve scalability by employing more than one host :
additional computers can be added in order to host additional components.
1. In size : Dealing with large numbers of machines, users, tasks.
2. In location : Dealing with geographic distribution and mobility.
3. In administration : Addressing data passing through different regions of ownership.
The design of scalable distributed systems presents the following challenges :
1. Controlling the cost of resources.
2. Controlling the performance loss.
3. Preventing software resources from running out.
4. Avoiding performance bottlenecks.
Controlling the cost of physical resources : The quantity of servers and other physical
resources should grow no faster than in proportion to the number of users.
Controlling the performance loss : Hierarchic structures such as the DNS scale better than
linear structures and keep the time to access structured data low.
Preventing software resources from running out : The Internet's 32-bit (IPv4) addresses
will run out; the 128-bit (IPv6) addresses solve this at the cost of extra space in messages.
Avoiding performance bottlenecks : The DNS name table was originally kept in a single
master file; partitioning it between servers removed this bottleneck.
Example : File system scalability is defined as the ability to support very large file systems,
large files, large directories and large numbers of files while still providing I/O
performance. Google file system aims at efficiently and reliably managing many extremely
large files for many clients, using commodity hardware.
Various techniques such as replication, caching and cache memory management and
asynchronous processing help to achieve scalability.
Scaling techniques
1. Hiding communication latencies : Examples would be asynchronous communication
as well as pushing code down to clients (e.g. Java applets and JavaScript).
2. Distribution : Taking a component, splitting into smaller parts, and subsequently
spreading them across the system.
3. Replication : Replicating components increases availability, helps balance the load
leading to better performance, helps hide latencies for geographically distributed
systems. Caching is a special form of replication.
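As a small illustration of caching as a form of replication, here is a minimal, hedged sketch
in Java (the class name, method names and the remoteLookup stand-in are illustrative, not
from the text) :

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch : a client-side cache that answers repeated lookups
// locally instead of contacting the remote server each time.
public class CachingProxy {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // remoteLookup stands in for an expensive call across the network.
    private String remoteLookup(String key) {
        return "value-of-" + key;
    }

    public String lookup(String key) {
        // The server is contacted only on a cache miss.
        return cache.computeIfAbsent(key, this::remoteLookup);
    }
}

Repeated lookups of the same key are then served locally, hiding communication latency at
the price of possibly stale data, which is exactly the consistency question taken up in Unit V.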
1.2.5 Pitfalls
False assumptions made by first-time developers :
a) The network is reliable.
b) The network is secure.
c) The network is homogeneous.
d) The topology does not change.
e) Latency is zero.
f) Bandwidth is infinite.
g) Transport cost is zero.
h) There is one administrator.
Review Questions
1. Explain what is scalability in distributed system ? What are the challenges to design
scalable distributed system ? SPPU : May - 19, End sem, Marks 5
A device must be continually aware of the fact that its environment may change at any
time. Many devices in pervasive system will be used in different ways by different users.
Devices generally join the system in order to access information and information should
then be easy to read, store, manage and share.
Pervasive systems are all around us and ideally should be able to adapt to the lack of human
administrative control :
1. Automatically connect to a different network;
2. Discover services and react accordingly;
3. Automatically configure themselves.
Electronic Health Care Systems
New devices are being developed to monitor the well-being of individuals and to
automatically contact physicians when needed. Major goal is to prevent people from being
hospitalized.
Personal health care systems equipped with various sensors organized in a Body-Area
Network (BAN). Such a network should at worst only minimally hinder a person.
A central hub is part of the BAN and collects data as needed. Data is then offloaded to a
larger storage device. The BAN is continuously hooked up to an external network through a
wireless connection, to which it sends monitored data.
Sensor Network
A sensor network consists of tens to hundreds or thousands of relatively small nodes, each
equipped with a sensing device. Most sensor networks use wireless communication and the
nodes are often battery powered.
Their limited resources, restricted communication capabilities, and constrained power
consumption demand that efficiency be high on the list of design criteria.
The relation with distributed systems can be made clear by considering sensor networks as
distributed databases. To organize a sensor network as a distributed database, there are
essentially two extremes :
1. Sensors do not cooperate but simply send their data to a centralized database located at
the operator's site.
2. Queries are forwarded to the relevant sensors and each sensor computes an answer,
requiring the operator to sensibly aggregate the returned answers.
Disadvantages : Limited resources including power, restricted communication capabilities.
Object-based architectures are attractive because they provide a natural way to encapsulate
data and the operations that can be performed on that data in a single entity.
The interface provided by an object hides implementation details, meaning that at first we
can consider an object completely independent of its environment.
Data-centered architectures evolve around the idea that processes communicate through a
common repository. For instance, a wealth of networked applications rely on a shared
distributed file system in which virtually all communication takes place through files.
Likewise, web-based distributed systems are largely data-centric : processes communicate
through the use of shared web-based data services.
For instance, publish/subscribe systems are event-based systems. Components are loosely
coupled.
Processes publish events after which the middleware ensures that only those processes that
subscribed to those events will receive them.
In principle, they need not explicitly refer to each other. This is also referred to as being
decoupled in space, or referentially decoupled.
2. Interceptors
An interceptor is a software construct that breaks the usual flow of control and allows
other code to be executed.
Interceptor allows node owners to specify quantitative constraints on the share allocated
to P2P applications for each node resource, and enforces them by means of a set of
resource-limitation mechanisms.
An interceptor is a software layer, placed on top of the local operating system, that
intercepts the resource-access requests issued by P2P applications and controls them in
order to :
(a) Provide application segregation for P2P applications
(b) Maximize their performance without violating the above limitations.
3. Modifiable middleware : Middleware should be able to be modified or dynamically
adapted without taking the system down.
1.6.1 Centralized
Fig. 1.6.1 shows the client-server model. Distributed services are called on by clients.
Servers that provide services are treated differently from clients that use services.
Processes are divided into two groups : server processes and client processes.
In a three-tier architecture, an intermediary process placed between the client and server
processes can :
a) Separate the clients and servers.
b) Cache frequently accessed server data to ensure better performance and scalability.
c) Increase performance by having the intermediary distribute client requests to several
servers such that requests execute in parallel.
d) The intermediary can also act as a translation service by converting requests and
replies to and from a mainframe format, or as a security service that grants server-
access only to trusted clients.
1.6.2 Decentralized
All processes play similar role. Processes interact without particular distinction between
clients and servers. The pattern of communication depends on the particular application.
Napster is a system for sharing files, usually audio files, between different systems. These
systems are peers of each other in that any of them may request a file hosted by another
system.
Fig. 1.6.3 shows peer-to-peer communication. All peers run the same program and offer the
same set of interfaces to each other.
Network File System
NFS is a client/server application that provides shared file storage for clients across a network.
NFS is stateless. All client requests must be self-contained. Each procedure call contains all
the information necessary to complete the call. Server maintains no "between call"
information.
It uses an External Data Representation (XDR) specification to describe protocols in a
machine and system independent way.
NFS is implemented on top of a Remote Procedure Call package (RPC) to help simplify
protocol definition, implementation, and maintenance.
NFS is not so much a true file system, as a collection of protocols that together provide
clients with a model of a distributed file system.
Goals of NFS design :
1. Compatibility : NFS should provide the same semantics as a local Unix file system.
Programs should not need or be able to tell whether a file is remote or local.
2. Easy deployment : The implementation should be easily incorporated into existing
systems; remote files should be made available to local programs without these having
to be modified or re-linked.
3. Machine and OS independence : NFS clients should run on non-Unix platforms.
4. Efficiency : NFS should be good enough to satisfy users, but does not have to be as fast
as a local file system. Clients and servers should be able to easily recover from machine
crashes and network problems.
Each vnode contains a pointer to its parent VFS and a pointer to a mounted-on VFS. This
means that any node in a file system tree can be a mount point for another file system.
A root operation is provided in the VFS to return the root vnode of a mounted file system.
This is used by the pathname traversal routines in the kernel to bridge mount points.
The root operation is used instead of keeping a pointer so that the root vnode for each
mounted file system can be released.
Server Side
Because the NFS server is stateless, when servicing an NFS request it must commit any
modified data to stable storage before returning results.
The implication for UNIX based servers is that requests which modify the file system must
flush all modified data to disk before returning from the call.
For example, on a write request, not only the data block, but also any modified indirect
blocks and the block containing the inode must be flushed if they have been modified.
Client Side
The Sun implementation of the client side provides an interface to NFS which is transparent
to applications.
To make transparent access to remote files work we had to use a method of locating remote
files that does not change the structure of path names.
Transparent access to different types of file systems mounted on a single machine is
provided by a new file system interface in the kernel.
Each "filesystem type" supports two sets of operations : the Virtual Filesystem (VFS)
interface defines the procedures that operate on the filesystem as a whole; and the Virtual
Node (vnode) interface defines the procedures that operate on an individual file within that
filesystem type.
The ability of the client to simply retry the request is due to an important property of most
NFS requests: they are idempotent.
An operation is called idempotent when the effect of performing the operation multiple
times is equivalent to the effect of performing the operation a single time.
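To make the definition concrete, the following hedged sketch (the names are illustrative)
contrasts an idempotent write at an absolute offset, of the kind NFS favours, with a
non-idempotent append :

import java.util.Arrays;

public class IdempotencyDemo {
    private static byte[] file = new byte[16];

    // Idempotent : repeating the call leaves the file in the same state.
    static void writeAt(int offset, byte value) {
        file[offset] = value;
    }

    // Not idempotent : each retry grows the file again.
    static byte[] append(byte[] f, byte value) {
        byte[] g = Arrays.copyOf(f, f.length + 1);
        g[f.length] = value;
        return g;
    }

    public static void main(String[] args) {
        writeAt(3, (byte) 7);
        writeAt(3, (byte) 7);          // same effect as calling once
        file = append(file, (byte) 7);
        file = append(file, (byte) 7); // the file grew twice
        System.out.println(file.length); // prints 18
    }
}

A client may therefore safely retransmit writeAt-style requests after a time-out, which is
precisely why idempotent operations suit a stateless server.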
Working :
When a user is accessing a file, the kernel determines whether the file is a local file or an
NFS file. The kernel passes all references to local files to the local file access module and
all references to the NFS files to the NFS client module.
The NFS client sends RPC requests to the NFS server through its TCP/IP module.
Normally, NFS is used with UDP, but newer implementations can use TCP. The NFS
server receives the requests on port 2049.
Next, the NFS server passes the request through its local file access routines, which access
the file on the server's local disk.
After the server gets the results back from the local file access routines, the NFS server
sends back the reply in the RPC reply format to the client.
While the NFS server is handling the client's request, the local file system needs some
amount of time to return the results to the server. During this time the server does not want
to block other incoming client requests.
To handle multiple client requests, NFS servers are multithreaded or there are multiple
servers running at the same time.
Q.4 __________ transparency allows the movement of resources and clients within a
system without affecting the operation of users or programs.
a) Location b) Access c) Mobility d) Replication
Q.5 URLs are __________ transparent because the part of the URL that identifies a web
server domain name refers to a computer name in a domain, rather than to an
Internet address.
a) Mobility b) Replication
c) Security d) Location
Q.14 The DNS name space is hierarchically organized into a tree of domains, which are
divided into nonoverlapping __________.
a) Zones b) Subzones c) Area d) Location
Q.15 In peer-to-peer systems, the processes are organized into an __________ network.
a) Static b) Dynamic c) Overlay d) All of these
2 Communication
Syllabus
Contents
The seven-layer reference model for Open Systems Interconnection (OSI) was adopted by
the International Organization for Standardization (ISO) to encourage the development of
protocol standards that would meet the requirements of open systems.
An open system is a model that allows any two different systems to communicate
regardless of their underlying architecture (hardware or software). The OSI model is not a
protocol; it is a model for understanding and designing a network architecture that is
flexible, robust and interoperable.
The OSI model is a layered framework for the design of network systems that allows for
communication across all types of computer systems. Fig. 2.1.2 shows OSI model.
3. Network layer : The network layer is responsible for the source-to-destination delivery
of a packet, possibly across multiple networks.
4. Transport layer : The transport layer is responsible for process-to-process delivery of
the entire message. The network layer oversees source-to-destination delivery of
individual packets; it does not recognize any relationship between those packets. The
transport layer ensures that the whole message arrives intact and in order, overseeing
both error control and flow control at the process-to-process level.
5. Session layer : The session layer is the network dialog controller. It was designed to
establish, maintain, and synchronize the interaction between communicating devices.
6. Presentation layer : The presentation layer was designed to handle the syntax and
semantics of the information exchanged between the two systems. It was designed for
data translation, encryption, decryption, and compression.
7. Application layer : The application layer enables the user to access the network. It
provides user interfaces and support for services such as electronic mail, remote file
access, WWW, etc.
The task of dividing messages into packets before transmission and reassembling them at
receiving computer is performed in the transport layer. The transport layer is responsible
for delivering messages to destinations with transport addresses.
A transport address is composed of the network address number of a host computer and a
port number. Ports are software-definable destination points for communication within a
host computer. In the internet there are typically several ports at each host computer with
well-known numbers, each allocated to a given internet service.
Routing
Routing is a function that is required in all networks except LANs such as Ethernet, which
provide a direct connection between all pairs of attached hosts.
The best route for communication between points in the network is re-evaluated
periodically, taking into account the current traffic and any faults in the network; this is
adaptive routing. Delivery of packets to their destinations is the collective responsibility of
the routers located at connection points.
The routing algorithm is implemented by a program in the network layer at each node and
has two functions :
1. To decide the route each packet takes : in circuit-switched and frame-relay network
layers, this decision is made whenever a virtual circuit or connection is established.
2. To update its knowledge of the network based on traffic monitoring and the detection of
failures.
A simple routing algorithm discussed here is the "distance vector" algorithm, which is the
basis of the Routing Information Protocol (RIP) used in the Internet. In this algorithm each
router maintains a table containing a single entry for each possible destination, showing the
next hop a packet must take toward that destination. The cost field in the table is a simple
measure of vector distance, i.e. the number of hops to the given destination.
Fig. 2.1.3 shows routing in a wide area network. For a packet addressed to C arriving at the
router at A, the algorithm uses the routing table at A, chooses the row for destination C,
and therefore forwards the packet on the link labelled 1. When the packet arrives at B the
same procedure is followed and link 2 is selected.
When the packet arrives at C, the routing table entry shows "local", meaning the packet
should be delivered to a local host. The routing tables are built up and maintained by the
routers, and updated whenever faults occur in the network.
RIP Routing Algorithm
Each router exchanges and updates its routing-table information using the Routing
Information Protocol (RIP), which performs the following high-level actions :
1. Periodically, and whenever the local routing table changes, each router sends the table
(in summary form, as a RIP packet) to all accessible neighbours.
2. When a table is received from a neighbouring router, if the received table shows a route
to a new destination, or a lower-cost route to an existing destination, then the local
table is updated with the new route.
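The second rule can be sketched as the following table-merge step. This is a hedged
illustration only; the field and method names are not taken from any real RIP
implementation :

import java.util.HashMap;
import java.util.Map;

// Sketch of RIP rule 2 : merge a neighbour's advertised table into the
// local one, keeping the lower-cost route for each destination.
public class RipTable {
    // destination -> cost (in hops) via the best known route
    final Map<String, Integer> cost = new HashMap<>();
    // destination -> outgoing link used to reach it
    final Map<String, String> nextHop = new HashMap<>();

    void merge(String neighbourLink, Map<String, Integer> advertised) {
        for (Map.Entry<String, Integer> e : advertised.entrySet()) {
            int viaNeighbour = e.getValue() + 1; // one extra hop to the neighbour
            Integer known = cost.get(e.getKey());
            // New destination, or a lower-cost route : update the local table.
            if (known == null || viaNeighbour < known) {
                cost.put(e.getKey(), viaNeighbour);
                nextHop.put(e.getKey(), neighbourLink);
            }
        }
    }
}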
Internetworking
Many subnets based on many network technologies are integrated to build an internetwork.
To make this possible, the following are needed :
a. A unified internetwork addressing scheme enables packets to be addressed to any host
connected to any subnets (provided by IP addresses in the internet).
b. A protocol defining the format of internetwork packets and giving rules of handling
them (IP protocol in the internet).
c. Interconnecting components that route packets to their destination in terms of
internetwork addresses (performed by internet routers in the internet).
To build an integrated network (an internetwork), many subnets of different network
technologies are integrated. The Internet makes this possible by providing the following items :
a. IP addresses b. The IP protocol c. Internet routers
Routers may in fact be general-purpose computers, and they sometimes also serve as
firewalls. They may be interconnected through subnets or by direct connections. In either
case they are responsible for forwarding internetwork packets and maintaining routing tables.
1. Hub : A common connection point for devices in a network. Hubs are commonly used
to connect segments of a LAN. A hub contains multiple ports. When a packet arrives at
one port, it is copied to the other ports so that all segments of the LAN can see all
packets.
2. Switch : A device that filters and forwards packets between LAN segments. It can
interconnect two or more workstations, but like a bridge, it observes traffic flow and
learns. When a frame arrives at a switch, the switch examines the destination address
and forwards the frame out the one necessary connection.
3. Bridge : A bridge is a device that connects two segments of the same network. The two
networks being connected can be alike or dissimilar. Unlike routers, bridges are
protocol-independent. They simply forward packets without analyzing and re-routing
messages.
4. Router : A router is a device that connects two distinct networks. Routers are similar to
bridges, but provide additional functionality, such as the ability to filter messages and
forward them to different places based on various criteria. The internet uses routers
extensively to forward packets from one host to another.
The version of IP currently in use is IPv4. The new version, IPv6, is designed to overcome
the addressing limitations of IPv4.
An IP address is written as a sequence of four decimal numbers separated by dots, and has
an equivalent symbolic domain name represented in a hierarchy. IP addresses fall into five
classes :
a. Class A : Reserved for very large networks (2^24 hosts on each).
b. Class B : Allocated for organization networks containing more than 255 hosts.
c. Class C : Allocated to all other networks (fewer than 255 hosts on each).
d. Class D : Reserved for multicasting, but this is not supported by all routers.
e. Class E : Unallocated addresses reserved for future requirements.
When an IP datagram (up to 64 Kbytes) is longer than the Maximum Transfer Unit (MTU)
of the underlying network :
a. It is broken into smaller packets at the source and reassembled at its final destination.
b. Each packet has a fragment identifier to enable out-of-order fragments to be collected.
Remote Procedure Call (RPC), originally developed by Sun Microsystems and currently
used by many UNIX-based systems, is an Application Programming Interface (API)
available for developing distributed applications.
It allows programs to execute subroutines on a remote system. The caller program, which
represents the client instance in the client/server model sends a call message to the server
process, and waits for a reply message.
The call message includes the subroutine's parameters, and the reply message contains the
results of executing the subroutine.
RPC also provides a standard way of encoding the data passed between client and server in
a portable fashion, called External Data Representation (XDR).
Traditionally the calling procedure is known as the client and the called procedure is known
as the server.
When making a remote procedure call :
1. The calling environment is suspended, procedure parameters are transferred across the
network to the environment where the procedure is to execute, and the procedure is
executed there.
2. When the procedure finishes and produces its results, its results are transferred back to
the calling environment, where execution resumes as if returning from a regular
procedure call.
The main goal of RPC is to hide the existence of the network from a program. As a result,
RPC doesn't quite fit into the OSI model :
a. The message passing nature of network communication is hidden from the user. The
user doesn't first open a connection, read and write data, and then close the connection.
Indeed, a client often does not even know they are using the network.
b. RPC often omits many of the protocol layers to improve performance. Even a small
performance improvement is important because a program may invoke RPCs often. For
example, on (diskless) Sun workstations, every file access is made via an RPC.
RPC is especially well suited for client-server (e.g., query-response) interaction in which
the flow of control alternates between the caller and callee.
Conceptually, the client and server do not both execute at the same time. Instead, the thread
of execution jumps from the caller to the callee and then back again.
The procedure call (same as function call or subroutine call) is a well-known method for
transferring control from one part of a process to another, with a return of control to the
caller.
Associated with the procedure call is the passing of arguments from the caller (the client) to
the callee (the server).
In most current systems the caller and the callee are within a single process on a given host
system. This is what we called "local procedure calls".
In a RPC, a process on the local system invokes a procedure on a remote system. The
reason we call this a "procedure call" is because the intent is to make it appear to the
programmer that a normal procedure call is taking place.
We use the term "request" to refer to the client calling the remote procedure, and the term
"response" to describe the remote procedure returning its result to the client.
1. The client process sends a request message to the server process and waits for a reply
message. The request message contains the remote procedure's parameters, among
other things.
2. Server process executes the procedure and then returns the result of procedure execution
in a reply message to the client process.
3. Once the reply message is received, the result of procedure execution is extracted, and
the caller's execution is resumed.
2. These network messages are sent to the remote system by the client stub. This requires
a system call to the local kernel.
3. The network messages are transferred to the remote system. Either a
connection-oriented or a connection-less protocol is used.
4. A server stub procedure is waiting on the remote system for the client's request. It
unmarshals the arguments from the network message and possibly converts them.
5. The server stub executes a local procedure call to invoke the actual server function,
passing it the arguments that it received in the network messages from the client stub.
6. When the server procedure is finished, it returns to the server stub with return values.
7. The server stub converts the return values, if necessary, and marshals them into one or
more network messages to send back to the client stub.
8. The messages get transferred back across the network to the client stub.
9. The client stub reads the network messages from the local kernel.
10. After possibly converting the return values, the client stub finally returns to the client
function. This appears to be a normal procedure return to the client.
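The client stub's part of these steps can be outlined as follows. This is an illustrative
sketch of a stub for a single add procedure over TCP, not the code of any particular RPC
package; the host, port and message format are assumptions :

import java.io.*;
import java.net.Socket;

// Illustrative client stub : marshal the arguments, send them, block for
// the reply, unmarshal the result.
public class AddStub {
    private final String host;
    private final int port;

    public AddStub(String host, int port) { this.host = host; this.port = port; }

    // Looks like a local call to the application, but runs remotely.
    public int add(int a, int b) throws IOException {
        try (Socket s = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(s.getOutputStream());
             DataInputStream in = new DataInputStream(s.getInputStream())) {
            out.writeInt(a);          // marshal the arguments
            out.writeInt(b);
            out.flush();
            return in.readInt();      // unmarshal the return value
        }
    }
}

To the application, add(2, 3) looks exactly like a local call; the marshalling, transmission
and unmarshalling are hidden inside the stub.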
Client-Server binding
Binding is the process of connecting the client and server. The server, when it starts up,
exports its interface, identifying itself to a network name server and telling the local
runtime its dispatcher address.
The client, before issuing any calls, imports the server, which causes the RPC runtime to
lookup the server through the name service and contact the requested server to setup a
connection. The import and export are explicit calls in the code.
1. Maybe call semantics
After an RPC time-out (or when a client has crashed and restarted), the client cannot tell
whether the remote procedure (RP) has been called or not.
This is the case when no fault tolerance is built into the RPC mechanism.
Clearly, maybe semantics is not desirable.
2. At-least-once call semantics
With this call semantics, the client can assume that the RP is executed at least once.
Can be implemented by retransmission of the (call) request message on time-out.
Acceptable only if the server's operations are idempotent. That is f(x) = f(f(x)).
3. At-most-once call semantics
When an RPC returns, the client can assume that the remote procedure (RP) has been
called exactly once or not at all.
Implemented by the server's filtering of duplicate requests and caching of replies.
This ensures the RP is called exactly once if the server does not crash during execution
of the RP.
When the server crashes during the RP's execution, the partial execution may lead to
erroneous results.
In this case, we want the effect that the RP has not been executed at all.
At-most-once call semantics are for those RPC applications which require a guarantee
that multiple invocations of the same RPC call by a client will not be processed on the
server.
Such applications usually maintain state information on the server and more than one
invocation of the same RPC call must be detected in order to avoid corruption of the
state information.
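A minimal sketch of this duplicate filtering follows, with illustrative names and an
unbounded reply cache (a real server would also expire old entries) :

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of at-most-once semantics : the server remembers replies by
// request identifier and returns the cached reply for retransmissions,
// so the procedure body runs at most once per request id.
public class AtMostOnceServer {
    private final Map<Long, String> replyCache = new ConcurrentHashMap<>();

    public String handle(long requestId, String args,
                         Function<String, String> procedure) {
        // A duplicate request id means a retransmission : do not re-execute.
        return replyCache.computeIfAbsent(requestId, id -> procedure.apply(args));
    }
}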
1. Describe with diagram the role of client and server stub procedures in RPC in the
context of a procedural language. SPPU : Oct. - 18, In Sem, Marks 5
Java RMI is the Java distributed object model for facilitating communications among
distributed objects. RMI is a higher-level API built on top of sockets.
Socket-level programming allows you to pass data through sockets among computers.
RMI enables you not only to pass data (parameters and return values) among objects on
different systems, but also to invoke methods in a remote object.
Distributed object
Objects consist of a set of data and its methods. Objects provide methods, through the
invocation of which an application obtains access to services.
Distributed object systems may adopt the client-server architecture : objects are managed
by servers and their clients invoke their methods using RMI.
Remote and local method invocations
Fig. 2.3.1 shows the remote and local method invocations.
Remote method invocation : Method invocation between objects in different processes,
whether in the same computer or not.
Remote object reference : Identifier to refer to a certain remote object in a distributed
system, e.g. B's must be made available to A.
Remote interface : Every remote object has one that specifies which methods can be
invoked remotely. e.g. B and F specify what methods in remote interface.
Server interface : The server provides a set of procedures that are available for use by
clients, e.g. a file server provides procedures for reading and writing files.
Remote interface : The class of a remote object implements the methods of its remote
interface. Objects in other processes can invoke only the methods that belong to its remote
interface. A local object, however, can invoke the remote-interface methods as well as the
other methods implemented by the remote object.
Define the remote interface by extending the interface named Remote (java.rmi.Remote).
There are three processes that participate in supporting remote method invocation.
1. The client is the process that is invoking a method on a remote object.
2. The server is the process that owns the remote object. The remote object is an ordinary
object in the address space of the server process.
3. The object registry is a name server that relates objects with names. Objects are
registered with the object registry. Once an object has been registered, one can use the
object registry to obtain access to a remote object using the name of the object.
There are two kinds of classes that can be used in Java RMI.
1. A remote class is one whose instances can be used remotely. An object of such a class
can be referenced in two different ways :
a. Within the address space where the object was constructed, the object is an ordinary
object which can be used like any other object.
b. Within other address spaces, the object can be referenced using an object handle.
While there are limitations on how one can use an object handle compared to an
object, for the most part one can use object handles in the same way as an ordinary
object.
c. For simplicity, an instance of a remote class will be called a remote object.
2. A serializable class is one whose instances can be copied from one address space to
another. An instance of a serializable class will be called a serializable object. In other
words, a serializable object is one that can be marshaled.
Remote classes and interfaces
A remote class has two parts : The interface and the class itself.
The stub informs the remote reference layer that the call should be invoked, then
unmarshals the return value or exception from a marshal stream and informs the remote
reference layer that the call is complete.
Client stub responsible for :
1. Initiate remote calls
2. Marshal arguments to be sent
3. Inform the remote reference layer to invoke the call
4. Unmarshaling the return value
5. Inform remote reference the call is complete
Server skeleton responsible for :
1. Unmarshaling incoming arguments from client
2. Calling the actual remote object implementation
3. Marshaling the return value for transport back to client
The remote reference layer
The remote reference layer deals with the lower level transport interface and is responsible
for carrying out a specific remote reference protocol which is independent of the client
stubs and server skeletons.
The remote reference layer has two cooperating components : The client-side and the
server-side components.
The client-side component contains information specific to the remote server and
communicates via the transport to the server-side component. During each method
invocation, the client and server-side components perform the specific remote reference
semantics.
For example, if a remote object is part of a replicated object, the client-side component can
forward the invocation to each replica rather than just a single remote object.
Define a class that implements the server object interface, as shown in the following
outline :
public class ServerInterfaceImpl extends UnicastRemoteObject
        implements ServerInterface {
    public ServerInterfaceImpl() throws RemoteException {
        super();
    }
    // Implement it
}
The server implementation class must extend the java.rmi.server.UnicastRemoteObject
class. The UnicastRemoteObject class provides support for point-to-point active object
references using TCP streams.
3. Step 3 : Create and register server object
Create a server object from the server implementation class and register it with an RMI
registry :
ServerInterface obj = new ServerInterfaceImpl(...);
Registry registry = LocateRegistry.getRegistry();
registry.rebind("RemoteObjectName", obj);
4. Step 4 : Develop client program
Develop a client that locates a remote object and invokes its methods, as shown in the
following outline :
Registry registry = LocateRegistry.getRegistry(host);
ServerInterface server = (ServerInterface) registry.lookup("RemoteObjectName");
server.service1(...);
The syntax of the API functions is independent of the protocol being used; e.g. TCP/IP and
UNIX-domain protocols can be used by applications through a common set of functions.
This gives better portability of applications across protocol suites.
The API hides the finer details of the protocols from application programs, thereby
yielding faster and less error-prone application development.
Sockets are referenced through socket descriptors, which can be passed directly to UNIX
system I/O calls. File I/O and socket I/O look the same from the programmer's perspective.
Working with sockets is very similar to working with files. The socket ( ) and accept ( )
functions both return handles (file descriptor) and reads and writes to the sockets
requires the use of these handles (file descriptors).
In Linux, sockets and file descriptors also share the same file descriptor table. That is, if
you open a file and it returns a file descriptor with value say 8, and then immediately
open a socket, you will be given a file descriptor with value 9 to reference that socket.
Even though sockets and files share the same file descriptor table, they are still very
different. Sockets have addresses associated with them whereas files do not; note that this
also distinguishes sockets from pipes, since pipes do not have addresses associated with
them.
You cannot randomly access a socket like you can a file with lseek ( ). Sockets must be
in the correct state to perform input or output.
Socket abstraction
Socket is the basic abstraction for network communication in the socket API. Socket
defines an endpoint of communication for a process.
Operating system maintains information about the socket and its connection. Fig. 2.4.1
shows the socket and process.
Socket Creation
int sd = socket(AF_INET, SOCK_STREAM, 0);
if (sd < 0) {
    printf("socket() failed.");
    exit(1);
}
Creating a socket is in some ways similar to opening a file. This function creates a file
descriptor and returns it from the function call. You later use this file descriptor for
reading, writing and using with other socket functions.
Remember that the sockets API are generic. There must be a generic way to specify
endpoint addresses. TCP/IP requires an IP address and port number for each endpoint
address. Other protocol suites (families) may use other schemes.
In UNIX, whenever there is a need for IPC within the same machine, we use mechanisms
like signals or pipes. When we desire communication between two applications possibly
running on different machines, we need sockets.
Sockets are treated as another entry in the UNIX open file table.
Sockets provide an interface for programming networks at the transport layer.
Network communication using sockets is very much similar to performing file I/O. In
fact, socket handle is treated like file handle.
Socket-based communication is programming language independent.
To the kernel, a socket is an endpoint of communication. To an application, a socket is a
file descriptor that lets the application read/write from/to the network.
A server (program) runs on a specific computer and has a socket that is bound to a
specific port. The server waits and listens to the socket for a client to make a connection
request.
To review, there are five significant steps that a program which uses TCP must take to
establish and complete a connection. The server side would follow these steps :
1. Create a socket.
2. Listen for incoming connections from clients.
3. Accept the client connection.
4. Send and receive information.
5. Close the socket when finished, terminating the conversation.
In the case of the client, these steps are followed :
1. Create a socket.
2. Specify the address and service port of the server program.
3. Establish the connection with the server.
4. Send and receive information.
5. Close the socket when finished, terminating the conversation.
Only steps two and three differ, depending on whether it is a client or server application.
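The five steps map directly onto Java's socket classes. A minimal echo example, with an
assumed port number of 9000 :

import java.io.*;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch of the five server-side steps : an echo server.
public class EchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000)) {    // create + listen
            try (Socket client = server.accept();               // accept
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                out.println(in.readLine());                     // send and receive
            }                                                   // close
        }
    }
}

// The corresponding client : create, connect, exchange, close.
class EchoClient {
    public static void main(String[] args) throws IOException {
        try (Socket s = new Socket("localhost", 9000);          // steps 1 to 3
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            out.println("hello");
            System.out.println(in.readLine());                  // prints "hello"
        }
    }
}

Creating the ServerSocket covers steps 1 and 2 on the server, and accept() is step 3; on the
client, constructing the Socket performs creation, addressing and connection in one call.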
Fig. 2.4.2 shows a timeline of the typical scenario that takes place between a TCP client
and server.
Source queue : It is on source machine or same machine. Message can be read only from
local queue.
Destination queue : Message is stored on the queue where queue contains specification of
the destination.
Queues are managed by queue managers, which interact directly with applications. A
router or relay is a special queue manager used for forwarding messages between queues.
It works at the application level.
Message broker
A message queue broker provides delivery services for a message queue messaging system.
Message delivery relies upon a number of supporting components that handle connection
services, message routing and delivery, persistence, security, and logging.
A message server can employ one or more broker instances. Broker components are shown
in Fig. 2.4.8.
To perform this complex set of functions, a broker uses a number of different internal
components, each with a specific role in the delivery process.
The message router component performs the key message routing and delivery service, and
the others provide important support services upon which the Message Router depends.
Main broker service components and functions
1. Message Router : Manages the routing and delivery of messages.
2. Connection Services : Manages the physical connections between a broker and clients,
providing transport for incoming and outgoing messages.
3. Persistence Manager : Manages the writing of data to persistent storage so that system
failure does not result in failure to deliver messages.
4. Security Manager : Provides authentication services for users requesting connections to
a broker and authorization services for authenticated users.
5. Monitoring Service : Generates metrics and diagnostic information that can be written to
a number of output channels that an administrator can use to monitor and manage a
broker.
Group communication simplifies building reliable, efficient distributed systems. Most
current distributed operating systems are based on the Remote Procedure Call; the idea is
to hide the message passing and make the communication look like an ordinary procedure
call. Fig. 2.5.1 shows multicast communication.
Q.7 With asynchronous RPCs, the _________ immediately sends a reply back to the
client the moment the RPC request is received, after which it calls the requested
procedure.
a) Router b) Gateway c) Client d) Server
Q.12 The distributed object model is an extension of the local object model used in
_________ programming languages.
a) Function based b) Procedural based
the receiver.
c The sender keeps on executing after sending a message. The message should
be stored by the middleware.
d The sender blocks execution after sending a message and waits for response
Q.16 Event based architectures can be combined with _________ architecture yielding
what is also known as shared data spaces.
a) Layered b) Object based
Q.18 In the case of an _________ server, the server itself handles the request and, if
necessary, returns a response to the requesting client.
a) Multithreaded b) Concurrent
c) Iterative d) None
3 Synchronization
Syllabus
Contents
There is no common universal time but the speed of light is constant for all observers
irrespective of their velocity.
Timers in computers are based on frequency of oscillation of a quartz crystal. Each
computer has a timer that interrupts periodically.
Time is also an important theoretical construct in understanding how distributed executions
unfold. But time is problematic in distributed systems. Each computer may have its own
physical clock, but the clocks typically deviate, and we cannot synchronize them perfectly.
Needs for precision time :
a. Stock market buy and sell orders
b. Secure document timestamps
c. Distributed network gaming and training
d. Aviation traffic control and position reporting
e. Multimedia synchronization for real-time teleconferencing
f. Event synchronization and ordering
g. Network monitoring measurement and control.
Each computer in a DS has its own internal clock
1. Used by local processes to obtain the value of the current time
2. Processes on different computers can timestamp their events
3. But clocks on different computers may give different times
4. Computer clocks drift from perfect time and their drift rates differ from one another.
Consider a group of people going to a meeting. Each person has a watch. Each watch has a
similar, but different time. Even with the error in time, the group is able to meet and
conduct business. This is how distributed time works. It is difficult to make temporal order
of events and difficult to collect up-to-date information on the state of the entire system.
Designing and debugging algorithms for a distributed system is therefore more difficult
than for centralized systems.
While the best quartz resonators can achieve an accuracy of one second in 10 years, they
are sensitive to changes in temperature and acceleration and their resonating frequency can
change as they age.
The only problem with maintaining a concept of time is when multiple entities attempt to
do it concurrently. Two watches hardly ever agree. Computers have the same problem :
A quartz crystal on one computer will oscillate at a slightly different frequency than on
another computer, causing the clocks to tick at different rates.
The phenomenon of clocks ticking at different rates, creating an ever-widening gap in
perceived time, is known as clock drift. The difference between two clocks at any point in
time is called clock skew and is due to both clock drift and the possibility that the clocks
may have been set differently on different machines.
Fig. 3.1.1 shows skew with two clocks.
Consider two clocks A and B, where clock B runs slightly faster than clock A by
approximately two seconds per hour. This is the clock drift of B relative to A. At one point
in time, the difference in time between the two clocks is approximately 4 seconds. This is
the clock skew at that particular time.
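The numbers in this example can be checked directly; a trivial sketch, assuming the two
clocks started in agreement :

// Skew between the two clocks after t hours, given that B drifts
// 2 seconds per hour relative to A and both started in agreement.
public class SkewDemo {
    public static void main(String[] args) {
        double driftPerHour = 2.0;   // seconds per hour
        double hoursElapsed = 2.0;
        System.out.println(driftPerHour * hoursElapsed);  // skew = 4.0 seconds
    }
}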
Successive events will correspond to different timestamps only if the clock resolution is
smaller than the rate at which events can occur. The rate at which events occur depends on
such factors as the length of the processor instruction cycle.
Applications running at a given computer require only the value of the counter to
timestamp events. The date and time-of-day can be calculated from the counter value.
Clock drift may happen when computer clocks count time at different rates.
Co-ordinated Universal Time (UTC) is an international standard that is based on atomic
time. UTC signals are synchronized and broadcast regularly from land-based radio stations
and satellites.
If the computer clock is behind the time service's, it is OK to set the computer clock to be
the time service's time. However, when the computer clock runs faster, then it should be
slowed down for a period instead of set back to the time service's time directly.
Causing a computer's clock to run slow for a period can be achieved in software, without
changing the rate of the hardware clock. The hardware clock, also called a timer, is usually
a quartz crystal oscillating at a well-defined frequency.
A timer is associated with two registers, a counter and a holding register; the counter
decreases by one at each oscillation. When the counter gets to zero, an interrupt is
generated; this is called one clock tick.
Crystals run at slightly different rates; the difference in time values is called clock skew.
Clock skew causes time-related failures. Fig. 3.1.2 shows the working of a computer clock.
Working :
1. Oscillation at a well-defined frequency
2. Each crystal oscillation decrements the counter by 1
3. When the counter gets to 0, its value is reloaded from the holding register
4. When the counter is 0, an interrupt is generated, which is called a clock tick
5. At each clock tick, an interrupt service procedure adds 1 to the time stored in memory
Synchronization of physical clocks with real-world clock :
1. TAI (International Atomic Time) : Cs133 atomic clock
2. UTC (Universal Co-ordinated Time) : Modern civil time, can be received from WWV
(shortwave radio station), satellite, or network time server.
3. ITS (Internet Time Service), NTP (Network Time Protocol)
Some definitions :
1. Transit of the sun : The event of the sun's reaching its highest apparent point in the sky.
2. Solar day : The interval between two consecutive transits of the sun is called the solar
day.
3. Coordinated Universal Time (UTC) : The most accurate physical clocks known use
atomic oscillators, whose accuracy is about one part in 10^13. The output of these atomic
clocks is used as the standard for elapsed real time, known as International Atomic
Time. Co-ordinated universal time is an international standard that is based on atomic
time, but a so-called leap second is occasionally inserted or deleted to keep in step with
astronomical time.
The difference between two clocks at any point in time is called clock skew and is due to
both clock drift and the possibility that the clocks may have been set differently on different
machines.
Fig. 3.2.1 shows the drift rate of clocks.
If a clock is fast, it simply has to be made to run slower until it synchronizes. If a clock is
slow, the same method can be applied and the clock can be made to run faster until it
synchronizes.
The operating system can do this by changing the rate at which it requests interrupts. For
example, suppose the system requests an interrupt every 17 milliseconds and the clock runs
a bit too slowly. The system can request interrupts at a faster rate, say every 16 or
15 milliseconds, until the clock catches up. This adjustment changes the slope of the system
time and is known as a linear compensating function.
In the Berkeley algorithm, the time daemon (master) polls the other machines, and each of
these machines sends a timestamp as a response to the query. The master then averages the
three timestamps, the two it received and its own, computing :
(3:00 + 3:25 + 2:50)/3 = 3:05
Now it sends an offset to each machine so that the machine's time will be synchronized to
the average once the offset is applied. The machine with a time of 3:25 gets sent an offset
of – 0:20 and the machine with a time of 2:50 gets an offset of + 0:15. The server has to
adjust its own time by + 0:05.
The algorithm also has provisions to ignore readings from clocks whose skew is too great.
The master may compute a fault-tolerant average i.e. averaging values from machines
whose clocks have not drifted by more than a certain amount. If the master machine fails,
any other slave could be elected to take over.
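The averaging and offset computation can be sketched in a few lines of Python (an illustrative sketch, not from the original text; times are minutes since midnight and the skew threshold is an assumption) :

def berkeley_offsets(master_time, slave_times, max_skew=60):
    # Fault-tolerant average : ignore readings whose skew from the
    # master exceeds max_skew (assumed threshold).
    usable = [t for t in slave_times if abs(t - master_time) <= max_skew]
    avg = (master_time + sum(usable)) / (len(usable) + 1)
    # Offsets to be sent back; each machine adjusts by its own offset.
    return avg, {t: avg - t for t in slave_times + [master_time]}

# Example from the text : 3:00 = 180, 3:25 = 205, 2:50 = 170
avg, offsets = berkeley_offsets(180, [205, 170])
# avg == 185 (i.e. 3:05); offsets : 205 -> -20, 170 -> +15, 180 -> +5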
For a certain class of algorithms, it is the internal consistency of the clocks that matters.
The convention in these algorithms is to speak of logical clocks.
Lamport showed clock synchronization need not be absolute. What is important is that all
processes agree on the order in which events occur.
A logical clock Cp of a process p is a software counter that is used to timestamp events
executed by p so that the happened-before relation is respected by the timestamps.
where send(m) is the event of sending the message, and rcv(m) is the event of receiving it.
3. HB3 : If x, y and z are events such that x → y and y → z, then x → z.
If x → y, then we can find a series of events occurring at one or more processes such that
either HB1 or HB2 applies between them. The sequence of events need not be unique.
If two events are not related by the → relation (i.e., neither a → b nor b → a), then they
are concurrent (a || b).
For example, a → b → c → d → f and e → f, but a || e.
Example : Event ordering
The processes run on different machines, each with its own clock running at its own
speed.
When the clock has ticked 6 times in process P1 , it has ticked 8 times in process P2 and 10
times in process P3. Each clock runs at a constant rate, but rate varies according to the
crystals.
At time 6, process P1 sends message m1 to process P2 . The clock in process 2 reads 16
when it arrives. Process 2 will conclude that it took 10 ticks to reach from process 1 to
process 2.
According to this reasoning, message m2 from process 2 to process 3 takes 16 ticks.
In Fig. 3.3.2 (a), message m3 from process 3 to process 2 leaves at 60 and arrives at 56.
Similarly, message m4 from process 2 to process 1 leaves at 64 and arrives at 54. These
values are not possible.
Lamport's solution is given in Fig. 3.3.2 (b), which uses the happens-before relation. Since
message m3 left at 60, it must arrive at 61 or later. Therefore, each message carries its
sending time according to the sender's clock.
When a message arrives and the receiver's clock shows a value prior to the time the message
was sent, the receiver fast-forwards its clock to be one more than the sending time.
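As an illustration (a minimal sketch, not the book's code), a Lamport clock in Python; the process id tie-break anticipates the total ordering discussed next :

class LamportClock:
    def __init__(self, pid):
        self.pid = pid          # used only to break timestamp ties
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time        # timestamp carried by the message

    def receive(self, msg_time):
        # Fast-forward past the sender's timestamp if necessary.
        self.time = max(self.time, msg_time) + 1
        return self.time

    def stamp(self, event_time):
        # (time, pid) pairs yield the total order used for multicasting.
        return (event_time, self.pid)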
Totally ordered multicasting :
We can use logical clocks satisfying the clock condition to place a total ordering on the
set of all system events : simply order the events by the times at which they occur.
To break ties, Lamport proposed the use of any arbitrary total ordering of the processes,
i.e. the process id.
Using this method, we can assign a unique timestamp to each event in a distributed system
and so provide a total ordering of all events. This is very useful in distributed systems, for
example for solving the mutual exclusion problem.
We sometimes need to guarantee that concurrent updates on a replicated database are seen
in the same order everywhere :
P1 adds ₹ 100 to an account (initial value : ₹ 1000)
P2 increments the account by 1 %
There are two replicas. Fig. 3.3.3 shows an update of a replicated database leaving it in
an inconsistent state.
Fig. 3.3.4
VCi[j] represents the number of events Pj produced that belong to the current causal past of
Pi. When a process Pi produces an event e, it can associate with that event a vector
timestamp whose value equals the current value of VCi.
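A vector clock can be sketched similarly (an illustrative sketch; n is the number of processes) :

class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n            # vc[j] = events of Pj in our causal past

    def local_event(self):
        self.vc[self.pid] += 1
        return list(self.vc)         # timestamp attached to the event

    def send(self):
        return self.local_event()    # the message carries the current vector

    def receive(self, msg_vc):
        # Component-wise maximum, then count the receive event itself.
        self.vc = [max(a, b) for a, b in zip(self.vc, msg_vc)]
        self.vc[self.pid] += 1
        return list(self.vc)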
Example : Assign Lamport clock values to all the events in the above timing
diagram. Assume that each process's logical clock is set to 0 initially.
Fig. 3.3.5
Fig. 3.3.6
Mutual exclusion ensures that concurrent processes make a serialized access to shared
resources or data. It requires that the actions performed by a user on a shared resource must
be atomic.
In a distributed system neither shared variables nor a local kernel can be used in order to
implement mutual exclusion. Thus, mutual exclusion has to be based exclusively on
message passing, in the context of unpredictable message delays and no complete
knowledge of the state of the system.
Mutual exclusion : Makes sure that concurrent processes access shared resources or data in
a serialized way. If a process, say Pi, is executing in its critical section, then no other
process can be executing in its critical section.
Example : Updating a DB or sending control signals to an I/O device
Problem of mutual exclusion frequently arises in distributed systems whenever concurrent
access to shared resources by several sites is involved.
Mutual exclusion is a fundamental issue in the design of distributed systems.
Entry section : The code executed in preparation for entering the critical section
Critical section : The code to be protected from concurrent execution
Exit section : The code executed upon leaving the critical section
Remainder section : The rest of the code
Each process cycles through these sections in the order : remainder, entry, critical, exit.
System model
The system consists of N sites, S1, S2, ...., SN. We assume that a single process is running
on each site. The process at site Si is denoted by Pi.
At any instant, a site may have several requests for critical section. A site queues up these
requests and serves them one at a time.
A site may be in one of three states :
1. Requesting CS
2. Executing CS
3. Neither requesting nor executing requests for CS
Classification of Mutual Exclusion
Different types of algorithms are used to solve the problem of mutual exclusion in
distributed systems, but these algorithms differ in their communication topology. The
topology may be a ring, bus, star, etc. They also maintain different types of information.
These algorithms are divided into two classes :
1. Non-token based : Require multiple rounds of message exchanges for local states to
stabilize
2. Token based : Permission passes around from one site to another. Site is allowed to
enter its critical section if it possesses the token and it continues to hold the token until
the execution of the critical section is over.
The token passes in one direction around the ring, continuously. When a process receives
the token from its neighbour and does not require access to the critical section, it
immediately forwards the token to the next neighbour in the ring.
If it requires access to the critical section, the process :
1. Retains the token,
2. Performs the critical section, and then,
3. Relinquishes access to the critical section, and
4. Forwards the token on to the next neighbour in the ring.
Fig. 3.4.4 shows ring based algorithm.
Once again it is straightforward to determine that this algorithm satisfies the safety and
liveness properties. However, once again we fail to satisfy the fairness property.
Suppose again we have two processes P1 and P4, and consider the following events :
1. Process P1 wishes to enter the critical section but must wait for the token to reach it.
2. Process P1 sends a message m to process P4.
3. The token is currently between process P1 and P4 within the ring, but the message m
reaches process P4 before the token.
4. Process P4 after receiving message m wishes to enter the critical section
5. The token reaches process P4 which uses it to enter the critical section before process
P1.
Performance
Constant bandwidth consumption
Entry delay between 0 and N message transmission times
Synchronization delay between 1 and N message transmission times
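A simulation-style sketch of the ring algorithm (Python, illustrative; a real implementation passes the token in messages between machines) :

class RingNode:
    def __init__(self, pid):
        self.pid = pid
        self.wants_cs = False

    def critical_section(self):
        print("node", self.pid, "in critical section")

def circulate_token(ring, rounds=1):
    # The token visits nodes in ring order; only the holder may enter its CS.
    for _ in range(rounds):
        for node in ring:
            if node.wants_cs:
                node.critical_section()
                node.wants_cs = False   # relinquish, then forward the token

ring = [RingNode(i) for i in range(4)]
ring[2].wants_cs = True
circulate_token(ring)                   # prints : node 2 in critical section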
Fig. 3.4.5 (a) to (d) illustrate the algorithm step by step; in step 2, site S2 enters the
critical section.
Each process Pi is associated with a voting set Vi of processes. The set Vi for the process Pi
is chosen such that :
1. Pi ∈ Vi : A process is in its own voting set.
2. Vi ∩ Vj ≠ ∅ : There is at least one process in the overlap between any two voting sets.
3. |Vi| = |Vj| : All voting sets are the same size.
4. Each process Pi is contained within M voting sets.
When a processor wants to enter a critical section, it sends a request to all members of its
district. It may enter, if it gets a grant from all members. When a processor receives a
request it answers with yes, if it has not already cast its vote. On exit it informs its district to
enable a new voting.
As before each process maintains a state variable which can be one of the following :
1. Released : Does not have access to the critical section and does not require it.
2. Wanted : Does not have access to the critical section but does require it.
3. Held : Currently has access to the critical section.
In addition each process maintains a boolean variable indicating whether or not the process
has "voted". Of course voting is not a one-time action. This variable really indicates
whether some process within the voting set has access to the critical section and has yet to
release it. To begin with, these variables are set to "Released" and False respectively.
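The voting rule can be sketched as follows (Python, illustrative; message passing is reduced to direct method calls) :

class Voter:
    def __init__(self, pid):
        self.pid = pid
        self.voted = False    # True while some member of our set holds the CS

    def request(self):
        if not self.voted:    # grant the vote only if we have not voted yet
            self.voted = True
            return True
        return False

    def release(self):
        self.voted = False    # enable a new round of voting

def try_enter(voting_set):
    granted = [v for v in voting_set if v.request()]
    if len(granted) == len(voting_set):
        return True           # all votes collected : enter the critical section
    for v in granted:
        v.release()           # back off; real Maekawa queues deferred requests
    return False

A real implementation must also queue requests it cannot grant immediately and take extra care to avoid deadlock.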
Processes 5 and 6 both respond with OK, as shown in Fig. 3.5.1 (b).
Upon getting the first of these responses, 4 knows that its job is over. It knows that one of
these will take over and become co-ordinator.
In Fig. 3.5.1 (c), both 5 and 6 hold elections, each one only sending messages to those
processes higher than itself.
If there is state information to be collected from disk or elsewhere to pick up where the old
co-ordinator left off, 6 must now do what is needed. When it is ready to take over,
6 announces this by sending a CO-ORDINATOR message to all running processes.
When 4 gets this message, it can now continue with the operation it was trying to do when
it discovered that 7 was dead, but using 6 as the co-ordinator this time. In this way the
failure of 7 is handled and the work can continue.
If process 7 is ever restarted, it will just send all the others a CO-ORDINATOR message
and bully them into submission.
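A bully election sketch (Python, illustrative; the alive set stands in for a failure detector and message passing is simplified to function calls) :

def bully_election(pids, alive, starter):
    # Any higher process that answers OK takes over the election;
    # if nobody higher answers, the starter wins.
    higher = [p for p in pids if p > starter and p in alive]
    if not higher:
        return starter
    return bully_election(pids, alive, min(higher))

# Example from the text : 7 has crashed and 4 starts the election.
print(bully_election(list(range(1, 8)), {1, 2, 3, 4, 5, 6}, 4))   # -> 6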
Suppose two processes, 2 and 5, simultaneously discover that the previous co-ordinator,
process 7, has crashed. Each builds an ELECTION message and starts circulating it,
independently of the other.
Both messages will go all the way around, and both 2 and 5 will convert them into
CO-ORDINATOR messages with exactly the same members and in the same order. When
both have gone around again, both will be removed. It does no harm to have extra messages
circulating; at worst this consumes a little bandwidth, which is not considered wasteful.
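A sketch of the ring election (Python, illustrative; the ELECTION message collects live ids as it circulates and the highest id becomes co-ordinator) :

def ring_election(ring, alive, starter):
    members, n, i = [], len(ring), ring.index(starter)
    while True:
        pid = ring[i % n]
        if pid in alive:
            if pid == starter and members:
                break               # the message has come all the way around
            members.append(pid)
        i += 1                      # dead nodes are simply skipped
    return max(members)             # carried in the CO-ORDINATOR message

print(ring_election([1, 2, 3, 4, 5, 6, 7], {1, 2, 3, 4, 5, 6}, 2))   # -> 6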
1. Explain in detail ring algorithm. SPPU : Dec. - 18, End sem, Marks 4
A satellite continuously broadcasts its position, and time stamps each message with its local
time. This broadcasting allows every receiver on Earth to accurately compute its own
position using, in principle, only three satellites.
In order to compute a position, consider first the two-dimensional case, in which two
satellites are drawn, along with the circles representing points at the same distance from
each respective satellite.
The y-axis represents the height, while the x-axis represents a straight line along the Earth's
surface at sea level. The intersection of the two circles is a unique point. Because the GPS
receiver does not carry an atomic clock, the measured distances between the receiver and
the GPS satellites contain errors originating from the clock error; such a measured distance
is called the pseudorange.
The real distance is simply computed as :
Ri = sqrt((Xsi – X)^2 + (Ysi – Y)^2 + (Zsi – Z)^2)
where Ri is the real distance between the i-th satellite and the receiver P, assuming the
position of the satellite Si and the receiver P under the geocentric rectangular coordinate
system is (Xsi, Ysi, Zsi) and (X, Y, Z), respectively.
Fig. 3.6.2
In GPS, node P can compute its own coordinates (xp, yp) by solving three equations in
the two unknowns xp and yp :
di^2 = (xi – xp)^2 + (yi – yp)^2   (i = 1, 2, 3)
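Numerically, the 2-D case reduces to a small linear system : subtracting the first equation from the other two eliminates the quadratic terms. A sketch in Python (illustrative; the anchor positions and ranges are made-up values) :

import math

def trilaterate_2d(anchors, dists):
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = dists
    # Linearised system A [xp, yp]^T = b
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

anchors = [(0, 0), (10, 0), (0, 10)]          # assumed satellite positions
p = (3, 4)                                    # true receiver position
dists = [math.dist(a, p) for a in anchors]    # exact ranges, no clock error
print(trilaterate_2d(anchors, dists))         # ~ (3.0, 4.0)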
Fig. 3.7.1
Key components :
1. Publishers : Publishers generate event data and publishes them.
2. Subscribers : Subscribers submit their subscriptions and process the events received.
3. P/S service : It's the mediator/broker that filters and routes events from publishers to
interested subscribers.
Publishers form a one-to-many relationship with their subscribers, but the publishers do not
know who is subscribed. Subscribers also do not need to know the publisher, as long as
they can specify which kind of messages they would like to receive.
Event space is divided in topics, corresponding to logical channels. The participants
subscribe for a topic and publish on a topic.
Publish-subscribe is also a key component of Google's infrastructure.
Examples of Internet applications that can use a publish/subscribe system are multi-party
messaging, personal information management, information sharing, on-line news
distribution, service discovery and electronic auctions.
Characteristics of publish-subscribe system
1. Asynchronous communication : Publishers and subscribers are loosely coupled.
2. Many-to-many interaction between publishers and subscribers.
3. Content-based pub/sub is very expressive.
4. Heterogeneous : Distributed event-based systems allow heterogeneous components to
be connected across the Internet.
1. Content Based :
In channel-based systems, publishers publish events to named channels and subscribers
subscribe to one of these named channels to receive all events sent to that channel.
In content-based publish/subscribe, by contrast, notifications are not classified
according to some pre-defined external criterion (such as a topic name), but according
to their content.
The advantage of a content-based system is its flexibility : it gives more flexibility and
power to subscribers by allowing arbitrary, customized queries over the contents of
events.
In most content-based systems, events are viewed as sets of values of primitive types or
records and properties of events are viewed as fields of such structures.
In a content-based distributed system, messages from publishers do not contain any
address; instead, they are routed through the system based on their content. A network
of brokers can be formed to create a content-based routing system.
Advantages :
1. A notification that does not match any subscription is not sent to any client, saving
network resources.
2. Enable subscribers to describe runtime properties of the message objects they wish to
receive.
Disadvantages :
1. Expressive, but higher runtime overhead.
2. It requires complex protocols/implementation to determine the subscriber.
2. Topic-based :
Topic based is also known as subject based. Message belongs to one of a fixed set of
what are variously referred to as groups, channels or topics. Subscription targets a
group, channel or topic and the user receives all events that are associated with that
group.
In a topic-based system, processes exchange information through a set of predefined
subjects which represent many-to-many distinct (and fixed) logical channels.
For example, in a subject-based system for stock trading, a participant could select one
or two stocks and then subscribe based on the stock name, if that were one of the valid
groups.
Advantages :
1. Efficient implementations
2. Routing is simple
Disadvantages :
1. Very limited expressiveness is offered to subscribers.
2. Inefficient use of bandwidth : A subscriber has to subscribe to a whole topic even if
he/she is interested only in certain specific criteria.
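A minimal topic-based broker (Python, an illustrative sketch; a real P/S service routes events across a network of brokers) :

from collections import defaultdict

class Broker:
    def __init__(self):
        self.subs = defaultdict(list)        # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subs[topic].append(callback)

    def publish(self, topic, event):
        # The publisher does not know who is subscribed.
        for deliver in self.subs[topic]:
            deliver(event)

broker = Broker()
broker.subscribe("stocks/ACME", lambda e: print("got :", e))
broker.publish("stocks/ACME", {"price": 101.5})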
Clients connect to one of several distributed access points. The access points are
themselves inter-connected through message routers that cooperate to form a distributed,
coherent communication service.
For example : CORBA event services, JMS, JEDI etc.
2. Peer-to-Peer model
In a gossip-based PSS, protocol execution at each node is divided into periodic cycles.
In each cycle, every node selects a node from its partial view and exchanges a subset of
its partial view with the selected node.
Subsequently, both nodes update their partial views. Implementations of a PSS vary
based on a number of different policies :
i. Node selection : Determines how a node selects another node to exchange
information with. It can be either randomly (rand), or based on the node’s age (tail).
ii. View propagation : Determines how to exchange views with the selected node. A
node can send its view with or without expecting a reply, called push-pull and push,
respectively.
iii. View selection : Determines how a node updates its view after receiving the nodes’
descriptors from the other node.
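One periodic cycle combining these policies can be sketched as follows (Python, illustrative; send_view stands for the network exchange and is an assumption) :

import random

def gossip_cycle(node, send_view, view_size=8, exchange=4):
    # node.view maps peer descriptors to their age.
    peer = random.choice(list(node.view))               # node selection : rand
    k = min(exchange, len(node.view))
    subset = dict(random.sample(list(node.view.items()), k))
    reply = send_view(peer, subset)                     # push-pull propagation
    merged = {**node.view, **reply}
    # View selection : keep the view_size freshest descriptors (smallest age).
    node.view = dict(sorted(merged.items(), key=lambda kv: kv[1])[:view_size])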
d the set of NTP servers with which you are currently synchronizing.
b sending a token around a set of nodes. Whoever has the token is the coordinator.
c sending a message around all available nodes and choosing the first one on the
resultant list.
d building a list of all live nodes and choosing the largest numbered node in the list.
Q.5 The Ricart and Agrawala distributed mutual exclusion algorithm is _________.
a more efficient and more fault tolerant than a centralized algorithm.
Q.6 A client has a time of 5:05 and a server has a time of 5:25. Using the Berkeley
algorithm, the client's clock will be set to :
a 5:15 b 5:20 c 5:25 d 5:30
d Hardware multicast.
c Assigns the role of coordinator to the process holding the token at the time of
election.
d Picks the process with the largest ID.
Q.9 Which mutual exclusion algorithm works when the membership of the group is
unknown ?
a Centralized b Ricart-Agrawala
d None of above
The naming and locating facilities jointly form a naming system that provides the users
with an abstraction of an object that hides the details of how and where an object is actually
located in the network.
It provides a further level of abstraction when dealing with object replicas. Given an object
name, it returns a set of the locations of the object's replicas. The naming system plays a
very important role in achieving the goal of :
1. Location transparency,
2. Facilitating transparent migration and replication of objects,
3. Object sharing
6. Group naming : A naming system should allow many different objects to be identified
by the same name. Such a facility is useful to support broadcast facility or to group
objects for conferencing or other applications.
7. Meaningful names : A name can be simply any character string identifying some
object. However, for users, meaningful names are preferred to lower level identifiers
such as memory pointers, disk block numbers or network addresses.
8. Performance : The performance measurement of a naming system is the amount of
time needed to map an object's name to its attributes, such as its location. Naming
system should be efficient in the sense that the number of messages exchanged in a
name-mapping operation should be as small as possible.
9. Fault tolerance : A naming system should be capable of tolerating, to some extent,
faults that occur due to the failure of a node or a communication link in a distributed
system network. That is, the naming system should continue functioning, perhaps in a
degraded form, in the event of these failures.
10. Replication transparency : In a distributed system, replicas of an object are generally
created to improve performance and reliability. A naming system should support the
use of multiple copies of the same object in a user-transparent manner.
The cost is high if the object locating mechanism maps to node N3 instead of node N1.
Forwarding Pointers
To locate mobile entities, concept of forwarding pointers is used. Forwarding pointers
enable locating mobile entities. Mobile entities move from one access point to another.
When an entity moves from place A to another place B, it leaves behind (at A) a reference
to its new location at B.
Advantage
1. Simple : As soon as the first name is located using traditional naming service, the chain
of forwarding pointers can be used to find the current address.
Drawbacks
1) The chain can be too long - locating becomes expensive.
2) All the intermediary locations in a chain have to maintain their pointers.
3) Vulnerability if links are broken.
Hence, making sure that chains are short and that forwarding pointers are robust is an
important issue.
Chord is a protocol and algorithm for a peer-to-peer distributed hash table. A distributed
hash table stores key-value pairs by assigning keys to different computers (known as
"nodes"); a node will store the values for all the keys for which it is responsible.
Chord specifies how keys are assigned to nodes, and how a node can discover the value for
a given key by first locating the node responsible for that key. Chord assigns an m-bit
identifier (randomly chosen) to each node. A node can be contacted through its network
address.
The Chord protocol supports just one operation : Given a key, it will determine the node
responsible for storing the key's value. Chord does not itself store keys and values.
A node generates its identifier by picking a value randomly from the hash space. The node
joins the DHT and determines who its predecessor and successor are in the table.
Predecessor(n) : The node with the highest identifier less than n's identifier, allowing for
wrapround.
Successor(n) : The node with the lowest identifier greater than n's identifier, allowing for
wrapround.
A node is then responsible for its own identifier and the identifiers between its identifier
and its predecessor's identifier. Fig. 4.2.1 shows identifier circle for 3 bit identifier.
Internally, chord uses a consistent hash function for mapping keys to node locations. The
consistent hash function of chord is based on standard hash functions like SHA1 that
produces m bit output. The nodes are hashed based on their IP address, while key, value
pair is hashed based on their key.
Identifiers are arranged on an identifier circle modulo 2^m; this identifier ring is called
the Chord ring.
Key k is assigned to the node whose identifier is equal to or greater than the key's
identifier. This node is called successor(k) and is the first node clockwise from k.
Key k is assigned to the first node whose identifier is equal to or follows (the identifier of)
k in the identifier space. This node is the successor node of key k, denoted by successor(k).
If each node knows only how to contact its current successor node on the identifier circle,
all nodes can be visited in linear order. Queries for a given identifier can be passed around
the circle via these successor pointers until they encounter the node that contains the key.
Fig. 4.2.2 shows chord with finger table.
The i-th entry of node n will contain successor((n + 2^(i–1)) mod 2^m). The first entry of
the finger table is actually the node's immediate successor.
Every time a node wants to look up a key k, it will pass the query to the closest successor or
predecessor (depending on the finger table) of k in its finger table (the "largest" one on the
circle whose ID is smaller than k), until a node finds out the key is stored in its immediate
successor.
With such a finger table, the number of nodes that must be contacted to find a successor in
an N-node network is O(log N)
When a node n joins the network, certain keys previously assigned to n's successor now
become assigned to n. When node n leaves the network, all of its assigned keys are
reassigned to n's successor.
Each node n maintains a routing table with up to m entries, called the finger table. The
i-th entry in the table at node n contains the identity of the first node s that succeeds n
by at least 2^(i–1) on the identifier circle :
s = successor(n + 2^(i–1))
where s is called the i-th finger of node n, denoted by n.finger(i).
A finger table entry includes both the Chord identifier and the IP address (and port number)
of the relevant node. The first finger of n is the immediate successor of n on the circle.
DHT construction
Use a logical name space, called the identifier space, consisting of identifiers
{0, 1, 2, …, N – 1}. Identifier space is a logical ring modulo N.
Every node picks a random identifier through a hash function H.
Example : Space N = 16 {0,…,15}
Five nodes a, b, c, d, e. H(a) = 6, H(b) = 5, H(c) = 0, H(d) = 11, H(e) = 2
Fig. 4.2.3 shows chord ring and successor.
The successor of an identifier is the first node met going in clockwise direction starting at
the identifier.
succ(x) is the first node on the ring with an id greater than or equal to x. Here
succ(12) = 0, succ(1) = 2 and succ(6) = 6.
The node s is called the i-th finger of node n, denoted by n.finger[i].node. The first
finger of n is its immediate successor on the circle.
When a node n does not know the successor of a key k, it sends a "find successor" request
to an intermediate node whose ID is closer to k.
Node n finds the intermediate node by searching its finger table for the closest finger f
preceding k, and sends the find successor request to f.
Node f looks in its finger table for the closest entry preceding k, and sends that back to n.
As a result n learns about nodes closer and closer to the target ID.
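A finger-table lookup can be sketched as follows (Python, illustrative; the ring is simulated in one process, using the example nodes 0, 2, 5, 6 and 11 from the text) :

M = 4                                     # identifier space : 0 .. 2**M - 1

def between(x, a, b):
    # True if x lies on the circular open interval (a, b).
    return (a < x < b) if a < b else (x > a or x < b)

class ChordNode:
    def __init__(self, ident):
        self.id = ident
        self.finger = []                  # finger[i] = successor(id + 2**i)

    def find_successor(self, key):
        succ = self.finger[0]             # first finger = immediate successor
        if key == succ.id or between(key, self.id, succ.id):
            return succ                   # the key is owned by our successor
        for node in reversed(self.finger):
            if between(node.id, self.id, key):
                return node.find_successor(key)   # closest preceding finger
        return succ

nodes = {i: ChordNode(i) for i in (0, 2, 5, 6, 11)}
ids = sorted(nodes)
def succ_of(x):
    return nodes[min((i for i in ids if i >= x), default=ids[0])]
for n in nodes.values():
    n.finger = [succ_of((n.id + 2**i) % 2**M) for i in range(M)]
print(nodes[2].find_successor(12).id)     # -> 0, i.e. succ(12) = 0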
Data Storing
Fig. 4.2.5 shows data storing.
The name assigned to machines must be carefully selected from a name space with
complete control over the binding between the names and IP addresses.
i) Flat name spaces :
In DNS, names are defined in an inverted tree structure with the root at the top. The tree
can have only 128 levels : Level 0 to Level 127.
Each node in the tree has a label, which is a string with a maximum of 63 characters. The
root label is a null string, i.e. an empty string.
Each node in the tree also has a domain name; a full domain name is a sequence of labels
separated by dots (.). Fig. 4.3.2 shows the domain names and labels.
In a fully qualified domain name, the last label is terminated by a null string. A Fully
Qualified Domain Name (FQDN) contains the full name of a host.
For example, sinhgad.it.edu.
If a label is not terminated by a null string, it is called a Partially Qualified Domain Name
(PQDN). It starts from a node but does not reach the root.
Hierarchy of Name Servers
To distribute the information among many computers, DNS servers are used, creating as
many domains as there are first-level nodes. Fig. 4.3.3 shows the hierarchy of name servers.
Within a zone, a server is responsible and has some authority. The server builds a database
called a zone file and keeps all the information for every node under that domain.
A domain and a zone are the same if the server accepts responsibility for the domain and
does not divide it into subdomains. They are different if the server divides its domain into
subdomains and delegates part of its authority to other servers.
A root server is a server whose zone consists of the whole tree. A root server usually does
not store any information about domains but delegates its authority to other servers.
Primary server : It stores a file about the zone for which it is an authority. It is responsible
for creating, maintaining and updating the zone file.
Secondary server : It transfers the complete information about a zone from another server
and stores the file on its local disk. These servers neither create nor update the zone files.
Iterative Resolution
Only a single resolution is made and returned (not recursive).
Client must now explicitly contact different name servers if further resolution is needed.
If the server is an authority for the name, it sends the answer. If it is not, it returns the IP
address of the server that it thinks can resolve the query. The client is responsible for
repeating the query to this second server. This process is called iterative resolution because
the client repeats the same query to multiple servers.
Fig. 4.3.6 shows iterative resolution.
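For illustration, iterative resolution can be scripted with the dnspython package (a sketch under simplifying assumptions : glue A records are always present, and 198.41.0.4 is a root server) :

import dns.message
import dns.query
import dns.rdatatype

def iterative_resolve(name, server="198.41.0.4"):
    # Follow referrals downwards, repeating the query at each server.
    while True:
        query = dns.message.make_query(name, dns.rdatatype.A)
        response = dns.query.udp(query, server, timeout=3)
        if response.answer:                       # the authoritative answer
            return response.answer[0][0].to_text()
        for rrset in response.additional:         # referral with glue records
            if rrset.rdtype == dns.rdatatype.A:
                server = rrset[0].to_text()       # ask the next server
                break
        else:
            raise RuntimeError("referral without glue; resolve the NS name first")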
Conceptually, name resolution proceeds in a top-down fashion.
Name resolution can occur in one of two different ways : Recursive resolution and Iterative
resolution.
Name servers use name caching to optimize search costs.
Time To Live (TTL) is used to guarantee a name binding during its time interval. When
the time expires, the cached name binding is no longer valid, so the client must make a
direct name resolution request once again.
A resolver can have multiple requests outstanding at any time. Hence the identification field
is used to relate a subsequent response message to an earlier request message.
The name resolver passes the request message to its local domain name server using
TCP/IP. If the request is for a server on this network, the local domain name server obtains
the corresponding IP address from its DIB and returns it in a reply message.
Step 1 : It sends a query to the local name server, cs.vu.nl. This query contains the domain
name sought, the type (A) and the class (IN).
Step 2 : The local name server has never had a query for this domain before and knows
nothing about it. It may ask a few other nearby name servers, but if none of them know, it
sends a UDP packet to the server for edu given in its database, edu-server.net.
Step 3 : It is unlikely that this server knows the address of india.cs.stes.edu and probably
does not know cs.stes.edu either, but it must know all of its own children, so it forwards the
request to the name server for stes.edu.
Step 4 : In turn, this one forwards the request to cs.stes.edu, which must have the
authoritative resource records.
Step 5 - 8 : Each request is from a client to a server, the resource record requested works its
way back.
Once these records get back to the cs.vu.nl name server, they will be entered into a cache
there, in case they are needed later.
Sometimes users require a service, but they are not concerned with which system entity
supplies that service. Attributes may then be used as values to be looked up.
Directory service is a service that stores collections of bindings between names and
attributes and that looks up entries that match attribute-based specifications. Sometimes
called yellow pages services or attribute-based name services. A directory service returns
the sets of attributes of any objects found to match some specified attributes.
Discovery Services
A discovery service is a directory service that registers the services provided in a
spontaneous networking environment. It provides an interface for automatically registering
and de-registering services, as well as an interface for clients to look up the services they
require.
Directory service is automatically updated as the network configuration changes and meets
the needs of clients in spontaneous networks. It also discovers services required by a client
(who may be mobile) within the current scope, for example, to find the most suitable
printing service for image files after arriving at a hotel.
Examples of discovery services : Jini discovery service, the 'service location protocol', the
'simple service discovery protocol', the 'secure discovery service'.
Example of discovery service : A printer may register its attributes with the discovery
service as follows :
'resourceClass = printer, type=laser, color=yes, resolution=600dpi, location=room101,
url=http://www.collegeNW.com/services/laserprinter'
4.4.2 LDAP
LDAP stands for Lightweight Directory Access Protocol. LDAP defines a standard method
for accessing and updating information in a directory. It has gained wide acceptance as the
directory access method of the Internet and is therefore also becoming strategic within
corporate intranets.
LDAP is based on X.500. It is a fast growing technology for accessing common directory
information. Fig. 4.4.1 shows LDAP uses X.500.
Directories are usually accessed using the client/server model of communication. LDAP
defines a message protocol used by directory clients and directory servers but does not
define a programming interface for the client.
X.500 organizes directory entries in a hierarchical name space capable of supporting large
amounts of information. It also defines powerful search capabilities to make retrieving
information easier. Because of its functionality and scalability, X.500 is often used together
with add-on modules for interoperation between incompatible directory services. It
specifies that communication between the directory client and the directory server uses the
Directory Access Protocol (DAP).
LDAP defines a communication protocol. Every directory needs a namespace. The LDAP
namespace is the system used to reference objects in an LDAP directory. Each object must
have a name.
Namespace hierarchy allows management control. DNS is by definition hierarchical in
nature. The LDAP name-space is hierarchical too. LDAP uses strings to represent data
rather than complicated structured syntaxes such as ASN.1
LDAP defines a set of server operations used to manipulate the data stored by the directory.
LDAP uses TCP/IP for its communications. For a client to be able to connect to an LDAP
directory, it must open a TCP/IP session with the LDAP server.
LDAP minimizes the overhead to establish a session allowing multiple operations from the
same client session. LDAP defines operations for accessing and modifying directory entries
such as :
1. Searching for entries meeting user-specified criteria.
2. Adding an entry.
3. Deleting an entry.
4. Modifying an entry.
5. Modifying the distinguished name or relative distinguished name of an entry (move).
6. Comparing an entry.
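For illustration, these operations map directly onto a client library such as Python's ldap3 (a sketch; the host, credentials and distinguished names are assumptions) :

from ldap3 import Server, Connection, MODIFY_REPLACE

server = Server("ldap://ldap.example.com")           # assumed host
conn = Connection(server, "cn=admin,dc=example,dc=com", "secret",
                  auto_bind=True)                    # opens the TCP session

# Search for entries meeting user-specified criteria.
conn.search("dc=example,dc=com", "(cn=J*)", attributes=["cn", "mail"])

# Add, modify and delete an entry.
conn.add("cn=Jane,dc=example,dc=com", "inetOrgPerson",
         {"sn": "Doe", "mail": "[email protected]"})
conn.modify("cn=Jane,dc=example,dc=com",
            {"mail": [(MODIFY_REPLACE, ["[email protected]"])]})
conn.delete("cn=Jane,dc=example,dc=com")
conn.unbind()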
A distributed file system enables programs to store and access remote files exactly as they
do local ones.
Two distributed file systems have been in widespread use for a decade or more :
a. Sun Network File System (NFS).
b. Andrew File System (AFS).
A file system is an abstraction that enables users to read, manipulate and organize data.
Typically the data is stored in units known as files, in a hierarchical tree where the nodes
are known as directories.
The file system enables a uniform view, independent of the underlying storage devices
which can range between anything from floppy drives to hard drives and flash memory
cards. Since file systems evolved from stand-alone computers the connection between the
logical file system and the storage device was typically a one-to-one mapping.
Even software RAID that is used to distribute the data on multiple storage devices is
typically implemented below the file system layer.
Distributed file system is a resource management component of a distributed operating
system. Distributed file system is a part of distributed system that provides a user with a
unified view of the files on the network. A machine that holds the shared files is called a
server, a machine that accesses the files is called a client.
The file systems in the 1970s were developed for centralized computer systems, where the
data was only accessed by one user at a time. When multiple users and processes were to
access files at the same time a notion of locking was introduced. There are two kinds of
locks, read and write.
Goals of distributed file systems are as follows :
1. Network transparency : Clients should be able to access remote files using the same
operations that apply to local files.
2. High availability : Users should have the same easy access to files, irrespective of their
physical location.
c. Naming transparency : The name of a file should give no hint as to where the file is
located.
d. Replication transparency : The clients do not need to know the existence or
locations of multiple file copies.
2. User mobility : A user should not be forced to work on a specific node but should have
the flexibility to work on different nodes at different times.
3. Performance : The performance of the file system is usually measured as the average
amount of time needed to satisfy client requests.
4. Scalability : A good distributed file system should be designed to easily cope with the
growth of nodes and users in the system.
5. High availability : DFS should continue to function even when partial failures occur
due to the failure of one or more components, such as a communication link failure, a
machine failure or a storage device crash.
6. High reliability : In a good distributed file system, the probability of loss of stored data
should be minimized as far as practicable.
7. Security : DFS should be secure so that its users can be confident of the privacy of their
data.
The directory service provides a mapping between text names for files and their UFIDs.
Clients may obtain the UFID of a file by quoting its text name to the directory service.
The directory service supports the functions needed to generate directories and to add new
files to directories.
3. Client module :
It runs on each computer and provides integrated service (flat file and directory) as a
single API to application programs. For example, in UNIX hosts, a client module
emulates the full set of UNIX file operations.
It holds information about the network locations of flat-file and directory server
processes; and achieves better performance through implementation of a cache of
recently used file blocks at the client.
Flat file service operations :
1. Read(FileId, i, n) -> Data — throws BadPosition unless 1 ≤ i ≤ Length(File) : Reads a
sequence of up to n items from a file starting at item i and returns it in Data.
2. Write(FileId, i, Data) — throws BadPosition unless 1 ≤ i ≤ Length(File) + 1 : Writes a
sequence of Data to a file, starting at item i, extending the file if necessary.
3. Create( ) -> FileId : Creates a new file of length 0 and delivers a UFID for it.
4. Delete(FileId) : Removes the file from the file store.
5. GetAttributes(FileId) -> Attr : Returns the file attributes for the file.
6. SetAttributes(FileId, Attr) : Sets the file attributes.
4. Efficiency : NFS should be good enough to satisfy users, but need not be as fast as a
local file system. Clients and servers should be able to recover easily from machine
crashes and network problems.
The Virtual File System (VFS) interface is implemented using a structure that contains the
operations that can be done on a file system.
Likewise, the vnode interface is a structure that contains the operations that can be done on
a node (file or directory) within a file system.
There is one VFS structure per mounted file system in the kernel and one vnode structure
for each active node. Using this abstract data type implementation allows the kernel to treat
all file systems and nodes in the same way without knowing which underlying file system
implementation it is using.
Each vnode contains a pointer to its parent VFS and a pointer to a mounted-on VFS. This
means that any node in a file system tree can be a mount point for another file system.
A root operation is provided in the VFS to return the root vnode of a mounted file system.
This is used by the pathname traversal routines in the kernel to bridge mount points.
The root operation is used instead of keeping a pointer so that the root vnode for each
mounted file system can be released.
Server Side
Because the NFS server is stateless, when servicing an NFS request it must commit any
modified data to stable storage before returning results.
The implication for UNIX based servers is that requests which modify the file system must
flush all modified data to disk before returning from the call.
For example, on a write request, not only the data block, but also any modified indirect
blocks and the block containing the inode must be flushed if they have been modified.
Client Side
The Sun implementation of the client side provides an interface to NFS which is transparent
to applications.
To make transparent access to remote files work we had to use a method of locating remote
files that does not change the structure of path names.
Transparent access to different types of file systems mounted on a single machine is
provided by a new file system interface in the kernel.
Each "filesystem type" supports two sets of operations : the Virtual Filesystem (VFS)
interface defines the procedures that operate on the filesystem as a whole; and the Virtual
Node (vnode) interface defines the procedures that operate on an individual file within that
filesystem type.
The ability of the client to simply retry the request is due to an important property of most
NFS requests: they are idempotent.
An operation is called idempotent when the effect of performing the operation multiple
times is equivalent to the effect of performing the operation a single time.
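For example (Python, illustrative) : a write at an explicit offset is idempotent, whereas an append is not, which is why NFS requests carry explicit positions :

def write_at(block, offset, data):
    # Idempotent : repeating the call leaves the same result.
    return block[:offset] + data + block[offset + len(data):]

def append(block, data):
    # Not idempotent : a retried append duplicates the data.
    return block + data

b = "AAAA"
assert write_at(write_at(b, 2, "XY"), 2, "XY") == write_at(b, 2, "XY")
assert append(append(b, "XY"), "XY") != append(b, "XY")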
Working :
When a user is accessing a file, the kernel determines whether the file is a local file or an
NFS file. The kernel passes all references to local files to the local file access module and
all references to the NFS files to the NFS client module.
The NFS client sends RPC requests to the NFS server through its TCP/IP module.
Traditionally NFS was used with UDP, but newer implementations can use TCP. The NFS
server receives the requests on port 2049.
Next, the NFS server passes the request through its local file access routines, which access
the file on server's local disk.
After the server gets the results back from the local file access routines, the NFS server
sends back the reply in the RPC reply format to the client.
While the NFS server is handling the client's request, the local file system needs some
amount of time to return the results to the server. During this time the server does not want
to block other incoming client requests.
To handle multiple client requests, NFS servers are multithreaded or there are multiple
servers running at the same time.
4.7.2 Communication
In NFS, all communication between a client and server proceeds along the open network
computing RPC protocol. ONC RPC is similar to other RPC systems.
Every NFS operation can be implemented as a single remote procedure call to a file server.
Up until NFS version 4, the client was made responsible for making the server's life as easy
as possible by keeping requests relatively simple.
For example, in order to read data from a file for the first time, a client normally first has to
look up the file handle using the lookup operation, after which it can issue a read request.
This approach requires two successive RPCs. In a wide-area system the drawback is that
the extra latency of a second RPC may lead to a performance degradation.
NFS version 4 supports compound procedures by which several RPCs can be grouped into
a single request. In the previous example, the client combines the lookup and read request
into a single RPC.
In the case of version 4, it is also necessary to open the file before reading can take place.
There are no transactional semantics associated with compound procedures.
The operations are simply handled in the order as requested. If there are concurrent
operations from other clients then no measures are taken to avoid conflicts.
The NFS naming model provides complete transparent access to a remote file system as
maintained by a server. This transparency is achieved by letting a client be able to mount a
remote file system into its own local file system.
Each client maintains a table which maps the remote file directories to servers.
Instead of mounting an entire file system, NFS allows clients to mount only part of a file
system. A server is said to export a directory when it makes that directory and its entries
available to clients.
The mount protocol is used to establish the initial logical connection between a server and a
client. A mount operation includes the name of the remote directory to be mounted and the
name of the server machine storing it.
The server maintains an export list which specifies local file system that it exports for
mounting along with the permitted machine names.
An NFS server can itself mount directories that are exported by other servers. However, it
is not allowed to export those directories to its own clients.
Instead, a client will have to explicitly mount such a directory from the server that
maintains it.
There is a problem with this model : deciding when a remote file system should be
mounted. To deal with this, NFS implements on-demand mounting of a remote file system,
handled by an automounter that runs as a separate process on the client's machine.
Fig. 4.7.2 shows simple automounter in NFS.
To access a file, a client must first look up its name in a naming service and obtain the
associated file handle. A file handle is a reference to a file within a file system.
It is independent of the name of the file it refers to. A file handle is created by the server
that is hosting the file system and is unique with respect to all file systems exported by the
server.
It is created when the file is created. The client is kept ignorant of the content of a file
handle. In version 4, file handles can have a variable length up to 128 bytes.
The automounter was added to the UNIX implementation of NFS in order to mount a
remote directory dynamically whenever an 'empty' mount point is referenced by a client.
Automounter has a table of mount points with a reference to one or more NFS servers listed
against each. It sends a probe message to each candidate server and then uses the mount
service to mount the file system at the first server to respond.
Automounter keeps the mount table small. Automounter provides a simple form of
replication for read-only file systems.
An NFS file has a number of associated attributes. With NFS version 4, the set of file
attributes has been split into a set of mandatory attributes that every implementation must
support (type, size, change, FSID), a set of recommended attributes that should be
preferably supported, and an additional set of named attributes.
Named attributes are actually not part of the NFS protocol, but are encoded as an array of
(attribute, value)-pairs in which an attribute is represented as a string, and its value as an
un-interpreted sequence of bytes. They are stored along with the file (or directory) and NFS
provides operations to read and write attribute values.
The mount protocol is used to establish the initial logical connection between a server and a
client. A mount operation includes the name of the remote directory to be mounted and the
name of the server machine storing it.
The server maintains an export list which specifies local file system that it exports for
mounting along with the permitted machine names.
UNIX uses /etc/exports for this purpose. Since the list has a maximum length, NFS is
limited in scalability. Any directory within an exported file system can be mounted
remotely on a machine. When the server receives a mount request, it returns a file handle to
the client.
File handle is basically a data-structure of length 32 bytes. It serves as the key for further
access to files within the mounted system.
In UNIX terms, the file handle consists of a file system identifier that is stored in the
superblock, and an inode number to identify the exact mounted directory within the
exported file system.
In NFS, one new field, called the generation number, is added to the inode. A mount can be
of three types :
1. Soft mount : A time bound is there.
2. Hard mount : No time bound.
3. Automount : Mount operation done on demand.
In addition, write operations can be carried out in the cache as well. When the client closes
the file, NFS requires that if modifications have taken place, the cached data must be
flushed back to the server. This approach corresponds to implementing session semantics.
Once a file has been cached, a client can keep its data in the cache even after closing the
file. Also, several clients on the same machine can share a single cache.
Blocks that are read from an NFS server are kept in a disk cache. As blocks of a file are
read, they are added to the cache for this file. Once the file is complete, it is marked as
persistent and can survive client crashes.
This is possible because once the whole file is cached kernel data structures are no longer
necessary to gain information about which blocks of the file are present.
If the cache becomes full a process runs through the cache and removes persistent objects
with preference for least recently used objects.
Partially cached files are not eligible for cleaning as the additional complexity of updating
the kernel data structures associated with these files during cache cleaning is tedious at best.
As this mechanism favours files that are complete in the cache, a background process runs
to collect uncached blocks of partially cached files.
The client also supports an RPC lookup cache that holds recently requested information
about file and directory attributes. These attribute requests actually contribute a large
amount of the RPC traffic associated with NFS.
This cache is however limited in its usefulness as the cache must expire after a time in the
order of 100 ms to maintain NFS semantics.
If the cache is held for longer then the client may no longer hold an accurate view of the
attributes held on the server and there are no conflict resolution procedures in the NFS
protocol to handle such a situation.
Finally the client supports asynchronous writing of files. Buffering of writes of a file to a
server avoids traffic in the situation where a file is modified many times in succession.
This is of limited importance in the HTTP server application, but becomes much more
significant when serving files to a group of workstations.
Disadvantages
1. Network slower than local disk
2. Network or server may fail even when client OK
3. Complexity, security issues
Andrew scenario
User process issues an open and there is not a current copy of the file in the client cache.
The client sends a request to the server for the whole file and stores it as a local file in a
local file system (client cache).
An open is then performed on the local file.
Subsequent operations work on the local file.
When the client issues a close, the entire file is written back, but the copy is still kept on the
client machine.
Implementation of AFS
The key software components in AFS are :
1. Vice : The server side process that resides on top of the UNIX kernel, providing shared
file services to each client. Collection of servers is referred to as vice.
2. Venus : The client-side cache manager, which runs on each client workstation and acts
as an interface between the application programs and Vice.
Fig. 4.8.1 shows distribution of processes in the Andrew File System. All the files in AFS
are distributed among the servers. The set of files in one server is referred to as a volume.
In case a request cannot be satisfied from this set of files, the vice server informs the client
where it can find the required file.
Venus interacts with the kernel's Virtual File System (VFS), which provides the abstraction
of a common file system at each client and is responsible for all distributed file operations.
The
files available to user processes running on clients are either local or shared. Local files are
handled as normal UNIX files. They are stored on a client disk and are available only to
local user processes. Shared files are stored on servers, and copies of them are cached on
the local disks of clients.
The client-side component of AFS is the cache manager. The responsibilities of the cache
manager include retrieving files from servers, maintaining a local file cache, translating file
requests into remote procedure calls, and storing callbacks.
The cache manager redirects all read and write calls to the cached copy. When the client
closes the file, the cache manager flushes the changes to the server.
When the cache manager fetches the file from the server, the server also supplies a callback
associated with the data. The callback is a promise that the data is valid. If another client
modifies the file and writes the changes back to the server, the server notifies all clients
holding callbacks for the file. This is called breaking the callback.
The basic file operations :
1. Open a file : Venus traps application generated file open system calls, and checks
whether it can be serviced locally before requesting Vice for it. It then returns a file
descriptor to the calling application. Vice, along with a copy of the file, transfers a
callback promise, when Venus requests for a file.
2. Read and Write : Reads/Writes are done from/to the cached copy.
3. Close a file : Venus traps file close system calls and closes the cached copy of the file.
If the file had been updated, it informs the Vice server which then replaces its copy with
the updated one, as well as issues callbacks to all clients holding callback promises on
this file. On receiving a callback, the client discards its copy, and works on this fresh
copy.
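The callback-promise bookkeeping on the server side can be sketched as follows (Python, illustrative; network delivery is reduced to method calls) :

class ViceServer:
    def __init__(self):
        self.files = {}                    # name -> contents
        self.callbacks = {}                # name -> set of client objects

    def fetch(self, client, name):
        # Hand out the data together with a callback promise.
        self.callbacks.setdefault(name, set()).add(client)
        return self.files[name]

    def store(self, client, name, data):
        self.files[name] = data
        # Break the callback at every other client holding a promise.
        for c in self.callbacks.get(name, set()) - {client}:
            c.break_callback(name)
        self.callbacks[name] = {client}

class VenusClient:
    def __init__(self):
        self.cache = {}

    def break_callback(self, name):
        self.cache.pop(name, None)         # discard the stale cached copy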
The server wishes to maintain its state at all times, so that no information is lost due to
crashes. This is ensured by Vice, which writes the state to disk. When the server comes up
again, it informs the other servers about its crash, so that information about updates may be
passed to it. The callback mechanism implies a stateful server.
System call interception in AFS
Fig. 4.8.2 shows system call interception in AFS.
Venus intercepts two system calls sent to the OS : open ( ) and close ( ). On an open ( )
request, it investigates the filename to determine whether it lies in AFS space.
If the filename does not lie in AFS space, then the file is a file on the local hard drive, and
Venus simply passes the system call on to the regular open( ) system call handler to handle
as normal. But if it lies in AFS space, then Venus has some work to do.
Other Issues of AFS :
AFS presents a location-transparent UNIX file name space to client, using a set of trusted
servers. Directories are cached in their entirety, while files are cached in 64 KB chunks. All
updates to a file are propagated to its server upon close. Directory modifications are
propagated immediately.
Backup, disk quota enforcement, and most other administrative operations in AFS operate
on volumes. AFS uses ACLs and the granularity of protection is an entire directory.
b home-based approaches
d all of these
c non-consistent d consistent
Q.6 Which one of the following hides the location where in the network the file is stored ?
a Transparent distributed file system
b local name
Q.9 The NFS client and server modules communicate using ________.
a remote method invocation b remote procedure calls
If a file system has been replicated it may be possible to continue working after one
replica crashes by simply switching to one of the other replicas.
Also, by maintaining multiple copies, it becomes possible to provide better protection
against corrupted data.
For example, imagine there are three copies of a file and every read and write operation
is performed on each copy.
The system can then be made safe against a single failing write operation by taking the
value that is returned by at least two copies as the correct one.
2. Replication for performance
Fig. 5.2.1
In this model, writes must occur in the same order on all copies; reads however can be
interleaved on each system, as convenient.
A DSM system is said to be sequentially consistent if for any execution there is some
interleaving of the series of operations issued by all the processes that satisfies the
following two criteria :
1. SC1 : The interleaved sequence of operations is such that if a read operation R(x)a
occurs in the sequence, then either the last write operation that occurs before it in the
interleaved sequence is W(x)a, or no write operation occurs before it and a is the initial
value of x.
2. SC2 : The order of operations in the interleaving is consistent with the program order in
which each individual client executed them.
The result of the execution of a parallel program is the same as if the program were
executed on a single processor in some sequential order :
P: write x; write y; read x;
Q: read y; write x; read x;
Some legitimate sequential orders :
P write x; P write y; P read x; Q read y; Q write x; Q read x;
P write x; Q read y; P write y; Q write x; P read x; Q read x;
Q read y; Q write x; P write x; P write y; P read x; Q read x
All processors see the same sequence of memory references :
a. Suppose P : write x; and Q : write x; execute concurrently.
b. If one process sees P's write first and then Q's, then every process sees the same order.
If write requests are processed exclusively, sequential consistency can be achieved.
A sequential consistency memory model provides one - copy/single - copy semantics
because all the processes sharing a memory location always see exactly the same
contents stored in it.
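As a sketch of how a small execution could be tested for sequential consistency, the following brute-force Python checker enumerates interleavings that respect program order (SC2) and verifies that every read returns the latest preceding write (SC1). The operation encoding and the initial value 0 are our assumptions, not part of any standard API :

    from itertools import permutations

    def is_sequentially_consistent(histories):
        """histories: dict process -> list of ops like ("P", "write", "x", 1)."""
        ops = [op for h in histories.values() for op in h]
        for order in permutations(ops):
            # SC2: the interleaving must respect each process's program order.
            if any(list(filter(lambda o: o[0] == p, order)) != h
                   for p, h in histories.items()):
                continue
            # SC1: every read returns the most recent preceding write (or initial 0).
            mem, legal = {}, True
            for proc, kind, var, val in order:
                if kind == "write":
                    mem[var] = val
                elif mem.get(var, 0) != val:
                    legal = False
                    break
            if legal:
                return True
        return False

The checker is exponential in the number of operations, so it is only practical for the small examples used in this section.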
5.2.3 Linearizability
The result of any execution is the same as if the operations by all processes on the data
were executed in some total order.
The operations of each individual process appear in this sequence in the order in which
they actually happened in real time :
a. The servers' view is brought in to define the ordering of concurrent events.
b. The real times at which activities actually happened are defined by the actions
performed on the servers; they are defined by the actual enqueuing time of each request.
c. Non-overlapping requests have to follow the order of the requests' enqueuing times.
d. Overlapping requests, whose enqueuing times appear in different orders on different
servers, may be ordered arbitrarily, but the result must still be sequentially consistent.
Fig. 5.2.2
Fig. 5.2.3
Fig. 5.2.4
P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2
are potentially causally related.
The PRAM consistency model is simple and easy to implement, and also has good performance.
PRAM consistency can be implemented by simply sequencing the write operations
performed at each node independently of the write operations performed on other nodes.
Eventual consistency requires only that updates are guaranteed to propagate to all replicas.
Eventually consistent data stores therefore work fine as long as clients always access the
same replica.
Client-centric consistency provides consistency guarantees for a single client with respect to
the data accessed by that client.
What happens when different replicas are accessed ?
Example : Consider a distributed database to which you have access through your
notebook. Assume your notebook acts as a front end to the database. At location A you
access the database doing reads and updates. At location B you continue your work, but
unless you access the same server as the one at location A, you may detect inconsistencies,
because :
1. Your updates at A may not have yet been propagated to B
2. You may be reading newer entries than the ones available at A
3. Your updates at B may eventually conflict with those at A
Fig. 5.3.1 shows distributed database for mobile user.
For the mobile user example, eventually consistent data stores will not work properly.
Client-centric consistency provides guarantees for a single client concerning the consistency
of access to a data store by that client. No guarantees are given concerning concurrent
accesses by different clients.
Example : Automatically reading your personal calendar updates from different servers.
Monotonic Reads guarantees that the user sees all updates, no matter from which server the
automatic reading takes place.
Example : Reading (not modifying) incoming mail while you are on the move.
Each time you connect to a different e-mail server, that server fetches (at least) all the
updates from the server you previously visited.
Example : The read operations performed by a single process P1 at two different local
copies of the same data store.
The vertical axis shows the two different local copies of the data store, which we call
Location1 and Location2.
The horizontal axis shows time. Operations carried out by the single process P1, shown in
boldface, are connected by a dashed line representing the order in which they are carried out.
Fig. 5.3.2 shows monotonic read operation.
Process P1 first performs a read operation on X at Location1, returning the value of X1.
This value results from the write operations in Write (X1) performed at Location1. Later,
P1 performs a read operation on X at Location2, shown as Read (X2).
To guarantee monotonic-read consistency, all operations in Write (X1) should have been
propagated to Location2 before the second read operation takes place.
Fig. 5.3.2 (b) : A data store that does not provide monotonic reads
This depicts a situation in which monotonic-read consistency is not guaranteed. After
process P1 has read X1 at Location1, it later performs the operation Read(X2) at Location2.
However, only the write operations in Write(X2) have been performed at Location2; no
guarantee is given that this set also contains all operations contained in Write(X1).
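One hedged way to enforce monotonic reads is for each client to remember the set of writes it has observed, and for a replica to serve a read only after it has applied at least that set. The Replica and Client classes below are illustrative sketches, not drawn from any particular system :

    class Replica:
        def __init__(self):
            self.applied = set()        # ids of writes applied at this replica
            self.store = {}

        def pull(self, write_ids):
            # Stand-in: fetch and apply the given writes from peer replicas.
            self.applied |= write_ids

        def read(self, key, client_seen):
            # Serve the read only if this replica has seen every write the
            # client has already observed; otherwise sync first.
            missing = client_seen - self.applied
            if missing:
                self.pull(missing)
            return self.store.get(key), self.applied

    class Client:
        def __init__(self):
            self.seen = set()           # write ids this client has read from

        def read(self, replica, key):
            value, applied = replica.read(key, self.seen)
            self.seen |= applied        # later reads must reflect at least these
            return value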
A write operation on a copy of item x is performed only if that copy has been brought up to
date by means of any preceding write operation, which may have taken place on other
copies of x. If need be, the new write must wait for old ones to finish.
Example : Updating a program at server S2, and ensuring that all components on which
compilation and linking depends, are also placed at S2.
Example : Maintaining versions of replicated files in the correct order everywhere.
The write operations performed by a single process P at two different local copies of the
same data store.
This resembles PRAM, but here we are considering consistency only for a single process
(client) instead of for a collection of concurrent processes.
Fig. 5.3.3 shows monotonic - write consistent data store and data store that does not provide
monotonic-write consistency.
Fig. 5.3.3 (b) : Store that does not provide monotonic - write consistency
Write(X1) has not been propagated to Location2
Read-your-writes consistency guarantees that the effect of a write operation by a process on
data item x will always be seen by a successive read operation on x by the same process.
Fig. 5.3.4 (b) : A data store that does not provide read-your-writes consistency
All of those writes can take a long time. Using a non-blocking write protocol to handle
the updates can lead to fault tolerance problems.
As the primary is in control, all writes can be sent to each backup replica in the same
order, making it easy to implement sequential consistency.
2. Local - Write Protocols
It is a fully migrating approach. A single copy of the data item is still maintained.
Upon a write, the data item gets transferred to the replica that is writing; the status of
primary for a data item is transferable.
Process : Whenever a process wants to update data item x, it locates the primary copy of
x, and moves it to its own location.
Example : Fig. 5.5.2 shows local write protocol.
Primary-based local - write protocol in which a single copy is migrated between
processes (prior to the read/write).
Requests are processed by all replica managers (RMs) independently. The client interface
compares all replies received and can tolerate N failures out of 2N + 1 replicas, i.e.,
consensus is reached when N + 1 identical responses are received. This model can also
tolerate Byzantine failures.
Fig. 5.5.3 shows active replication.
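The client-side voting step can be sketched as follows, assuming 2N + 1 replica managers and at most N faulty replies. The function name voted_result is ours, used purely for illustration :

    from collections import Counter

    def voted_result(replies, n_faults):
        """replies: iterable of results from the 2N+1 RMs (arriving over time)."""
        counts = Counter()
        for r in replies:
            counts[r] += 1
            if counts[r] >= n_faults + 1:   # N+1 identical responses = consensus
                return r
        raise RuntimeError("no value reached N+1 matching replies")

    # Example: with N = 1 (three RMs), two matching replies decide the result.
    print(voted_result([42, 41, 42], n_faults=1))   # -> 42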
Web caching refers to the temporary storage of web content somewhere between web
servers and clients in order to satisfy future requests from the nearby location. Fig. 5.6.1
shows proxy web cache.
Co-operative caching
In co-operative caching mechanisms, a group of caches work together by collectively
pooling their memory resources to provide a larger proxy cache. These co-operating caches
can also be centrally managed by a server.
To search in the co-operative cache, the proxy forwards the requested URL to a mapping
server. The use of a central mapping service distinguishes the CRISP cache from other
co-operative Internet caches.
Multiple caches in a network often coordinate and share resources in this way in order to
serve each other's requests. When a cache does not have the requested data object, it can
forward the request to a nearby cooperating cache that can serve the object faster than the
origin server.
Cooperative caching is typically implemented across caches within an organization such as
a large enterprise, ISP, or a Content Delivery Network (CDN).
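A request path through such a cooperating group might look like the following sketch. The peer.lookup and fetch_origin calls are assumed interfaces for illustration, not a real proxy API :

    def get(url, local_cache, peers, fetch_origin):
        if url in local_cache:                      # local hit
            return local_cache[url]
        for peer in peers:                          # nearby cooperating caches
            obj = peer.lookup(url)                  # assumed peer-cache interface
            if obj is not None:
                local_cache[url] = obj
                return obj
        obj = fetch_origin(url)                     # slowest path: origin server
        local_cache[url] = obj
        return obj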
CDN providers use caching and replica servers located in different geographical locations
to replicate content. CDN cache servers are also called edge servers or surrogates. Taken
together, the edge servers of a CDN are called a Web cluster.
CDNs distribute content to the edge servers in such a way that all of them share the same
content and URL. Client requests are redirected to the nearby optimal edge server and it
delivers requested content to the end users. Thus, transparency for users is achieved.
Akamai is one of the largest CDNs currently deployed, with tens of thousands of replica
servers placed all over the Internet. To a large extent, Akamai uses well-known technology
to replicate content, notably a combination of DNS based redirection and proxy-caching
techniques.
There are essentially three different kinds of aspects related to replication in Web hosting
systems :
1) Metric estimation
2) Adaptation triggering
3) Taking appropriate measures :
A. Replica placement decisions
B. Consistency enforcement
C. Client-request routing
Metric estimation
1. Latency : The time measured for an action; fetching a document is an example.
2. Spatial metrics : These consist of measuring the distance between nodes in terms of
network-level routing hops.
3. Consistency metrics : These tell the user to what extent a replica is deviating from its
master copy.
4. Financial metrics : These are closely related to the actual infrastructure of the Internet.
For example, most commercial CDNs place servers at the edge of the Internet, meaning that
they hire capacity from ISPs directly servicing end users.
Review Questions
Q.4 __________ redundancy adds extra equipment or processes so that the system can
tolerate the loss or malfunctioning of some components.
a Physical b Time c Information d Sequential
Q.5 If local states jointly do not form a distributed snapshot, further rollback is necessary.
This process of a cascaded rollback may lead to what is called the __________.
a checkpointing b recovery line
d All of these
Q.8 __________ protocols assume that a failure can occur after any non-deterministic
event in the computation.
a Optimistic logging b Pessimistic logging
Q.9 If no updates take place for a long time, all replicas will gradually become consistent.
This form of consistency is called __________ consistency.
a sequential b strict c weak d eventual
Q.10 In a push based approach, also referred to as __________ protocols, updates are
propagated to other replicas without those replicas even asking for the updates.
a client-based b server-based
Q.11 __________ approaches are often used between permanent and server-initiated
replicas, but can also be used to push updates to client caches.
a Push-based b Pull-based
c Client-based d Server-based
6 Fault Tolerance
Syllabus
Contents
Fig. 6.1.1
Requirements :
1. Availability : It is defined as the property that a system is ready to be used
immediately. It is the fraction of the time that a system meets its specification, i.e.
the probability that the system is operational at a given time t.
2. Reliability : It refers to the property that a system can run continuously without failure.
Typically used to describe systems that cannot be repaired or where the continuous
operation of the system is critical.
3. Safety : It refers to the requirement that when a system temporarily fails to operate
correctly, nothing catastrophic happens.
4. Maintainability : It refers to how easily a failed system can be repaired.
3. Timing failures : Applicable only to synchronous distributed systems where time limits
may not be met.
Fault models : Following are the fault models
o Omission faults
o Arbitrary faults
o Timing faults
Faults can occur both in processes and communication channels. The reason can be both
software and hardware faults.
Fault models are needed in order to build systems with predictable behaviour in case of
faults.
Of course, such a system will function according to the predictions, only as long as the
real faults behave as defined by the “fault model”.
1. Omission failures
2. Arbitrary failures :
Arbitrary process failure : Arbitrarily omits intended processing steps or takes unintended
processing steps.
Arbitrary channel failures : Messages may be corrupted, duplicated, delivered out of
order, incur extremely large delays; or non - existent messages may be delivered.
Above two are Byzantine failures, e.g., due to hackers, man-in-the-middle attacks, viruses,
worms, etc.
A variety of Byzantine fault-tolerant protocols have been designed in literature.
Arbitrary failures in processes cannot be detected by seeing whether the process responds to
invocations, because it might arbitrarily omit to reply.
Communication channels also suffer from arbitrary failures. For example : message
contents can be corrupted, a duplicate message can be sent, or a message can be lost on its
way.
Omission and arbitrary failures are as follows :

Sr. No.  Class of failure          Affects   Description
1.       Fail-stop or Crash-stop   Process   Process halts and remains halted. Other
                                             processes may detect this state.
2.       Omission                  Channel   A message inserted in an outgoing message
                                             buffer never arrives at the other end's
                                             incoming message buffer.
3.       Send-omission             Process   A process completes a send, but the message
                                             is not put in its outgoing message buffer.
Key property : When a message is sent, all members of the group must receive it. So, if one
fails, the others can take over for it.
Groups could be dynamic. We need mechanisms to manage groups and membership
(e.g., join, leave, be part of two groups)
Flat groups versus hierarchical groups
Fig. 6.2.1 shows communication in a flat group and communication in a simple hierarchical
group.
Fig. 6.2.1
1. Communication in a flat group :
All the processes are equal and decisions are made collectively.
There is no single point-of-failure, however decision making is complicated as
consensus is required.
Good for fault tolerance as information exchange immediately occurs with all group
members. May impose overhead as control is completely distributed, and voting needs
to be carried out.
Harder to implement.
2. Communication in a simple hierarchical group :
One of the processes is elected to be the coordinator, which selects another process (a
worker) to perform the operation.
Not really fault tolerant or scalable
However, easier to implement
But one or more of the generals may be treacherous, i.e. faulty.
If the commander is treacherous, he proposes attacking to one general and retreating to
another.
If a lieutenant is treacherous, he tells one of his peers that the commander told him to attack
and another that they are to retreat.
The source processor broadcasts its value to the others. A solution must meet the following objectives :
Agreement : All non-faulty processors agree on the same value.
Validity : If source is nonfaulty, then the common agreed value must be the value supplied
by the source processor.
“If source is faulty then all non - faulty processors can agree on any common value”.
“Value agreed upon by faulty processors is irrelevant”
Fig. 6.2.2 shows Byzantine agreement.
No solution for three processes can handle a single traitor. In a system with m faulty
processes, agreement can be achieved only if 2m + 1 processes (more than two-thirds of the
total) are functioning correctly.
A failure detector is an object or piece of code in a process that detects failures of other
processes. A failure detector is not always accurate; such a detector is categorized as an
unreliable failure detector.
An unreliable failure detector outputs one of two values : Unsuspected or Suspected.
Unsuspected : Failure is unlikely; for example, the failure detector has recently received
communication from the unsuspected peer. This may be inaccurate.
Suspected : An indication that the peer process has failed; for example, no message has been
received in quite some time. This too may be inaccurate : the peer process may not have
failed, but the communication link may be down, or the peer process may be much slower
than expected.
A simple algorithm
If we assume that all messages are delivered within some bound, say D seconds. Then we
can implement a simple failure detector as :
Every process p sends a "p is still alive" message to all failure detector processes,
periodically, once every T seconds.
If a failure detector process does not receive a message from process q within T + D
seconds of the previous one then it marks q as "Suspected".
If we choose our bound D too high, then a failed process will often still be marked as
"Unsuspected" for some time. A synchronous system has a known bound on the message
delivery time and the clock drift and hence can implement a reliable failure detector. An
asynchronous system could instead give one of three answers, "Unsuspected", "Suspected"
or "Failed", by choosing two different values of D.
In fact, if we have a known distribution of message transmission times, we could instead
respond to queries about process p with the probability that p has failed. For example : if
you know that 90 % of messages arrive within 2 seconds and it has been two seconds since
your last expected message, you still cannot conclude that there is a 90 % chance that
process p has failed; the message may simply be late.
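The simple detector described above translates almost directly into code. In this sketch, T + D is the deadline after which a silent process is suspected; the class and method names are illustrative, not from any standard library :

    import time

    class FailureDetector:
        def __init__(self, T, D):
            self.deadline = T + D
            self.last_heard = {}                 # process id -> last heartbeat time

        def on_heartbeat(self, proc):
            # Called whenever a "p is still alive" message arrives.
            self.last_heard[proc] = time.monotonic()

        def status(self, proc):
            last = self.last_heard.get(proc)
            if last is None or time.monotonic() - last > self.deadline:
                return "Suspected"               # may be inaccurate: slow link, not crash
            return "Unsuspected"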
a. Server Crashes
The client cannot tell if the crash occurred before or after the request was carried out.
Three possible semantics :
1. At-least-once : keep trying until a reply is received.
2. At-most-once : give up immediately and report back failure.
3. Exactly-once : desirable but not achievable.
b. Lost Request / Reply Messages
The client waits for a reply message and resends the request upon timeout.
Problem : Upon timeout, the client cannot tell whether the request was lost or the reply
was lost.
Client can safely resend the request for idempotent operations
An idempotent operation is an operation that can be safely repeated
E.g., reading the first line of a file is idempotent, transferring money is not
For non-idempotent operations, client can add sequence numbers to requests so that the
server can distinguish a retransmitted request from an original request
The server needs to keep track of the most recently received sequence number from each
client. It will not carry out a retransmitted request, but will still send a reply to the client.
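The duplicate-filtering rule just described can be sketched as follows; DedupServer and handler are illustrative names, not a real RPC framework :

    class DedupServer:
        def __init__(self, handler):
            self.handler = handler               # executes a request, returns a reply
            self.last = {}                       # client id -> (seq, reply)

        def handle(self, client, seq, request):
            prev = self.last.get(client)
            if prev is not None and seq <= prev[0]:
                return prev[1]                   # duplicate: reply without re-executing
            reply = self.handler(request)        # fresh request: carry it out once
            self.last[client] = (seq, reply)
            return reply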
c. Client Crashes after Sending a Request
Each multicast message is stored locally in a history buffer at the sender. Assuming the
receivers are known to the sender, the sender simply keeps the message in its history buffer
until each receiver has returned an acknowledgment.
If a receiver detects it is missing a message, it may return a negative acknowledgment,
requesting the sender for a retransmission
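A sender-side sketch of this history-buffer scheme, with acknowledgments and negative acknowledgments, might look like the following; the send callback and the class name are our assumptions :

    class MulticastSender:
        def __init__(self, receivers, send):
            self.receivers = set(receivers)
            self.send = send                     # send(receiver, seq, msg)
            self.history = {}                    # seq -> (msg, receivers yet to ack)
            self.seq = 0

        def multicast(self, msg):
            self.seq += 1
            self.history[self.seq] = (msg, set(self.receivers))
            for r in self.receivers:
                self.send(r, self.seq, msg)

        def on_ack(self, receiver, seq):
            msg, pending = self.history[seq]
            pending.discard(receiver)
            if not pending:                      # all receivers acknowledged:
                del self.history[seq]            # message may leave the history buffer

        def on_nack(self, receiver, seq):
            self.send(receiver, seq, self.history[seq][0])   # retransmit on request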
Another important design decision in group communication is the ordering of messages
sent to a group. Roughly speaking, there are four possible orderings: no ordering, FIFO
ordering, causal ordering, and total ordering.
Two message sending events are said to be causally related if they are correlated by the
happened before relation.
Fig. 6.4.4 shows the causal ordering.
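Causal ordering is commonly enforced with vector clocks : a message is delivered only when every message that causally precedes it has already been delivered. The sketch below assumes each broadcast carries the sender's vector timestamp; the class and method names are illustrative :

    class CausalReceiver:
        def __init__(self, n):
            self.vc = [0] * n                    # messages delivered per sender
            self.pending = []

        def deliverable(self, sender, ts):
            # ts[sender] must be the next message from that sender, and we must
            # already have everything the sender had seen from everyone else.
            return (ts[sender] == self.vc[sender] + 1 and
                    all(ts[k] <= self.vc[k] for k in range(len(ts)) if k != sender))

        def receive(self, sender, ts, msg):
            self.pending.append((sender, ts, msg))
            progressed = True
            while progressed:                    # drain everything now deliverable
                progressed = False
                for item in list(self.pending):
                    s, t, m = item
                    if self.deliverable(s, t):
                        self.vc[s] += 1
                        self.pending.remove(item)
                        print("deliver:", m)     # application-level delivery
                        progressed = True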
Some applications perform operations on multiple databases. For example : Transfer funds
between two bank accounts or debiting one account and crediting another.
We would like a guarantee that either all the databases get updated, or none does.
Distributed commit problem : Operation is committed when all participants can perform it.
Once a commit decision is reached, this requirement holds even if some participants fail
and later recover.
Commit protocols are used to ensure atomicity across sites.
A transaction which executes at multiple sites must either be committed at all the sites or
aborted at all the sites; it is not acceptable to have a transaction committed at one site
and aborted at another.
The two-phase commit (2PC) protocol is widely used.
The three-phase commit (3PC) protocol is more complicated and more expensive, but
avoids some drawbacks of two-phase commit protocol. This protocol is not used in
practice.
Transactions behave as one operation :
1. Atomicity : All-or-none, if transaction failed then no changes apply to the database
2. Consistency : There is no violation of the database integrity constraints
3. Isolation : Partial results are hidden (due to incomplete transactions)
4. Durability : The effects of transactions that were committed are permanent.
The objective of the two-phase commit is to ensure that each node commits its part of the
transaction; otherwise, the transaction is aborted. If one of the nodes fails to commit, the
information necessary to recover the database is in the transaction log, and the database can
be recovered with the DO-UNDO-REDO protocol.
Time-out actions in the two-phase commit
Time-outs are used to avoid blocking forever when a process crashes or a message is lost.
Fig. 6.5.1 shows the communication in the two-phase commit protocol.
1. To deal with server crashes : Each participant saves tentative updates into permanent
storage right before replying yes/no in the first phase, so that they are retrievable after
crash recovery.
2. To deal with loss of a canCommit ? message : The participant may decide to abort
unilaterally after a timeout.
3. To deal with loss of a Yes/No vote : The coordinator aborts the transaction after a
timeout. It must announce doAbort to those who sent in their votes.
4. To deal with loss of a doCommit message : The participant may wait for a timeout and
send a getDecision request; it cannot abort after having voted Yes but before receiving
doCommit/doAbort.
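Putting the two phases and the timeout rules together, the coordinator side might be sketched as follows. The Participant methods (can_commit, do_commit, do_abort) are assumed for illustration and are not a standard API :

    def two_phase_commit(participants, timeout):
        # Phase 1: collect votes; a vote that is lost or late counts as "no",
        # so the coordinator aborts after the timeout (rule 3 above).
        try:
            votes = [p.can_commit(timeout=timeout) for p in participants]
        except TimeoutError:
            votes = [False]
        decision = all(votes)

        # Phase 2: announce the decision; a participant that voted yes and misses
        # the message must ask again via getDecision (rule 4 above).
        for p in participants:
            if decision:
                p.do_commit()
            else:
                p.do_abort()
        return decision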
Advantages of two phase commit
1. It ensures atomicity even in the presence of deferred constraints.
2. It ensures independent recovery of all sites.
3. Since it takes place in two phases, it can handle network failures and disconnections,
and still assure atomicity in their presence.
Disadvantages of two phase commit
1. Involves a great deal of message complexity.
2. Greater communication overheads as compared to simple optimistic protocols.
3. Blocking of site nodes in case of failure of coordinator.
4. Multiple forced writes of the log, which increase latency.
5. Its performance is again a trade-off, especially for short-lived transactions such as
internet applications.
Review Question
1. Explain the requirements of the atomic commitment problem. How can the atomic
commit protocol be implemented using two-phase commit ? SPPU : Dec. - 18, End sem, Marks 8
Recovery refers to restoring a system to its normal operational state. Once a failure has
occurred, it is essential that the process where the failure happened can recover to a correct
state. Fundamental to fault tolerance is the recovery from an error.
Resources are allocated to executing processes in a computer. For example : A process has
memory allocated to it and a process may have locked shared resources, such as files and
memory.
a. System failure : System does not meet requirements, i.e. does not perform its services
as specified.
b. Erroneous system state : State which could lead to a system failure by a sequence of
valid state transitions.
c. Error : The part of the system state which differs from its intended value.
d. Fault : Anomalous physical condition, e.g. design errors, manufacturing problems,
damage, external disturbances.
A failure occurs when an actual running system deviates from this specified behavior. The
cause of a failure is called an error. An error represents an invalid system state, one that is
not allowed by the system behavior specification.
The error itself is the result of a defect in the system or fault. In other words, a fault is the
root cause of a failure. That means that an error is merely the symptom of a fault. A fault
may not necessarily result in an error, but the same fault may result in multiple errors.
Similarly, a single error may lead to multiple failures.
To ensure correctness, recovery mechanisms must be in place to ensure transaction
atomicity and durability even in the midst of failures.
Distributed recovery is more complicated than centralized database recovery because
failures can occur at the communication links or a remote site. Ideally, a recovery system
should be simple, incur tolerable overhead, maintain system consistency, provide partial
operability and avoid global rollback.
Reliability refers to the probability that the system under consideration does not experience
any failures in a given time period. Availability refers to the probability that the system can
continue its normal execution according to the specification at a given point in time in spite
of failures.
2. System failure :
Behavior : The processor fails to execute.
A system failure occurs when the processor fails to execute. It is caused by software
errors or hardware faults, i.e. CPU/memory/bus failure.
Recovery : The system is stopped and restarted in a correct state.
Assumption : Fail-stop processors, i.e. the system stops execution and its internal state is lost.
3. Secondary storage failure :
A secondary storage failure is said to have occurred when the stored data cannot be
accessed. This failure is usually caused by parity error, head crash or dust particles
settled on the medium.
Behavior : Stored data cannot be accessed.
Errors causing failure : Parity error, head crash, etc.
Recovery/Design strategies : Reconstruct content from archive plus log of activities and
design mirrored disk system.
A system failure can further be classified as follows.
1. An amnesia failure 2. A partial amnesia failure
3. A pause failure 4. A halting failure
4. Communication medium failure :
Behavior : A site cannot communicate with another operational site.
A communication medium failure occurs when a site cannot communicate with another
operational site in the network. It is usually caused by the failure of the switching nodes
and/or the links of the communicating system.
Errors/Faults : Failure of switching nodes or communication links.
Recovery/Design strategies : Reroute, error-resistant communication protocols.
3. Checking must be done periodically to see whether the failed site has recovered.
4. After the failed site restarts, it must initiate a recovery procedure to abort all partial
transactions that were active at the time of failure.
Fig. 6.6.2
6.6.4 Checkpoint
Checkpointing : The process of writing the current committed values of a server’s object
to a new recovery file, together with transaction status entries and intentions lists of
transactions that have not yet been fully resolved.
Checkpoint : The information stored by the checkpointing process. It is a point of
synchronization between database and log file. All buffers are force-written to secondary
storage.
Establish a set of local checkpoints (one for each process in the set) such that no
information flow takes place (i.e., no orphan messages) during the interval spanned by the
checkpoints.
A strongly consistent set of checkpoints (recovery line) corresponds to a strongly consistent
global state.
There is one recovery point for each process in the set. During the interval spanned by the
checkpoints, there is no information flow between any pair of processes in the set, nor
between a process in the set and any process outside the set.
A consistent set of checkpoints corresponds to a consistent global state.
No local checkpoint includes an effect whose cause would be undone due to the rollback of
another process.
Consistent set of checkpoints
Similar to the consistent global state.
Each message that is received in a checkpoint (state) should also be recorded as sent in
another checkpoint (state).
Suppose that Y fails after receiving message 'm'. If Y restarts from checkpoint, message 'm'
is lost due to rollback.
Checkpoint notation :
Each node maintains :
1. Monotonically increasing counter with which each message from that node is labelled.
2. Records of the last message from and the first message to all other nodes.
Fig. 6.6.5
Note : "sl" denotes a "smallest label" that is < any other label, and "ll" denotes a "largest
label" that is > any other label.
Each checkpoint on a data item is assigned a unique sequence number. A data item is
checkpointed only after the state of the data item changes. That is, after a data item is
checkpointed, it is not checkpointed again until at least one other transaction has
accessed and changed the data item.
Let T = {Ti | 1 ≤ i ≤ m} be a set of transactions that access the database system.
Each regular transaction is a partially ordered set of read and/or write operations. A
checkpointing transaction consists of only one operation, an operation that is similar to a
write operation which requires mutually exclusive access to the data item.
Checkpointing in distributed systems requires that all processes (sites) that interact with
one another establish periodic checkpoints. All the sites save their local states :local
checkpoints. All the local checkpoints, one from each site, collectively form a global
checkpoint. The domino effect is caused by orphan messages, which in turn are caused
by rollbacks.
Simple method for taking a consistent set of checkpoints :
a. Every process takes a checkpoint after sending every message.
Phase One :
The initiating process Pi takes a tentative checkpoint and requests that all the processes take
tentative checkpoints.
Each process informs Pi whether it succeeded in taking a tentative checkpoint.
If Pi learns that all processes have taken tentative checkpoints, Pi decides that all
tentative checkpoints should be made permanent.
Otherwise, Pi decides that all tentative checkpoints should be discarded.
Phase Two
1. Pi propagates its decision to all processes.
2. On receiving the message from Pi, all processes act accordingly.
Between tentative checkpoint and commit/abort of checkpoint process must hold back
messages.
Does this guarantee we have a strongly consistent state ? Can you construct an example
that shows we can still have lost messages ?
Synchronous Checkpointing : Properties
5. Y takes a tentative checkpoint only if the last message received by X from Y was sent
after Y sent the first message after the last checkpoint (last_recv(x, y) >=
first_send(y, x)).
When a process takes a checkpoint, it will ask all other processes that sent messages to
the process to take checkpoints.
Synchronous Checkpointing Disadvantages
1. Additional messages must be exchanged to coordinate checkpointing.
2. Synchronization delays are introduced during normal operations.
3. No computational messages can be sent while the checkpointing algorithm is in
progress.
4. If failure rarely occurs between successive checkpoints, then the checkpoint algorithm
places an unnecessary extra load on the system, which can significantly affect
performance.
2. Rollback Recovery
Restore the system state to a consistent state after a failure with assumptions : Single
initiator, checkpoint and rollback recovery algorithms are not invoked concurrently.
Phase One :
Process Pi checks whether all processes are willing to restart from their previous
checkpoints.
A process may reply “no” if it is already participating in a checkpointing or recovering
process initiated by some other process.
If all processes are willing to restart from their previous checkpoints, Pi decides that they
should restart.
Otherwise, Pi decides that all the processes continue with their normal activities.
Phase Two :
Pi propagates its decision to all processes.
On receiving Pi’s decision, the processes act accordingly.
Optimization
A minimum number of processes roll back
Y will restart from its permanent checkpoint only if X is rolling back to a state where
the sending of one or more messages from X to Y is being undone.
Fig. 6.6.7 shows the unnecessary rollback.
Fig. 6.6.8 shows a system consisting of three processes, where horizontal lines extending
toward the right-hand side represent the execution of each process, and arrows between
processes represent messages.
An orphan process is a process that survives the crash of another process, but whose state
is inconsistent with the crashed process after its recovery.
Three processes X, Y and Z exchange information. Information exchange is shown by
arrows (→), and the symbol "[" marks a recovery point to which a process can be rolled back
in the event of a failure.
1. Case 1 : Failure of X after x3 : no impact on Y or Z.
Lost Messages
Regenerating lost messages on recovery :
1. If implemented on unreliable communication channels, the application is responsible
2. If implemented on reliable communication channels, the recovery algorithm is
responsible.
Fig. 6.6.10 shows lost messages due to roll back recovery.
Q.2 A problem with the __________ commit protocol is that when the coordinator has
crashed, participants may not be able to reach a final decision.
a three-phase b two-phase c checkpoint d none
Q.3 __________ redundancy adds extra bits to allow recovery from garbled bits.
a Physical b Time c Information d All of these
Q.6 __________ redundancy adds extra equipment or processes so that the system can
tolerate the loss or malfunctioning of some components.
a Physical b Time c Information d Sequential
Q.7 If local states jointly do not form a distributed snapshot, further rollback is necessary.
This process of a cascaded rollback may lead to what is called the __________
a checkpointing b recovery line
Q.8 As processes take local checkpoints independent of each other, this method is also
referred to as __________.
a coordinated checkpointing b independent checkpointing
d All of these
Q.11 __________ protocols assume that a failure can occur after any non-deterministic
event in the computation.
a Optimistic logging b Pessimistic logging
c uncoordinated d none
Q.13 The most recent consistent global checkpoint is termed as the __________.
a domino effect b recovery line
Q.14 The checkpoints that a process takes independently are __________ checkpoints
while those that a process is forced to take are called forced checkpoints.
a global b communication-induced
c local d uncoordinated
Solved Model Question Paper - 1
Time : 2 Hours] [Maximum Marks : 70
N.B : i. Attempt Q.1 or Q.2, Q.3 or Q.4, Q.5 or Q.6, Q.7 or Q.8.
ii. Neat diagrams must be drawn wherever necessary.
iii. Figures to the right side indicate full marks.
iv. Assume suitable data, if necessary.
Q.1 a) Explain Berkeley Algorithm. [Refer section 3.2.1] [4]
b) Explain happened before relationship in a distributed system for logical clock.
[Refer section 3.3] [6]
c) What is election algorithm ? Explain Bully Algorithm.
[Refer sections 3.5 and 3.5.1] [8]
OR
Q.2 a) Explain Maekawa's Voting Algorithm. [Refer section 3.4.7] [4]
b) Explain the following terms :
i) Drift rate ii) Clock skew iii) Coordinated universal time. [Refer section 3.1] [6]
c) Discuss central server algorithm. Explain performance metrics for mutual exclusion
algorithms. [Refer section 3.4] [8]
Q.3 a) Explain desirable features of a good naming system. [Refer section 4.1.1.1] [4]
b) How does a resolver look up a remote name ? [Refer section 4.3.3] [6]
c) What are distributed hash tables ? Explain Chord with finger table.
[Refer section 4.2.3] [7]
OR
Q.4 a) What is LDAP ? Why use LDAP ? [Refer section 4.4.2] [3]
b) Explain distributed file system requirements. [Refer section 4.5.2] [6]
c) What is NFS ? List goals of NFS design. Draw and explain NFS architecture.
[Refer section 4.7] [8]
Q.5 a) What is replica management ? [Refer section 5.4] [4]