Distributed Systems
Iresh A. Dhotre
M.E. (Information Technology)
Ex-Faculty, Sinhgad College of Engineering,
Pune.
Abhijit D. Jadhav
Ph.D. (Pursuing) (CSE), M. Tech. (CSE)
B.E. (Computer Engineering)
Assistant Professor,
Dr. D. Y. Patil Institute of Technology
Pimpri, Pune.
TECHNICAL PUBLICATIONS®
SINCE 1993 - An Up-Thrust for Knowledge
Distributed Systems
Subject Code : 310245(C)
All publishing rights (printed and ebook version) reserved with Technical Publications. No part of this book
may be reproduced in any form (electronic, mechanical, photocopy or any information storage and
retrieval system) without prior permission in writing from Technical Publications, Pune.
Published by :
TECHNICAL PUBLICATIONS®
Amit Residency, Office No. 1, 412, Shaniwar Peth, Pune - 411030, M.S. INDIA
Ph.: +91-020-24495496/97
Email : [email protected]   Website : www.technicalpublications.org
Printer :
Yogiraj Printers & Binders
Sr. No. 10/1A,
Ghule Industrial Estate, Nanded Village Road,
Tal. - Haveli, Dist. - Pune - 411041.
ISBN 978-93-90770-70-0
Authors
Iresh A. Dhotre
Abhijit D. Jadhav
Dedicated to Students
SYLLABUS
Distributed Systems - 310245(C)
Credit : 03
Examination Scheme :
Mid - Sem (TH) : 30 Marks
End - Sem (TH) : 70 Marks
Unit I Introduction
Defining Distributed Systems, Characteristics, Middleware and Distributed Systems. Design goals :
Supporting resource sharing, Making distribution transparent, Open, Scalable, Pitfalls. Types of
Distributed Systems : High Performance Distributed Computing, Distributed Information Systems,
Pervasive Systems. Architectural styles : Layered architectures, Object based architectures,
Publish Subscribe architectures. Middleware organization : Wrappers, Interceptors, Modifiable
middleware. System architecture : Centralized, Decentralized, Hybrid, Example architectures -
Network File System, Web. (Chapter - 1)
Unit II Communication
Introduction : Layered Protocols, Types of Communication, Remote Procedural Call- Basic RPC
Operation, Parameter Passing, RPC-based application support, Variations on RPC, Example : DCE
RPC, Remote Method Invocation. Message Oriented Communication : Simple Transient
Messaging with Sockets, Advanced Transient Messaging, Message Oriented Persistent
Communication, Examples. Multicast Communication : Application Level Tree-Based
Multicasting, Flooding-Based Multicasting, Gossip-Based Data Dissemination. (Chapter - 2)
Replication and Placement, Content Distribution, Managing Replicated Objects. Consistency
Protocols : Continuous Consistency, Sequential Consistency, Cache Coherence Protocols,
Example : Caching, and Replication in the web. (Chapter - 5)
TABLE OF CONTENTS
Unit - I
Chapter 1 : Introduction 1 - 1 to 1 - 28
Unit - III
Chapter 3 : Synchronization 3 - 1 to 3 - 40
Unit - IV
Chapter 4 : Naming and Distributed File Systems 4 - 1 to 4 - 42
4.7 Case Study : Sun Network File System .................................................................... 4 - 29
4.7.1 NFS Architecture ......................................................................................... 4 - 30
4.7.2 Communication ............................................................................................ 4 - 32
4.7.3 Naming and Mounting ................................................................................. 4 - 32
4.7.4 Caching and Replication .............................................................................. 4 - 35
4.7.5 Advantages and Disadvantages of NFS ..................................................... 4 - 36
4.8 Andrew File System ................................................................................................... 4 - 37
4.9 Multiple Choice Questions with Answers .................................................................. 4 - 40
Unit - V
Chapter 5 : Consistency and Replication 5 - 1 to 5 - 22
5.5.2 Replicated - Write Protocols ........................................................................ 5 - 15
5.5.2.1 Active Replication ......................................................................... 5 - 15
5.5.2.2 Quorum based Protocols .............................................................. 5 - 17
5.6 Caching and Replication in the Web ......................................................................... 5 - 17
5.7 Multiple Choice Questions with Answers .................................................................. 5 - 20
Unit - VI
Chapter 6 : Fault Tolerance 6 - 1 to 6 - 36
UNIT - I
1 Introduction
Syllabus
Contents
1.1 Defining Distributed Systems ........ Oct. - 18, Dec. - 18 ........................ Marks 5
1.2 Design Goals .............................. Oct. - 18, Dec. - 18, May - 19 ........ Marks 5
1.1.1 Characteristics
1. Collection of autonomous computing elements
Distributed systems are often organized as an overlay network, a network built on top of
another network. There are two common overlay networks :
a) Structured overlay where each node has a well-defined set of neighbors it can
communicate with.
b) Unstructured overlay where nodes communicate with a randomly selected set of
nodes.
A well-known class of overlay networks is formed by peer-to-peer networks.
2. Single coherent system
The collection of nodes as a whole operates the same, no matter where, when, and how
interaction between a user and the system takes place.
Examples :
1. An end user cannot tell where a computation is taking place
2. Where data is exactly stored should be irrelevant to an application
3. Whether data has been replicated or not is completely hidden
To support heterogeneous computers and networks while offering a single-system view,
distributed systems are often organized by means of a layer of software that is logically
placed between a higher layer consisting of users and applications and a lower layer
consisting of operating systems.
Middleware is software that lies between an operating system and the applications
running on it. Because this layer is what turns the collection of machines into a single
coherent system, a distributed system is sometimes said to be organized as middleware.
An example of a distributed system would be the World Wide Web where there are
multiple components under the hood that help browsers display content but from a user’s
point of view, all they are doing is accessing the web via a browser.
Resource management : Middleware offers services that can also be found in most
operating systems, including :
a. Security services
b. Accounting services
c. Masking of and recovery from failures
d. Facilities for inter-application communication
Examples of middleware services :
1. Communication : A commonly used communication service is the Remote Procedure
Call (RPC).
2. Transactions : Middleware can offer to execute a group of services in an all-or-nothing
fashion, commonly referred to as an atomic transaction.
3. Service composition : Web-based middleware can help by standardizing the way web
services are accessed and providing the means to compose their functions in a specific
order.
4. Reliability : Reliability is the ability for a system to remain available over a period of
time. Reliable systems are those that can continuously perform their core functions
without service disruptions, errors, or significant reductions in performance.
The main goal of a distributed system is to connect users and resources in a transparent,
open and scalable way. The principal design goals are :
1. Supporting resource sharing
2. Making distribution transparent
3. Openness
4. Scalability
Distribution transparency itself takes several forms, for example :
1. Failure transparency : Users and applications can complete their tasks despite the
failure of hardware and software components, e.g. email.
2. Mobility transparency : Allows the movement of resources and clients within a system
without affecting the operation of users and programs, e.g. mobile phone.
3. Performance transparency : Allows the system to be reconfigured to improve
performance as loads vary.
4. Scaling transparency : Allows the system and applications to expand in scale without
change to the system structure or the application algorithms.
Sr. No.   Transparency    Description
1.        Access          Hide differences in data representation and how a resource is accessed.
2.        Location        Hide where a resource is located.
3.        Migration       Hide that a resource may move to another location.
4.        Relocation      Hide that a resource may be moved to another location while in use.
5.        Replication     Hide that a resource is replicated.
6.        Concurrency     Hide that a resource may be shared by several competitive users.
7.        Failure         Hide the failure and recovery of a resource.
8.        Persistence     Hide whether a (software) resource is in memory or on disk.
1.2.3 Open
Openness means that the system can be easily extended and modified. Openness refers to
the ability to plug and play. You can, in theory, have two equivalent services that follow the
same interface contract, and interchange one with the other.
The integration of new components means that they have to be able to communicate with
some of the components that already exist in the system. Openness and distribution are
related. Distributed system components achieve openness by communicating using well-
defined interfaces.
If the well-defined interfaces for a system are published, it is easier for developers to add
new features or replace sub-systems in the future.
Open systems can easily be extended and modified. New components can be integrated
with existing components.
Differences in data representation or interface types on different processors have to be
resolved. An open system can be constructed from heterogeneous hardware and software;
what matters is that the detailed interfaces of components are published, well defined and
well documented, so that new components can be integrated with existing ones.
The system needs to have a stable architecture so that new components can be easily
integrated while preserving previous investments.
An open distributed system offers services according to standard rules that describe the
syntax and semantics of those services.
1.2.4 Scalable
A system is said to be scalable if it can handle the addition of users and resources without
suffering a noticeable loss of performance or increase in administrative complexity.
Scalability is the ability to accommodate any future growth, whether expected or not.
Distributed system architectures achieve scalability by employing more than one host :
additional computers can be added in order to host additional components.
1. In size : Dealing with large numbers of machines, users, tasks.
2. In location : Dealing with geographic distribution and mobility.
3. In administration : Addressing data passing through different regions of ownership.
The design of scalable distributed systems presents the following challenges :
1. Controlling the cost of resources.
2. Controlling the performance loss.
3. Preventing software resources from running out.
4. Avoiding performance bottlenecks.
Controlling the cost of physical resources : The quantity of servers and other physical
resources should grow no faster than in proportion to the number of users.
Controlling the performance loss : Hierarchic structures such as the DNS scale better than
linear structures and keep the time to access structured data low.
Preventing software resources from running out : The Internet's 32-bit (IPv4) addresses
will run out; the 128-bit (IPv6) addresses solve this at the cost of extra space in messages.
Avoiding performance bottlenecks : The DNS name table was originally kept in a single
master file; partitioning it between servers removed this bottleneck.
Example : File system scalability is defined as the ability to support very large file systems,
large files, large directories and large numbers of files while still providing I/O
performance. Google file system aims at efficiently and reliably managing many extremely
large files for many clients, using commodity hardware.
Various techniques such as replication, caching and cache memory management and
asynchronous processing help to achieve scalability.
Scaling techniques
1. Hiding communication latencies : Examples would be asynchronous communication
as well as pushing code down to clients (e.g. Java applets and JavaScript).
2. Distribution : Taking a component, splitting into smaller parts, and subsequently
spreading them across the system.
3. Replication : Replicating components increases availability, helps balance the load
leading to better performance, helps hide latencies for geographically distributed
systems. Caching is a special form of replication.
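As a small illustration of caching as a form of replication, here is a minimal, hedged sketch
in Java (the class name, method names and the remoteLookup stand-in are illustrative, not
from the text) :

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch : a client-side cache that answers repeated lookups
// locally instead of contacting the remote server each time.
public class CachingProxy {
    private final Map<String, String> cache = new ConcurrentHashMap<>();

    // remoteLookup stands in for an expensive call across the network.
    private String remoteLookup(String key) {
        return "value-of-" + key;
    }

    public String lookup(String key) {
        // The server is contacted only on a cache miss.
        return cache.computeIfAbsent(key, this::remoteLookup);
    }
}

Repeated lookups of the same key are then served locally, hiding communication latency at
the price of possibly stale data, which is exactly the consistency question taken up in Unit V.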
1.2.5 Pitfalls
False assumptions made by first-time developers :
a) The network is reliable.
b) The network is secure.
c) The network is homogeneous.
d) The topology does not change.
e) Latency is zero.
f) Bandwidth is infinite.
g) Transport cost is zero.
h) There is one administrator.
Review Questions
1. Explain what is scalability in distributed system ? What are the challenges to design
scalable distributed system ? SPPU : May - 19, End sem, Marks 5
A device must be continually aware of the fact that its environment may change at any
time. Many devices in pervasive system will be used in different ways by different users.
Devices generally join the system in order to access information and information should
then be easy to read, store, manage and share.
Pervasive systems are all around us and ideally should be able to adapt to the lack of human
administrative control :
1. Automatically connect to a different network;
2. Discover services and react accordingly;
3. Automatically configure themselves.
Electronic Health Care Systems
New devices are being developed to monitor the well-being of individuals and to
automatically contact physicians when needed. Major goal is to prevent people from being
hospitalized.
Personal health care systems equipped with various sensors organized in a Body-Area
Network (BAN). Such a network should at worst only minimally hinder a person.
A central hub is part of the BAN and collects data as needed. Data is then offloaded to a
larger storage device. The BAN is continuously hooked up to an external network through a
wireless connection, to which it sends monitored data.
Sensor Network
A sensor network consists of tens to hundreds or thousands of relatively small nodes, each
equipped with a sensing device. Most sensor networks use wireless communication and the
nodes are often battery powered.
Their limited resources, restricted communication capabilities, and constrained power
consumption demand that efficiency be high on the list of design criteria.
The relation with distributed systems can be made clear by considering sensor networks as
distributed databases. To organize a sensor network as a distributed database, there are
essentially two extremes :
1. Sensors do not cooperate but simply send their data to a centralized database located at
the operator's site.
2. Queries are forwarded to the relevant sensors and each sensor computes an answer,
requiring the operator to sensibly aggregate the returned answers.
Disadvantages : Limited resources including power, restricted communication capabilities.
Object-based architectures are attractive because they provide a natural way to encapsulate
data and the operations that can be performed on that data in a single entity.
The interface provided by an object hides implementation details, meaning that at first we
can consider an object completely independent of its environment.
Data-centered architectures evolve around the idea that processes communicate through a
common repository. For instance, a wealth of networked applications rely on a shared
distributed file system in which virtually all communication takes place through files.
Likewise, web-based distributed systems are largely data-centric : processes communicate
through the use of shared web-based data services.
For instance, publish/subscribe systems are event-based systems. Components are loosely
coupled.
Processes publish events after which the middleware ensures that only those processes that
subscribed to those events will receive them.
In principle, they need not explicitly refer to each other. This is also referred to as being
decoupled in space, or referentially decoupled.
2. Interceptors
An interceptor is a software construct that breaks the usual flow of control and allows
other code to be executed.
Interceptor allows node owners to specify quantitative constraints on the share allocated
to P2P applications for each node resource, and enforces them by means of a set of
resource-limitation mechanisms.
An interceptor is a software layer, placed on top of the local operating system, that
intercepts the resource-access requests issued by P2P applications and controls them in
order to :
(a) Provide application segregation for P2P applications
(b) Maximize their performance without violating the above limitations.
3. Modifiable middleware : Middleware should be able to be modified or dynamically
adapted without taking the system down.
1.6.1 Centralized
Fig. 1.6.1 shows the client-server model. Distributed services are called on by clients.
Servers that provide services are treated differently from clients that use services.
Processes are divided into two groups : server processes and client processes.
In a three-tier architecture, an intermediary process placed between the client and server
processes can :
a) Separate the clients and servers.
b) Cache frequently accessed server data to ensure better performance and scalability.
c) Increase performance by having the intermediary distribute client requests to several
servers such that requests execute in parallel.
d) The intermediary can also act as a translation service by converting requests and
replies to and from a mainframe format, or as a security service that grants server-
access only to trusted clients.
1.6.2 Decentralized
All processes play similar role. Processes interact without particular distinction between
clients and servers. The pattern of communication depends on the particular application.
Napster is a system for sharing files, usually audio files, between different systems. These
systems are peers of each other in that any of them may request a file hosted by another
system.
Fig. 1.6.3 shows peer-to-peer communication. All peers run the same program and offer the
same set of interfaces to each other.
Network File System
NFS is a client/server application that provides shared file storage for clients across a network.
NFS is stateless. All client requests must be self-contained. Each procedure call contains all
the information necessary to complete the call. Server maintains no "between call"
information.
It uses an External Data Representation (XDR) specification to describe protocols in a
machine and system independent way.
NFS is implemented on top of a Remote Procedure Call package (RPC) to help simplify
protocol definition, implementation, and maintenance.
NFS is not so much a true file system, as a collection of protocols that together provide
clients with a model of a distributed file system.
Goals of NFS design :
1. Compatibility : NFS should provide the same semantics as a local Unix file system.
Programs should not need or be able to tell whether a file is remote or local.
2. Easy deployment : The implementation should be easily incorporated into existing
systems; remote files should be made available to local programs without these having
to be modified or re-linked.
3. Machine and OS independence : NFS clients should run on non-Unix platforms.
4. Efficiency : NFS should be good enough to satisfy users, but does not have to be as fast
as a local file system. Clients and servers should be able to easily recover from machine
crashes and network problems.
Each vnode contains a pointer to its parent VFS and a pointer to a mounted-on VFS. This
means that any node in a file system tree can be a mount point for another file system.
A root operation is provided in the VFS to return the root vnode of a mounted file system.
This is used by the pathname traversal routines in the kernel to bridge mount points.
The root operation is used instead of keeping a pointer so that the root vnode for each
mounted file system can be released.
Server Side
Because the NFS server is stateless, when servicing an NFS request it must commit any
modified data to stable storage before returning results.
The implication for UNIX based servers is that requests which modify the file system must
flush all modified data to disk before returning from the call.
For example, on a write request, not only the data block, but also any modified indirect
blocks and the block containing the inode must be flushed if they have been modified.
Client Side
The Sun implementation of the client side provides an interface to NFS which is transparent
to applications.
To make transparent access to remote files work we had to use a method of locating remote
files that does not change the structure of path names.
Transparent access to different types of file systems mounted on a single machine is
provided by a new file system interface in the kernel.
Each "filesystem type" supports two sets of operations : the Virtual Filesystem (VFS)
interface defines the procedures that operate on the filesystem as a whole; and the Virtual
Node (vnode) interface defines the procedures that operate on an individual file within that
filesystem type.
The ability of the client to simply retry the request is due to an important property of most
NFS requests: they are idempotent.
An operation is called idempotent when the effect of performing the operation multiple
times is equivalent to the effect of performing the operation a single time.
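To make the definition concrete, the following hedged sketch (the names are illustrative)
contrasts an idempotent write at an absolute offset, of the kind NFS favours, with a
non-idempotent append :

import java.util.Arrays;

public class IdempotencyDemo {
    private static byte[] file = new byte[16];

    // Idempotent : repeating the call leaves the file in the same state.
    static void writeAt(int offset, byte value) {
        file[offset] = value;
    }

    // Not idempotent : each retry grows the file again.
    static byte[] append(byte[] f, byte value) {
        byte[] g = Arrays.copyOf(f, f.length + 1);
        g[f.length] = value;
        return g;
    }

    public static void main(String[] args) {
        writeAt(3, (byte) 7);
        writeAt(3, (byte) 7);          // same effect as calling once
        file = append(file, (byte) 7);
        file = append(file, (byte) 7); // the file grew twice
        System.out.println(file.length); // prints 18
    }
}

A client may therefore safely retransmit writeAt-style requests after a time-out, which is
precisely why idempotent operations suit a stateless server.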
Working :
When a user is accessing a file, the kernel determines whether the file is a local file or an
NFS file. The kernel passes all references to local files to the local file access module and
all references to the NFS files to the NFS client module.
The NFS client sends RPC requests to the NFS server through its TCP/IP module.
Normally, NFS is used with UDP, but newer implementations can use TCP. The NFS
server receives the requests on port 2049.
Next, the NFS server passes the request through its local file access routines, which access
the file on the server's local disk.
After the server gets the results back from the local file access routines, the NFS server
sends back the reply in the RPC reply format to the client.
While the NFS server is handling the client's request, the local file system needs some
amount of time to return the results to the server. During this time the server does not want
to block other incoming client requests.
To handle multiple client requests, NFS servers are multithreaded or there are multiple
servers running at the same time.
Q.4 __________ transparency allows the movement of resources and clients within a
system without affecting the operation of users or programs.
a) Location b) Access c) Mobility d) Replication
Q.5 URLs are __________ transparent because the part of the URL that identifies a web
server domain name refers to a computer name in a domain, rather than to an
Internet address.
a) Mobility b) Replication
c) Security d) Location
Q.14 The DNS name space is hierarchically organized into a tree of domains, which are
divided into nonoverlapping __________.
a) Zones b) Subzones c) Area d) Location
Q.15 In peer-to-peer systems, the processes are organized into an __________ network.
a) Static b) Dynamic c) Overlay d) All of these
2 Communication
Syllabus
Contents
The seven-layer reference model for Open Systems Interconnection (OSI) was adopted by
the International Organization for Standardization (ISO) to encourage the development of
protocol standards that would meet the requirements of open systems.
An open system is a model that allows any two different systems to communicate
regardless of their underlying architecture (hardware or software). The OSI model is not a
protocol; it is a model for understanding and designing a network architecture that is
flexible, robust and interoperable.
The OSI model is a layered framework for the design of network systems that allows for
communication across all types of computer systems. Fig. 2.1.2 shows OSI model.
3. Network layer : The network layer is responsible for the source-to-destination delivery
of a packet, possibly across multiple networks.
4. Transport layer : The transport layer is responsible for process-to-process delivery of
the entire message. The network layer oversees source-to-destination delivery of
individual packets; it does not recognize any relationship between those packets. The
transport layer ensures that the whole message arrives intact and in order, overseeing
both error control and flow control at the process-to-process level.
5. Session layer : The session layer is the network dialog controller. It was designed to
establish, maintain, and synchronize the interaction between communicating devices.
6. Presentation layer : The presentation layer was designed to handle the syntax and
semantics of the information exchanged between the two systems. It was designed for
data translation, encryption, decryption, and compression.
7. Application layer : The application layer enables the user to access the network. It
provides user interfaces and support for services such as electronic mail, remote file
access, WWW, etc.
The task of dividing messages into packets before transmission and reassembling them at
receiving computer is performed in the transport layer. The transport layer is responsible
for delivering messages to destinations with transport addresses.
A transport address is composed of the network address number of a host computer and a
port number. Ports are software-definable destination points for communication within a
host computer. In the internet there are typically several ports at each host computer with
well-known numbers, each allocated to a given internet service.
Routing
Routing is a function that is required in all networks except LANs such as Ethernet, which
provide a direct connection between all pairs of attached hosts.
The best route for communication between points in the network is re-evaluated
periodically, taking into account the current traffic and any faults in the network; this is
adaptive routing. Delivery of packets to their destinations is the collective responsibility of
the routers located at connection points.
The routing algorithm is implemented by a program in the network layer at each node and
has two functions :
1. To decide the route each packet takes : in circuit-switched and frame-relay network
layers, this decision is made whenever a virtual circuit or connection is established.
2. To update its knowledge of the network based on traffic monitoring and the detection of
failures.
A simple routing algorithm discussed here is the "distance vector" algorithm, which is the
basis of the Routing Information Protocol (RIP) used in the Internet. In this algorithm each
router maintains a table containing a single entry for each possible destination, showing the
next hop a packet must take toward that destination. The cost field in the table is a simple
measure of vector distance, i.e. the number of hops to the given destination.
Fig. 2.1.3 shows routing in a wide area network. For a packet addressed to C arriving at the
router at A, the algorithm uses the routing table at A, chooses the row for destination C,
and therefore forwards the packet on the link labelled 1. When the packet arrives at B the
same procedure is followed and link 2 is selected.
When the packet arrives at C, the routing table entry shows "local", meaning the packet
should be delivered to a local host. The routing tables are built up and maintained by the
routers, and updated whenever faults occur in the network.
RIP Routing Algorithm
Each router exchanges and updates its routing-table information using the Routing
Information Protocol (RIP), which performs the following high-level actions :
1. Periodically, and whenever the local routing table changes, each router sends the table
(in summary form, as a RIP packet) to all accessible neighbours.
2. When a table is received from a neighbouring router, if the received table shows a route
to a new destination, or a lower-cost route to an existing destination, then the local
table is updated with the new route.
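The second rule can be sketched as the following table-merge step. This is a hedged
illustration only; the field and method names are not taken from any real RIP
implementation :

import java.util.HashMap;
import java.util.Map;

// Sketch of RIP rule 2 : merge a neighbour's advertised table into the
// local one, keeping the lower-cost route for each destination.
public class RipTable {
    // destination -> cost (in hops) via the best known route
    final Map<String, Integer> cost = new HashMap<>();
    // destination -> outgoing link used to reach it
    final Map<String, String> nextHop = new HashMap<>();

    void merge(String neighbourLink, Map<String, Integer> advertised) {
        for (Map.Entry<String, Integer> e : advertised.entrySet()) {
            int viaNeighbour = e.getValue() + 1; // one extra hop to the neighbour
            Integer known = cost.get(e.getKey());
            // New destination, or a lower-cost route : update the local table.
            if (known == null || viaNeighbour < known) {
                cost.put(e.getKey(), viaNeighbour);
                nextHop.put(e.getKey(), neighbourLink);
            }
        }
    }
}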
Internetworking
Many subnets based on many network technologies are integrated to build an internetwork.
To make this possible, the following are needed :
a. A unified internetwork addressing scheme enables packets to be addressed to any host
connected to any subnets (provided by IP addresses in the internet).
b. A protocol defining the format of internetwork packets and giving rules of handling
them (IP protocol in the internet).
c. Interconnecting components that route packets to their destination in terms of
internetwork addresses (performed by internet routers in the internet).
To build an integrated network (an internetwork), many subnets of different network
technologies are integrated. The Internet makes this possible by providing the following items :
a. IP addresses b. The IP protocol c. Internet routers
Routers may in fact be general-purpose computers, and they sometimes also serve as
firewalls. They may be interconnected through subnets or by direct connections. In either
case they are responsible for forwarding internetwork packets and maintaining routing tables.
1. Hub : A common connection point for devices in a network. Hubs are commonly used
to connect segments of a LAN. A hub contains multiple ports. When a packet arrives at
one port, it is copied to the other ports so that all segments of the LAN can see all
packets.
2. Switch : A device that filters and forwards packets between LAN segments. It can
interconnect two or more workstations, but like a bridge, it observes traffic flow and
learns. When a frame arrives at a switch, the switch examines the destination address
and forwards the frame out the one necessary connection.
3. Bridge : A bridge is a device that connects two segments of the same network. The two
networks being connected can be alike or dissimilar. Unlike routers, bridges are
protocol-independent. They simply forward packets without analyzing and re-routing
messages.
4. Router : A router is a device that connects two distinct networks. Routers are similar to
bridges, but provide additional functionality, such as the ability to filter messages and
forward them to different places based on various criteria. The internet uses routers
extensively to forward packets from one host to another.
The version of IP currently in use is IPv4. The new version, IPv6, is designed to overcome
the addressing limitations of IPv4.
An IP address is written as a sequence of four decimal numbers separated by dots, and has
an equivalent symbolic domain name represented in a hierarchy. IP addresses fall into five
classes :
a. Class A : Reserved for very large networks (2^24 hosts on each).
b. Class B : Allocated for organization networks containing more than 255 hosts.
c. Class C : Allocated to all other networks (fewer than 255 hosts on each).
d. Class D : Reserved for multicasting, but this is not supported by all routers.
e. Class E : Unallocated addresses reserved for future requirements.
When an IP datagram (up to 64 Kbytes) is longer than the Maximum Transfer Unit (MTU)
of the underlying network :
a. It is broken into smaller packets at the source and reassembled at its final destination.
b. Each packet has a fragment identifier to enable out-of-order fragments to be collected.
Remote Procedure Call (RPC), originally developed by Sun Microsystems and currently
used by many UNIX-based systems, is an Application Programming Interface (API)
available for developing distributed applications.
It allows programs to execute subroutines on a remote system. The caller program, which
represents the client instance in the client/server model sends a call message to the server
process, and waits for a reply message.
The call message includes the subroutine's parameters, and the reply message contains the
results of executing the subroutine.
RPC also provides a standard way of encoding the data passed between client and server in
a portable fashion, called External Data Representation (XDR).
Traditionally the calling procedure is known as the client and the called procedure is known
as the server.
When making a remote procedure call :
1. The calling environment is suspended, procedure parameters are transferred across the
network to the environment where the procedure is to execute, and the procedure is
executed there.
2. When the procedure finishes and produces its results, its results are transferred back to
the calling environment, where execution resumes as if returning from a regular
procedure call.
The main goal of RPC is to hide the existence of the network from a program. As a result,
RPC doesn't quite fit into the OSI model :
a. The message passing nature of network communication is hidden from the user. The
user doesn't first open a connection, read and write data, and then close the connection.
Indeed, a client often does not even know they are using the network.
b. RPC often omits many of the protocol layers to improve performance. Even a small
performance improvement is important because a program may invoke RPCs often. For
example, on (diskless) Sun workstations, every file access is made via an RPC.
RPC is especially well suited for client-server (e.g., query-response) interaction in which
the flow of control alternates between the caller and callee.
Conceptually, the client and server do not both execute at the same time. Instead, the thread
of execution jumps from the caller to the callee and then back again.
The procedure call (same as function call or subroutine call) is a well-known method for
transferring control from one part of a process to another, with a return of control to the
caller.
Associated with the procedure call is the passing of arguments from the caller (the client) to
the callee (the server).
In most current systems the caller and the callee are within a single process on a given host
system. This is what we called "local procedure calls".
In a RPC, a process on the local system invokes a procedure on a remote system. The
reason we call this a "procedure call" is because the intent is to make it appear to the
programmer that a normal procedure call is taking place.
We use the term "request" to refer to the client calling the remote procedure, and the term
"response" to describe the remote procedure returning its result to the client.
1. The client process sends a request message to the server process and waits for a reply
message. The request message contains the remote procedure's parameters, among
other things.
2. Server process executes the procedure and then returns the result of procedure execution
in a reply message to the client process.
3. Once the reply message is received, the result of procedure execution is extracted, and
the caller's execution is resumed.
2. These network messages are sent to the remote system by the client stub. This requires
a system call to the local kernel.
3. The network messages are transferred to the remote system. Either a
connection-oriented or a connection-less protocol is used.
4. A server stub procedure is waiting on the remote system for the client's request. It
unmarshals the arguments from the network message and possibly converts them.
5. The server stub executes a local procedure call to invoke the actual server function,
passing it the arguments that it received in the network messages from the client stub.
6. When the server procedure is finished, it returns to the server stub with return values.
7. The server stub converts the return values, if necessary, and marshals them into one or
more network messages to send back to the client stub.
8. The messages get transferred back across the network to the client stub.
9. The client stub reads the network messages from the local kernel.
10. After possibly converting the return values, the client stub finally returns to the client
function. This appears to be a normal procedure return to the client.
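The client stub's part of these steps can be outlined as follows. This is an illustrative
sketch of a stub for a single add procedure over TCP, not the code of any particular RPC
package; the host, port and message format are assumptions :

import java.io.*;
import java.net.Socket;

// Illustrative client stub : marshal the arguments, send them, block for
// the reply, unmarshal the result.
public class AddStub {
    private final String host;
    private final int port;

    public AddStub(String host, int port) { this.host = host; this.port = port; }

    // Looks like a local call to the application, but runs remotely.
    public int add(int a, int b) throws IOException {
        try (Socket s = new Socket(host, port);
             DataOutputStream out = new DataOutputStream(s.getOutputStream());
             DataInputStream in = new DataInputStream(s.getInputStream())) {
            out.writeInt(a);          // marshal the arguments
            out.writeInt(b);
            out.flush();
            return in.readInt();      // unmarshal the return value
        }
    }
}

To the application, add(2, 3) looks exactly like a local call; the marshalling, transmission
and unmarshalling are hidden inside the stub.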
Client-Server binding
Binding is the process of connecting the client and server. The server, when it starts up,
exports its interface, identifying itself to a network name server and telling the local
runtime its dispatcher address.
The client, before issuing any calls, imports the server, which causes the RPC runtime to
lookup the server through the name service and contact the requested server to setup a
connection. The import and export are explicit calls in the code.
1. Maybe call semantics
After an RPC time-out (or when a client has crashed and restarted), the client cannot tell
whether the remote procedure (RP) has been called or not.
This is the case when no fault tolerance is built into the RPC mechanism.
Clearly, maybe semantics is not desirable.
2. At-least-once call semantics
With this call semantics, the client can assume that the RP is executed at least once.
Can be implemented by retransmission of the (call) request message on time-out.
Acceptable only if the server's operations are idempotent. That is f(x) = f(f(x)).
3. At-most-once call semantics
When an RPC returns, the client can assume that the remote procedure (RP) has been
called exactly once or not at all.
Implemented by the server's filtering of duplicate requests and caching of replies.
This ensures the RP is called exactly once if the server does not crash during execution
of the RP.
When the server crashes during the RP's execution, the partial execution may lead to
erroneous results.
In this case, we want the effect that the RP has not been executed at all.
At-most-once call semantics are for those RPC applications which require a guarantee
that multiple invocations of the same RPC call by a client will not be processed on the
server.
Such applications usually maintain state information on the server and more than one
invocation of the same RPC call must be detected in order to avoid corruption of the
state information.
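A minimal sketch of this duplicate filtering follows, with illustrative names and an
unbounded reply cache (a real server would also expire old entries) :

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of at-most-once semantics : the server remembers replies by
// request identifier and returns the cached reply for retransmissions,
// so the procedure body runs at most once per request id.
public class AtMostOnceServer {
    private final Map<Long, String> replyCache = new ConcurrentHashMap<>();

    public String handle(long requestId, String args,
                         Function<String, String> procedure) {
        // A duplicate request id means a retransmission : do not re-execute.
        return replyCache.computeIfAbsent(requestId, id -> procedure.apply(args));
    }
}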
1. Describe with diagram the role of client and server stub procedures in RPC in the
context of a procedural language. SPPU : Oct. - 18, In Sem, Marks 5
Java RMI is the Java distributed object model for facilitating communications among
distributed objects. RMI is a higher-level API built on top of sockets.
Socket-level programming allows you to pass data through sockets among computers.
RMI enables you not only to pass data (parameters and return values) among objects on
different systems, but also to invoke methods in a remote object.
Distributed object
Objects consist of a set of data and its methods. Objects provide methods, through the
invocation of which an application obtains access to services.
Distributed object systems may adopt the client-server architecture : objects are managed
by servers and their clients invoke their methods using RMI.
Remote and local method invocations
Fig. 2.3.1 shows the remote and local method invocations.
Remote method invocation : Method invocation between objects in different processes,
whether in the same computer or not.
Remote object reference : Identifier to refer to a certain remote object in a distributed
system, e.g. B's must be made available to A.
Remote interface : Every remote object has one that specifies which methods can be
invoked remotely. e.g. B and F specify what methods in remote interface.
Server interface : The server provides a set of procedures that are available for use by
clients, e.g. a file server provides procedures for reading and writing files.
Remote interface : The class of a remote object implements the methods of its remote
interface. Objects in other processes can invoke only the methods that belong to its remote
interface. A local object, however, can invoke the remote-interface methods as well as the
other methods implemented by the remote object.
Define the remote interface by extending the interface named Remote (java.rmi.Remote).
There are three processes that participate in supporting remote method invocation.
1. The client is the process that is invoking a method on a remote object.
2. The server is the process that owns the remote object. The remote object is an ordinary
object in the address space of the server process.
3. The object registry is a name server that relates objects with names. Objects are
registered with the object registry. Once an object has been registered, one can use the
object registry to obtain access to a remote object using the name of the object.
There are two kinds of classes that can be used in Java RMI.
1. A remote class is one whose instances can be used remotely. An object of such a class
can be referenced in two different ways :
a. Within the address space where the object was constructed, the object is an ordinary
object which can be used like any other object.
b. Within other address spaces, the object can be referenced using an object handle.
While there are limitations on how one can use an object handle compared to an
object, for the most part one can use object handles in the same way as an ordinary
object.
c. For simplicity, an instance of a remote class will be called a remote object.
2. A serializable class is one whose instances can be copied from one address space to
another. An instance of a serializable class will be called a serializable object. In other
words, a serializable object is one that can be marshaled.
Remote classes and interfaces
A remote class has two parts : The interface and the class itself.
The stub informs the remote reference layer that the call should be invoked, then
unmarshals the return value or exception from a marshal stream and informs the remote
reference layer that the call is complete.
Client stub responsible for :
1. Initiate remote calls
2. Marshal arguments to be sent
3. Inform the remote reference layer to invoke the call
4. Unmarshaling the return value
5. Inform remote reference the call is complete
Server skeleton responsible for :
1. Unmarshaling incoming arguments from client
2. Calling the actual remote object implementation
3. Marshaling the return value for transport back to client
The remote reference layer
The remote reference layer deals with the lower level transport interface and is responsible
for carrying out a specific remote reference protocol which is independent of the client
stubs and server skeletons.
The remote reference layer has two cooperating components : The client-side and the
server-side components.
The client-side component contains information specific to the remote server and
communicates via the transport to the server-side component. During each method
invocation, the client and server-side components perform the specific remote reference
semantics.
For example, if a remote object is part of a replicated object, the client-side component can
forward the invocation to each replica rather than just a single remote object.
Define a class that implements the server object interface, as shown in the following
outline :
public class ServerInterfaceImpl extends UnicastRemoteObject
        implements ServerInterface {
    public ServerInterfaceImpl() throws RemoteException {
        super();
    }
    // Implement it
}
The server implementation class must extend the java.rmi.server.UnicastRemoteObject
class. The UnicastRemoteObject class provides support for point-to-point active object
references using TCP streams.
3. Step 3 : Create and register server object
Create a server object from the server implementation class and register it with an RMI
registry :
ServerInterface obj = new ServerInterfaceImpl(...);
Registry registry = LocateRegistry.getRegistry();
registry.rebind("RemoteObjectName", obj);
4. Step 4 : Develop client program
Develop a client that locates a remote object and invokes its methods, as shown in the
following outline :
Registry registry = LocateRegistry.getRegistry(host);
ServerInterface server = (ServerInterface) registry.lookup("RemoteObjectName");
server.service1(...);
The syntax of the API functions is independent of the protocol being used; e.g. TCP/IP and
UNIX-domain protocols can be used by applications through a common set of functions.
This gives better portability of applications across protocol suites.
The API hides the finer details of the protocols from application programs, thereby
yielding faster and less error-prone application development.
Sockets are referenced through socket descriptors, which can be passed directly to UNIX
system I/O calls. File I/O and socket I/O look the same from the programmer's perspective.
Working with sockets is very similar to working with files. The socket ( ) and accept ( )
functions both return handles (file descriptor) and reads and writes to the sockets
requires the use of these handles (file descriptors).
In Linux, sockets and file descriptors also share the same file descriptor table. That is, if
you open a file and it returns a file descriptor with value say 8, and then immediately
open a socket, you will be given a file descriptor with value 9 to reference that socket.
Even though sockets and files share the same file descriptor table, they are still very
different. Sockets have addresses associated with them whereas files do not; note that this
also distinguishes sockets from pipes, since pipes do not have addresses associated with
them.
You cannot randomly access a socket like you can a file with lseek ( ). Sockets must be
in the correct state to perform input or output.
Socket abstraction
Socket is the basic abstraction for network communication in the socket API. Socket
defines an endpoint of communication for a process.
Operating system maintains information about the socket and its connection. Fig. 2.4.1
shows the socket and process.
Socket Creation
int sd = socket(AF_INET, SOCK_STREAM, 0);
if (sd < 0) {
    printf("socket() failed.");
    exit(1);
}
Creating a socket is in some ways similar to opening a file. This function creates a file
descriptor and returns it from the function call. You later use this file descriptor for
reading, writing and using with other socket functions.
Remember that the sockets API are generic. There must be a generic way to specify
endpoint addresses. TCP/IP requires an IP address and port number for each endpoint
address. Other protocol suites (families) may use other schemes.
In UNIX, whenever there is a need for IPC within the same machine, we use mechanisms
like signals or pipes. When we desire communication between two applications possibly
running on different machines, we need sockets.
Sockets are treated as another entry in the UNIX open file table.
Sockets provide an interface for programming networks at the transport layer.
Network communication using sockets is very much similar to performing file I/O. In
fact, socket handle is treated like file handle.
Socket-based communication is programming language independent.
To the kernel, a socket is an endpoint of communication. To an application, a socket is a
file descriptor that lets the application read/write from/to the network.
A server (program) runs on a specific computer and has a socket that is bound to a
specific port. The server waits and listens to the socket for a client to make a connection
request.
To review, there are five significant steps that a program which uses TCP must take to
establish and complete a connection. The server side would follow these steps :
1. Create a socket.
2. Listen for incoming connections from clients.
3. Accept the client connection.
4. Send and receive information.
5. Close the socket when finished, terminating the conversation.
In the case of the client, these steps are followed :
1. Create a socket.
2. Specify the address and service port of the server program.
3. Establish the connection with the server.
4. Send and receive information.
5. Close the socket when finished, terminating the conversation.
Only steps two and three differ, depending on whether it is a client or server application.
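The five steps map directly onto Java's socket classes. A minimal echo example, with an
assumed port number of 9000 :

import java.io.*;
import java.net.ServerSocket;
import java.net.Socket;

// Minimal sketch of the five server-side steps : an echo server.
public class EchoServer {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(9000)) {    // create + listen
            try (Socket client = server.accept();               // accept
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()));
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                out.println(in.readLine());                     // send and receive
            }                                                   // close
        }
    }
}

// The corresponding client : create, connect, exchange, close.
class EchoClient {
    public static void main(String[] args) throws IOException {
        try (Socket s = new Socket("localhost", 9000);          // steps 1 to 3
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            out.println("hello");
            System.out.println(in.readLine());                  // prints "hello"
        }
    }
}

Creating the ServerSocket covers steps 1 and 2 on the server, and accept() is step 3; on the
client, constructing the Socket performs creation, addressing and connection in one call.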
Fig. 2.4.2 shows a timeline of the typical scenario that takes place between a TCP client
and server.
Source queue : It is on source machine or same machine. Message can be read only from
local queue.
Destination queue : Message is stored on the queue where queue contains specification of
the destination.
Queues are managed by queue managers, which interact directly with applications. A
router or relay is a special queue manager used for forwarding messages between queues.
It works at the application level.
Message broker
A message queue broker provides delivery services for a message queue messaging system.
Message delivery relies upon a number of supporting components that handle connection
services, message routing and delivery, persistence, security, and logging.
A message server can employ one or more broker instances. Broker components are shown
in Fig. 2.4.8.
To perform this complex set of functions, a broker uses a number of different internal
components, each with a specific role in the delivery process.
The message router component performs the key message routing and delivery service, and
the others provide important support services upon which the Message Router depends.
Main broker service components and functions
1. Message Router : Manages the routing and delivery of messages.
2. Connection Services : Manages the physical connections between a broker and clients,
providing transport for incoming and outgoing messages.
3. Persistence Manager : Manages the writing of data to persistent storage so that system
failure does not result in failure to deliver messages.
4. Security Manager : Provides authentication services for users requesting connections to
a broker and authorization services for authenticated users.
5. Monitoring Service : Generates metrics and diagnostic information that can be written to
a number of output channels that an administrator can use to monitor and manage a
broker.
Group communication simplifies building reliable, efficient distributed systems. Most
current distributed operating systems are based on the Remote Procedure Call; the idea is
to hide the message passing and make the communication look like an ordinary procedure
call. Fig. 2.5.1 shows multicast communication.
Q.7 With asynchronous RPCs, the _________ immediately sends a reply back to the
client the moment the RPC request is received, after which it calls the requested
procedure.
a) Router b) Gateway c) Client d) Server
Q.12 The distributed object model is an extension of the local object model used in
_________ programming languages.
a) Function based b) Procedural based
the receiver.
c The sender keeps on executing after sending a message. The message should
be stored by the middleware.
d The sender blocks execution after sending a message and waits for response
Q.16 Event based architectures can be combined with _________ architecture yielding
what is also known as shared data spaces.
a) Layered b) Object based
Q.18 In the case of an _________ server, the server itself handles the request and, if
necessary, returns a response to the requesting client.
a) Multithreaded b) Concurrent
c) Iterative d) None
3 Synchronization
Syllabus
Contents
There is no common universal time but the speed of light is constant for all observers
irrespective of their velocity.
Timers in computers are based on frequency of oscillation of a quartz crystal. Each
computer has a timer that interrupts periodically.
Time is also an important theoretical construct in understanding how distributed executions
unfold. But time is problematic in distributed systems. Each computer may have its own
physical clock, but the clocks typically deviate, and we cannot synchronize them perfectly.
Needs for precision time :
a. Stock market buy and sell orders
b. Secure document timestamps
c. Distributed network gaming and training
d. Aviation traffic control and position reporting
e. Multimedia synchronization for real-time teleconferencing
f. Event synchronization and ordering
g. Network monitoring measurement and control.
Each computer in a DS has its own internal clock
1. Used by local processes to obtain the value of the current time
2. Processes on different computers can timestamp their events
3. But clocks on different computers may give different times
4. Computer clocks drift from perfect time and their drift rates differ from one another.
Consider a group of people going to a meeting. Each person has a watch. Each watch has a
similar, but different time. Even with the error in time, the group is able to meet and
conduct business. This is how distributed time works. It is difficult to make temporal order
of events and difficult to collect up-to-date information on the state of the entire system.
Designing and debugging algorithms for a distributed system is therefore more difficult
than for centralized systems.
While the best quartz resonators can achieve an accuracy of one second in 10 years, they
are sensitive to changes in temperature and acceleration and their resonating frequency can
change as they age.
The only problem with maintaining a concept of time is when multiple entities attempt to
do it concurrently. Two watches hardly ever agree. Computers have the same problem :
A quartz crystal on one computer will oscillate at a slightly different frequency than on
another computer, causing the clocks to tick at different rates.
The phenomenon of clocks ticking at different rates, creating an ever-widening gap in
perceived time, is known as clock drift. The difference between two clocks at any point in
time is called clock skew and is due to both clock drift and the possibility that the clocks
may have been set differently on different machines.
Fig. 3.1.1 shows skew with two clocks.
Consider two clocks A and B, where clock B runs slightly faster than clock A by
approximately two seconds per hour. This is the clock drift of B relative to A. At one point
in time, the difference in time between the two clocks is approximately 4 seconds. This is
the clock skew at that particular time.
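The numbers in this example can be checked directly; a trivial sketch, assuming the two
clocks started in agreement :

// Skew between the two clocks after t hours, given that B drifts
// 2 seconds per hour relative to A and both started in agreement.
public class SkewDemo {
    public static void main(String[] args) {
        double driftPerHour = 2.0;   // seconds per hour
        double hoursElapsed = 2.0;
        System.out.println(driftPerHour * hoursElapsed);  // skew = 4.0 seconds
    }
}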
Successive events will correspond to different timestamps only if the clock resolution is
smaller than the rate at which events can occur. The rate at which events occur depends on
such factors as the length of the processor instruction cycle.
Applications running at a given computer require only the value of the counter to
timestamp events. The date and time-of-day can be calculated from the counter value.
Clock drift may happen when computer clocks count time at different rates.
Co-ordinated Universal Time (UTC) is an international standard that is based on atomic
time. UTC signals are synchronized and broadcast regularly from land-based radio stations
and satellites.
If the computer clock is behind the time service's, it is OK to set the computer clock to be
the time service's time. However, when the computer clock runs faster, then it should be
slowed down for a period instead of set back to the time service's time directly.
Causing a computer's clock to run slow for a period can be achieved in software, without
changing the rate of the hardware clock. The hardware clock, also called a timer, is usually
a quartz crystal oscillating at a well-defined frequency.
A timer is associated with two registers, a counter and a holding register; the counter
decreases by one at each oscillation. When the counter gets to zero, an interrupt is
generated; this is called one clock tick.
Crystals run at slightly different rates; the difference in time values is called clock skew.
Clock skew causes time-related failures. Fig. 3.1.2 shows the working of a computer clock.
Working :
1. Oscillation at a well-defined frequency
2. Each crystal oscillation decrements the counter by 1
3. When the counter gets to 0, its value is reloaded from the holding register
4. When the counter is 0, an interrupt is generated, which is called a clock tick
5. At each clock tick, an interrupt service procedure adds 1 to the time stored in memory
Synchronization of physical clocks with real-world clock :
1. TAI (International Atomic Time) : Cs133 atomic clock
2. UTC (Universal Co-ordinated Time) : Modern civil time, can be received from WWV
(shortwave radio station), satellite, or network time server.
3. ITS (Internet Time Service), NTP (Network Time Protocol)
Some definitions :
1. Transit of the sun : The event of the sun's reaching its highest apparent point in the sky.
2. Solar day : The interval between two consecutive transits of the sun is called the solar
day.
3. Coordinated Universal Time (UTC) : The most accurate physical clocks known use
atomic oscillators, whose accuracy is about one part in 10^13. The output of these atomic
clocks is used as the standard for elapsed real time, known as International Atomic
Time. Co-ordinated universal time is an international standard that is based on atomic
time, but a so-called leap second is occasionally inserted or deleted to keep in step with
astronomical time.
The difference between two clocks at any point in time is called clock skew and is due to
both clock drift and the possibility that the clocks may have been set differently on different
machines.
Fig. 3.2.1 shows the drift rate of clocks.
If a clock is fast, it simply has to be made to run slower until it synchronizes. If a clock is
slow, the same method can be applied and the clock can be made to run faster until it
synchronizes.
The operating system can do this by changing the rate at which it requests interrupts. For
example, suppose the system requests an interrupt every 17 milliseconds and the clock runs
a bit too slowly. The system can request interrupts at a faster rate, say every 16 or
15 milliseconds, until the clock catches up. This adjustment changes the slope of the system
time and is known as a linear compensating function.
In the Berkeley algorithm, the time daemon (master) polls the other machines, and each of
these machines sends a timestamp as a response to the query. The master then averages the
three timestamps, the two it received and its own, computing :
(3:00 + 3:25 + 2:50)/3 = 3:05
Now it sends an offset to each machine so that the machine's time will be synchronized to
the average once the offset is applied. The machine with a time of 3:25 gets sent an offset
of – 0:20 and the machine with a time of 2:50 gets an offset of + 0:15. The server has to
adjust its own time by + 0:05.
The algorithm also has provisions to ignore readings from clocks whose skew is too great.
The master may compute a fault-tolerant average i.e. averaging values from machines
whose clocks have not drifted by more than a certain amount. If the master machine fails,
any other slave could be elected to take over.
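The averaging and offset computation can be sketched in a few lines of Python (an illustrative sketch, not from the original text; times are minutes since midnight and the skew threshold is an assumption) :

def berkeley_offsets(master_time, slave_times, max_skew=60):
    # Fault-tolerant average : ignore readings whose skew from the
    # master exceeds max_skew (assumed threshold).
    usable = [t for t in slave_times if abs(t - master_time) <= max_skew]
    avg = (master_time + sum(usable)) / (len(usable) + 1)
    # Offsets to be sent back; each machine adjusts by its own offset.
    return avg, {t: avg - t for t in slave_times + [master_time]}

# Example from the text : 3:00 = 180, 3:25 = 205, 2:50 = 170
avg, offsets = berkeley_offsets(180, [205, 170])
# avg == 185 (i.e. 3:05); offsets : 205 -> -20, 170 -> +15, 180 -> +5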
For a certain class of algorithms, it is the internal consistency of the clocks that matters.
The convention in these algorithms is to speak of logical clocks.
Lamport showed clock synchronization need not be absolute. What is important is that all
processes agree on the order in which events occur.
A logical clock Cp of a process p is a software counter that is used to timestamp events
executed by p so that the happened-before relation is respected by the timestamps.
where send(m) is the event of sending the message, and rcv(m) is the event of receiving it.
3. HB3 : If x, y and z are events such that x → y and y → z, then x → z.
If x → y, then we can find a series of events occurring at one or more processes such that
either HB1 or HB2 applies between them. The sequence of events need not be unique.
If two events are not related by the → relation (i.e., neither a → b nor b → a), then they
are concurrent (a || b).
For example, a → b → c → d → f and e → f, but a || e.
Example : Event ordering
The processes run on different machines, each with its own clock running at its own
speed.
When the clock has ticked 6 times in process P1 , it has ticked 8 times in process P2 and 10
times in process P3. Each clock runs at a constant rate, but rate varies according to the
crystals.
At time 6, process P1 sends message m1 to process P2 . The clock in process 2 reads 16
when it arrives. Process 2 will conclude that it took 10 ticks to reach from process 1 to
process 2.
According to this reasoning, message m2 from process 2 to process 3 takes 16 ticks.
In Fig. 3.3.2 (a), message m3 from process 3 to process 2 leaves at 60 and arrives at 56.
Similarly, message m4 from process 2 to process 1 leaves at 64 and arrives at 54. These
values are not possible.
Lamport's solution is given in Fig. 3.3.2 (b), which uses the happens-before relation. Since
message m3 left at 60, it must arrive at 61 or later. Therefore, each message carries its
sending time according to the sender's clock.
When a message arrives and the receiver's clock shows a value prior to the time the message
was sent, the receiver fast-forwards its clock to be one more than the sending time.
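As an illustration (a minimal sketch, not the book's code), a Lamport clock in Python; the process id tie-break anticipates the total ordering discussed next :

class LamportClock:
    def __init__(self, pid):
        self.pid = pid          # used only to break timestamp ties
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        self.time += 1
        return self.time        # timestamp carried by the message

    def receive(self, msg_time):
        # Fast-forward past the sender's timestamp if necessary.
        self.time = max(self.time, msg_time) + 1
        return self.time

    def stamp(self, event_time):
        # (time, pid) pairs yield the total order used for multicasting.
        return (event_time, self.pid)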
Totally ordered multicasting :
We can use logical clocks satisfying the clock condition to place a total ordering on the
set of all system events : simply order the events by the times at which they occur.
To break ties, Lamport proposed the use of any arbitrary total ordering of the processes,
i.e. the process id.
Using this method, we can assign a unique timestamp to each event in a distributed system
and so provide a total ordering of all events. This is very useful in distributed systems, for
example for solving the mutual exclusion problem.
We sometimes need to guarantee that concurrent updates on a replicated database are seen
in the same order everywhere :
P1 adds ₹ 100 to an account (initial value : ₹ 1000)
P2 increments the account by 1 %
There are two replicas. Fig. 3.3.3 shows an update of a replicated database leaving it in
an inconsistent state.
Fig. 3.3.4
VCi[j] represents the number of events Pj produced that belong to the current causal past of
Pi. When a process Pi produces an event e, it can associate with that event a vector
timestamp whose value equals the current value of VCi.
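A vector clock can be sketched similarly (an illustrative sketch; n is the number of processes) :

class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n            # vc[j] = events of Pj in our causal past

    def local_event(self):
        self.vc[self.pid] += 1
        return list(self.vc)         # timestamp attached to the event

    def send(self):
        return self.local_event()    # the message carries the current vector

    def receive(self, msg_vc):
        # Component-wise maximum, then count the receive event itself.
        self.vc = [max(a, b) for a, b in zip(self.vc, msg_vc)]
        self.vc[self.pid] += 1
        return list(self.vc)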
Example : Assign Lamport clock values to all the events in the above timing
diagram. Assume that each process's logical clock is set to 0 initially.
Fig. 3.3.5
Fig. 3.3.6
Mutual exclusion ensures that concurrent processes make a serialized access to shared
resources or data. It requires that the actions performed by a user on a shared resource must
be atomic.
In a distributed system neither shared variables nor a local kernel can be used in order to
implement mutual exclusion. Thus, mutual exclusion has to be based exclusively on
message passing, in the context of unpredictable message delays and no complete
knowledge of the state of the system.
Mutual exclusion : Makes sure that concurrent processes access shared resources or data in
a serialized way. If a process, say Pi, is executing in its critical section, then no other
process can be executing in its critical section.
Example : Updating a DB or sending control signals to an I/O device
Problem of mutual exclusion frequently arises in distributed systems whenever concurrent
access to shared resources by several sites is involved.
Mutual exclusion is a fundamental issue in the design of distributed systems.
Entry section : The code executed in preparation for entering the critical section
Critical section : The code to be protected from concurrent execution
Exit section : The code executed upon leaving the critical section
Remainder section : The rest of the code
Each process cycles through these sections in the order : remainder, entry, critical, exit.
System model
The system consists of N sites, S1, S2, ...., SN. We assume that a single process is running
on each site. The process at site Si is denoted by Pi.
At any instant, a site may have several requests for critical section. A site queues up these
requests and serves them one at a time.
A site may be in one of three states :
1. Requesting CS
2. Executing CS
3. Neither requesting nor executing requests for CS
Classification of Mutual Exclusion
Different types of algorithms are used to solve the problem of mutual exclusion in
distributed systems, but these algorithms differ in their communication topology. The
topology may be a ring, bus, star, etc. They also maintain different types of information.
These algorithms are divided into two classes :
1. Non-token based : Require multiple rounds of message exchanges for local states to
stabilize
2. Token based : Permission passes around from one site to another. Site is allowed to
enter its critical section if it possesses the token and it continues to hold the token until
the execution of the critical section is over.
The token passes in one direction around the ring, continuously. When a process receives
the token from its neighbour and does not require access to the critical section, it
immediately forwards the token to the next neighbour in the ring.
If it requires access to the critical section, the process :
1. Retains the token,
2. Performs the critical section, and then,
3. Relinquishes access to the critical section, and
4. Forwards the token on to the next neighbour in the ring.
Fig. 3.4.4 shows ring based algorithm.
Once again it is straightforward to determine that this algorithm satisfies the safety and
liveness properties. However, once again we fail to satisfy the fairness property.
Suppose again we have two processes P1 and P4, and consider the following events :
1. Process P1 wishes to enter the critical section but must wait for the token to reach it.
2. Process P1 sends a message m to process P4.
3. The token is currently between process P1 and P4 within the ring, but the message m
reaches process P4 before the token.
4. Process P4 after receiving message m wishes to enter the critical section
5. The token reaches process P4 which uses it to enter the critical section before process
P1.
Performance
Constant bandwidth consumption
Entry delay between 0 and N message transmission times
Synchronization delay between 1 and N message transmission times
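A simulation-style sketch of the ring algorithm (Python, illustrative; a real implementation passes the token in messages between machines) :

class RingNode:
    def __init__(self, pid):
        self.pid = pid
        self.wants_cs = False

    def critical_section(self):
        print("node", self.pid, "in critical section")

def circulate_token(ring, rounds=1):
    # The token visits nodes in ring order; only the holder may enter its CS.
    for _ in range(rounds):
        for node in ring:
            if node.wants_cs:
                node.critical_section()
                node.wants_cs = False   # relinquish, then forward the token

ring = [RingNode(i) for i in range(4)]
ring[2].wants_cs = True
circulate_token(ring)                   # prints : node 2 in critical section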
Fig. 3.4.5 (a) to (d) illustrate the algorithm step by step; in step 2, site S2 enters the
critical section.
Each process Pi is associated with a voting set Vi of processes. The set Vi for the process Pi
is chosen such that :
1. Pi ∈ Vi : A process is in its own voting set.
2. Vi ∩ Vj ≠ ∅ : There is at least one process in the overlap between any two voting sets.
3. |Vi| = |Vj| : All voting sets are the same size.
4. Each process Pi is contained within M voting sets.
When a processor wants to enter a critical section, it sends a request to all members of its
district. It may enter, if it gets a grant from all members. When a processor receives a
request it answers with yes, if it has not already cast its vote. On exit it informs its district to
enable a new voting.
As before each process maintains a state variable which can be one of the following :
1. Released : Does not have access to the critical section and does not require it.
2. Wanted : Does not have access to the critical section but does require it.
3. Held : Currently has access to the critical section.
In addition each process maintains a boolean variable indicating whether or not the process
has "voted". Of course voting is not a one-time action. This variable really indicates
whether some process within the voting set has access to the critical section and has yet to
release it. To begin with, these variables are set to "Released" and False respectively.
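The voting rule can be sketched as follows (Python, illustrative; message passing is reduced to direct method calls) :

class Voter:
    def __init__(self, pid):
        self.pid = pid
        self.voted = False    # True while some member of our set holds the CS

    def request(self):
        if not self.voted:    # grant the vote only if we have not voted yet
            self.voted = True
            return True
        return False

    def release(self):
        self.voted = False    # enable a new round of voting

def try_enter(voting_set):
    granted = [v for v in voting_set if v.request()]
    if len(granted) == len(voting_set):
        return True           # all votes collected : enter the critical section
    for v in granted:
        v.release()           # back off; real Maekawa queues deferred requests
    return False

A real implementation must also queue requests it cannot grant immediately and take extra care to avoid deadlock.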
Processes 5 and 6 both respond with OK, as shown in Fig. 3.5.1 (b).
Upon getting the first of these responses, 4 knows that its job is over. It knows that one of
these will take over and become co-ordinator.
In Fig. 3.5.1 (c), both 5 and 6 hold elections, each one only sending messages to those
processes higher than itself.
If there is state information to be collected from disk or elsewhere to pick up where the old
co-ordinator left off, 6 must now do what is needed. When it is ready to take over,
6 announces this by sending a CO-ORDINATOR message to all running processes.
When 4 gets this message, it can now continue with the operation it was trying to do when
it discovered that 7 was dead, but using 6 as the co-ordinator this time. In this way the
failure of 7 is handled and the work can continue.
If process 7 is ever restarted, it will just send all the others a CO-ORDINATOR message
and bully them into submission.
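A bully election sketch (Python, illustrative; the alive set stands in for a failure detector and message passing is simplified to function calls) :

def bully_election(pids, alive, starter):
    # Any higher process that answers OK takes over the election;
    # if nobody higher answers, the starter wins.
    higher = [p for p in pids if p > starter and p in alive]
    if not higher:
        return starter
    return bully_election(pids, alive, min(higher))

# Example from the text : 7 has crashed and 4 starts the election.
print(bully_election(list(range(1, 8)), {1, 2, 3, 4, 5, 6}, 4))   # -> 6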
Suppose two processes, 2 and 5, simultaneously discover that the previous co-ordinator,
process 7, has crashed. Each builds an ELECTION message and starts circulating it,
independently of the other.
Both messages will go all the way around, and both 2 and 5 will convert them into
CO-ORDINATOR messages with exactly the same members and in the same order. When
both have gone around again, both will be removed. It does no harm to have extra messages
circulating; at worst this consumes a little bandwidth, which is not considered wasteful.
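A sketch of the ring election (Python, illustrative; the ELECTION message collects live ids as it circulates and the highest id becomes co-ordinator) :

def ring_election(ring, alive, starter):
    members, n, i = [], len(ring), ring.index(starter)
    while True:
        pid = ring[i % n]
        if pid in alive:
            if pid == starter and members:
                break               # the message has come all the way around
            members.append(pid)
        i += 1                      # dead nodes are simply skipped
    return max(members)             # carried in the CO-ORDINATOR message

print(ring_election([1, 2, 3, 4, 5, 6, 7], {1, 2, 3, 4, 5, 6}, 2))   # -> 6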
1. Explain in detail ring algorithm. SPPU : Dec. - 18, End sem, Marks 4
A satellite continuously broadcasts its position, and time stamps each message with its local
time. This broadcasting allows every receiver on Earth to accurately compute its own
position using, in principle, only three satellites.
In order to compute a position, consider first the two-dimensional case, in which two
satellites are drawn, along with the circles representing points at the same distance from
each respective satellite.
The y-axis represents the height, while the x-axis represents a straight line along the Earth's
surface at sea level. The intersection of the two circles is a unique point. Because the GPS
receiver does not carry an atomic clock, the measured distances between the receiver and
the GPS satellites contain errors originating from the clock error; such a measured distance
is called the pseudorange.
The real distance is simply computed as :
Ri = sqrt((Xsi – X)^2 + (Ysi – Y)^2 + (Zsi – Z)^2)
where Ri is the real distance between the i-th satellite and the receiver P, assuming the
position of the satellite Si and the receiver P under the geocentric rectangular coordinate
system is (Xsi, Ysi, Zsi) and (X, Y, Z), respectively.
Fig. 3.6.2
In GPS, node P can compute its own coordinates (xp, yp) by solving three equations in
the two unknowns xp and yp :
di^2 = (xi – xp)^2 + (yi – yp)^2   (i = 1, 2, 3)
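Numerically, the 2-D case reduces to a small linear system : subtracting the first equation from the other two eliminates the quadratic terms. A sketch in Python (illustrative; the anchor positions and ranges are made-up values) :

import math

def trilaterate_2d(anchors, dists):
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = dists
    # Linearised system A [xp, yp]^T = b
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

anchors = [(0, 0), (10, 0), (0, 10)]          # assumed satellite positions
p = (3, 4)                                    # true receiver position
dists = [math.dist(a, p) for a in anchors]    # exact ranges, no clock error
print(trilaterate_2d(anchors, dists))         # ~ (3.0, 4.0)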
Fig. 3.7.1
Key components :
1. Publishers : Publishers generate event data and publishes them.
2. Subscribers : Subscribers submit their subscriptions and process the events received.
3. P/S service : It's the mediator/broker that filters and routes events from publishers to
interested subscribers.
Publishers form a one-to-many relationship with their subscribers, but the publishers do not
know who is subscribed. Subscribers also do not need to know the publisher, as long as
they can specify which kind of messages they would like to receive.
Event space is divided in topics, corresponding to logical channels. The participants
subscribe for a topic and publish on a topic.
Publish-subscribe is also a key component of Google's infrastructure.
Examples of Internet applications that can use a publish/subscribe system are multi-party
messaging, personal information management, information sharing, on-line news
distribution, service discovery and electronic auctions.
Characteristics of publish-subscribe system
1. Asynchronous communication : Publishers and subscribers are loosely coupled.
2. Many-to-many interaction between publishers and subscribers.
3. Content-based pub/sub is very expressive.
4. Heterogeneous : Distributed event-based systems allow heterogeneous components to
be connected across the Internet.
1. Content Based :
In channel-based systems, publishers publish events to named channels and subscribers
subscribe to one of these named channels to receive all events sent to that channel.
In content-based publish/subscribe, by contrast, notifications are not classified
according to some pre-defined external criterion (such as a topic name), but according
to their content.
The advantage of a content-based system is its flexibility : it gives more flexibility and
power to subscribers by allowing arbitrary, customized queries over the contents of
events.
In most content-based systems, events are viewed as sets of values of primitive types or
records and properties of events are viewed as fields of such structures.
In a content-based distributed system, messages from publishers do not contain any
address; instead, they are routed through the system based on their content. A network
of brokers can be formed to create a content-based routing system.
Advantages :
1. A notification that does not match any subscription is not sent to any client, saving
network resources.
2. Enable subscribers to describe runtime properties of the message objects they wish to
receive.
Disadvantages :
1. Expressive, but higher runtime overhead.
2. It requires complex protocols/implementation to determine the subscriber.
2. Topic-based :
Topic based is also known as subject based. Message belongs to one of a fixed set of
what are variously referred to as groups, channels or topics. Subscription targets a
group, channel or topic and the user receives all events that are associated with that
group.
In a topic-based system, processes exchange information through a set of predefined
subjects which represent many-to-many distinct (and fixed) logical channels.
For example, in a subject-based system for stock trading, a participant could select one
or two stocks and then subscribe based on the stock name, if that were one of the valid
groups.
Advantages :
1. Efficient implementations
2. Routing is simple
Disadvantages :
1. Very limited expressiveness is offered to subscribers.
2. Inefficient use of bandwidth : A subscriber has to subscribe to a whole topic even if
he/she is interested only in certain specific criteria.
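A minimal topic-based broker (Python, an illustrative sketch; a real P/S service routes events across a network of brokers) :

from collections import defaultdict

class Broker:
    def __init__(self):
        self.subs = defaultdict(list)        # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subs[topic].append(callback)

    def publish(self, topic, event):
        # The publisher does not know who is subscribed.
        for deliver in self.subs[topic]:
            deliver(event)

broker = Broker()
broker.subscribe("stocks/ACME", lambda e: print("got :", e))
broker.publish("stocks/ACME", {"price": 101.5})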
Clients connect to one of several distributed access points. The access points are
themselves inter-connected through message routers that cooperate to form a distributed,
coherent communication service.
For example : CORBA event services, JMS, JEDI etc.
2. Peer-to-Peer model
In a gossip-based PSS, protocol execution at each node is divided into periodic cycles.
In each cycle, every node selects a node from its partial view and exchanges a subset of
its partial view with the selected node.
Subsequently, both nodes update their partial views. Implementations of a PSS vary
based on a number of different policies :
i. Node selection : Determines how a node selects another node to exchange
information with. It can be either randomly (rand), or based on the node’s age (tail).
ii. View propagation : Determines how to exchange views with the selected node. A
node can send its view with or without expecting a reply, called push-pull and push,
respectively.
iii. View selection : Determines how a node updates its view after receiving the nodes’
descriptors from the other node.
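One periodic cycle combining these policies can be sketched as follows (Python, illustrative; send_view stands for the network exchange and is an assumption) :

import random

def gossip_cycle(node, send_view, view_size=8, exchange=4):
    # node.view maps peer descriptors to their age.
    peer = random.choice(list(node.view))               # node selection : rand
    k = min(exchange, len(node.view))
    subset = dict(random.sample(list(node.view.items()), k))
    reply = send_view(peer, subset)                     # push-pull propagation
    merged = {**node.view, **reply}
    # View selection : keep the view_size freshest descriptors (smallest age).
    node.view = dict(sorted(merged.items(), key=lambda kv: kv[1])[:view_size])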
d the set of NTP servers with which you are currently synchronizing.
b sending a token around a set of nodes. Whoever has the token is the coordinator.
c sending a message around all available nodes and choosing the first one on the
resultant list.
d building a list of all live nodes and choosing the largest numbered node in the list.
Q.5 The Ricart and Agrawala distributed mutual exclusion algorithm is _________.
a more efficient and more fault tolerant than a centralized algorithm.
Q.6 A client has a time of 5:05 and a server has a time of 5:25. Using the Berkeley
algorithm, the client's clock will be set to :
a 5:15 b 5:20 c 5:25 d 5:30
d Hardware multicast.
c Assigns the role of coordinator to the process holding the token at the time of
election.
d Picks the process with the largest ID.
Q.9 Which mutual exclusion algorithm works when the membership of the group is
unknown ?
a Centralized b Ricart-Agrawala
d None of above
The naming and locating facilities jointly form a naming system that provides the users
with an abstraction of an object that hides the details of how and where an object is actually
located in the network.
It provides a further level of abstraction when dealing with object replicas. Given an object
name, it returns a set of the locations of the object's replicas. The naming system plays a
very important role in achieving the goal of :
1. Location transparency,
2. Facilitating transparent migration and replication of objects,
3. Object sharing
6. Group naming : A naming system should allow many different objects to be identified
by the same name. Such a facility is useful to support broadcast facility or to group
objects for conferencing or other applications.
7. Meaningful names : A name can be simply any character string identifying some
object. However, for users, meaningful names are preferred to lower level identifiers
such as memory pointers, disk block numbers or network addresses.
8. Performance : The performance measurement of a naming system is the amount of
time needed to map an object's name to its attributes, such as its location. Naming
system should be efficient in the sense that the number of messages exchanged in a
name-mapping operation should be as small as possible.
9. Fault tolerance : A naming system should be capable of tolerating, to some extent,
faults that occur due to the failure of a node or a communication link in a distributed
system network. That is, the naming system should continue functioning, perhaps in a
degraded form, in the event of these failures.
10. Replication transparency : In a distributed system, replicas of an object are generally
created to improve performance and reliability. A naming system should support the
use of multiple copies of the same object in a user-transparent manner.
The cost is high if the object locating mechanism maps to node N3 instead of node N1.
Forwarding Pointers
To locate mobile entities, concept of forwarding pointers is used. Forwarding pointers
enable locating mobile entities. Mobile entities move from one access point to another.
When an entity moves from place A to another place B, it leaves behind (at A) a reference
to its new location at B.
Advantage
1. Simple : As soon as the first name is located using traditional naming service, the chain
of forwarding pointers can be used to find the current address.
Drawbacks
1) The chain can be too long - locating becomes expensive.
2) All the intermediary locations in a chain have to maintain their pointers.
3) Vulnerability if links are broken.
Hence, making sure that chains are short and that forwarding pointers are robust is an
important issue.
Chord is a protocol and algorithm for a peer-to-peer distributed hash table. A distributed
hash table stores key-value pairs by assigning keys to different computers (known as
"nodes"); a node will store the values for all the keys for which it is responsible.
Chord specifies how keys are assigned to nodes, and how a node can discover the value for
a given key by first locating the node responsible for that key. Chord assigns an m-bit
identifier (randomly chosen) to each node. A node can be contacted through its network
address.
The Chord protocol supports just one operation : Given a key, it will determine the node
responsible for storing the key's value. Chord does not itself store keys and values.
A node generates its identifier by picking a value randomly from the hash space. The node
joins the DHT and determines who its predecessor and successor are in the table.
Predecessor(n) : The node with the highest identifier less than n's identifier, allowing for
wrapround.
Successor(n) : The node with the lowest identifier greater than n's identifier, allowing for
wrapround.
A node is then responsible for its own identifier and the identifiers between its identifier
and its predecessor's identifier. Fig. 4.2.1 shows identifier circle for 3 bit identifier.
Internally, chord uses a consistent hash function for mapping keys to node locations. The
consistent hash function of chord is based on standard hash functions like SHA1 that
produces m bit output. The nodes are hashed based on their IP address, while key, value
pair is hashed based on their key.
Identifiers are arranged on an identifier circle modulo 2^m; this identifier ring is called
the Chord ring.
Key k is assigned to the node whose identifier is equal to or greater than the key's
identifier. This node is called successor(k) and is the first node clockwise from k.
Key k is assigned to the first node whose identifier is equal to or follows (the identifier of)
k in the identifier space. This node is the successor node of key k, denoted by successor(k).
If each node knows only how to contact its current successor node on the identifier circle,
all nodes can be visited in linear order. Queries for a given identifier can be passed around
the circle via these successor pointers until they encounter the node that contains the key.
Fig. 4.2.2 shows chord with finger table.
The i-th entry of node n will contain successor((n + 2^(i–1)) mod 2^m). The first entry of
the finger table is actually the node's immediate successor.
Every time a node wants to look up a key k, it will pass the query to the closest successor or
predecessor (depending on the finger table) of k in its finger table (the "largest" one on the
circle whose ID is smaller than k), until a node finds out the key is stored in its immediate
successor.
With such a finger table, the number of nodes that must be contacted to find a successor in
an N-node network is O(log N)
When a node n joins the network, certain keys previously assigned to n's successor now
become assigned to n. When node n leaves the network, all of its assigned keys are
reassigned to n's successor.
Each node n maintains a routing table with up to m entries, called the finger table. The
i-th entry in the table at node n contains the identity of the first node s that succeeds n
by at least 2^(i–1) on the identifier circle :
s = successor(n + 2^(i–1))
where s is called the i-th finger of node n, denoted by n.finger(i).
A finger table entry includes both the Chord identifier and the IP address (and port number)
of the relevant node. The first finger of n is the immediate successor of n on the circle.
DHT construction
Use a logical name space, called the identifier space, consisting of identifiers
{0, 1, 2, …, N – 1}. Identifier space is a logical ring modulo N.
Every node picks a random identifier through a hash function H.
Example : Space N = 16 {0,…,15}
Five nodes a, b, c, d, e. H(a) = 6, H(b) = 5, H(c) = 0, H(d) = 11, H(e) = 2
Fig. 4.2.3 shows chord ring and successor.
The successor of an identifier is the first node met going in clockwise direction starting at
the identifier.
succ(x) is the first node on the ring with an id greater than or equal to x. Here
succ(12) = 0, succ(1) = 2 and succ(6) = 6.
The node s is called the i-th finger of node n, denoted by n.finger[i].node. The first
finger of n is its immediate successor on the circle.
When a node n does not know the successor of a key k, it sends a "find successor" request
to an intermediate node whose ID is closer to k.
Node n finds the intermediate node by searching its finger table for the closest finger f
preceding k, and sends the find successor request to f.
Node f looks in its finger table for the closest entry preceding k, and sends that back to n.
As a result n learns about nodes closer and closer to the target ID.
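A finger-table lookup can be sketched as follows (Python, illustrative; the ring is simulated in one process, using the example nodes 0, 2, 5, 6 and 11 from the text) :

M = 4                                     # identifier space : 0 .. 2**M - 1

def between(x, a, b):
    # True if x lies on the circular open interval (a, b).
    return (a < x < b) if a < b else (x > a or x < b)

class ChordNode:
    def __init__(self, ident):
        self.id = ident
        self.finger = []                  # finger[i] = successor(id + 2**i)

    def find_successor(self, key):
        succ = self.finger[0]             # first finger = immediate successor
        if key == succ.id or between(key, self.id, succ.id):
            return succ                   # the key is owned by our successor
        for node in reversed(self.finger):
            if between(node.id, self.id, key):
                return node.find_successor(key)   # closest preceding finger
        return succ

nodes = {i: ChordNode(i) for i in (0, 2, 5, 6, 11)}
ids = sorted(nodes)
def succ_of(x):
    return nodes[min((i for i in ids if i >= x), default=ids[0])]
for n in nodes.values():
    n.finger = [succ_of((n.id + 2**i) % 2**M) for i in range(M)]
print(nodes[2].find_successor(12).id)     # -> 0, i.e. succ(12) = 0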
Data Storing
Fig. 4.2.5 shows data storing.
The name assigned to machines must be carefully selected from a name space with
complete control over the binding between the names and IP addresses.
i) Flat name spaces :
In DNS, names are defined in an inverted tree structure with the root at the top. The tree
can have only 128 levels : Level 0 to Level 127.
Each node in the tree has a label, which is a string with a maximum of 63 characters. The
root label is a null string, i.e. an empty string.
Each node in the tree also has a domain name; a full domain name is a sequence of labels
separated by dots (.). Fig. 4.3.2 shows the domain names and labels.
In a fully qualified domain name, the last label is terminated by a null string. A Fully
Qualified Domain Name (FQDN) contains the full name of a host.
For example, sinhgad.it.edu.
If a label is not terminated by a null string, it is called a Partially Qualified Domain Name
(PQDN). It starts from a node but does not reach the root.
Hierarchy of Name Servers
To distribute the information among many computers, DNS servers are used, creating as
many domains as there are first-level nodes. Fig. 4.3.3 shows the hierarchy of name servers.
Within a zone, a server is responsible and has some authority. The server builds a database
called a zone file and keeps all the information for every node under that domain.
A domain and a zone are the same if the server accepts responsibility for the domain and
does not divide it into subdomains. They are different if the server divides its domain into
subdomains and delegates part of its authority to other servers.
A root server is a server whose zone consists of the whole tree. A root server usually does
not store any information about domains but delegates its authority to other servers.
Primary server : It stores a file about the zone for which it is an authority. It is responsible
for creating, maintaining and updating the zone file.
Secondary server : It transfers the complete information about a zone from another server
and stores the file on its local disk. These servers neither create nor update the zone files.
Iterative Resolution
Only a single resolution is made and returned (not recursive).
Client must now explicitly contact different name servers if further resolution is needed.
If the server is an authority for the name, it sends the answer. If it is not, it returns the IP
address of the server that it thinks can resolve the query. The client is responsible for
repeating the query to this second server. This process is called iterative resolution because
the client repeats the same query to multiple servers.
Fig. 4.3.6 shows iterative resolution.
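For illustration, iterative resolution can be scripted with the dnspython package (a sketch under simplifying assumptions : glue A records are always present, and 198.41.0.4 is a root server) :

import dns.message
import dns.query
import dns.rdatatype

def iterative_resolve(name, server="198.41.0.4"):
    # Follow referrals downwards, repeating the query at each server.
    while True:
        query = dns.message.make_query(name, dns.rdatatype.A)
        response = dns.query.udp(query, server, timeout=3)
        if response.answer:                       # the authoritative answer
            return response.answer[0][0].to_text()
        for rrset in response.additional:         # referral with glue records
            if rrset.rdtype == dns.rdatatype.A:
                server = rrset[0].to_text()       # ask the next server
                break
        else:
            raise RuntimeError("referral without glue; resolve the NS name first")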
Conceptually, name resolution proceeds in a top-down fashion.
Name resolution can occur in one of two different ways : Recursive resolution and Iterative
resolution.
Name servers use name caching to optimize search costs.
Time To Live (TTL) is used to guarantee a name binding during its time interval. When
the time expires, the cached name binding is no longer valid, so the client must make a
direct name resolution request once again.
A resolver can have multiple requests outstanding at any time. Hence the identification field
is used to relate a subsequent response message to an earlier request message.
The name resolver passes the request message to its local domain name server using
TCP/IP. If the request is for a server on this network, the local domain name server obtains
the corresponding IP address from its DIB and returns it in a reply message.
Step 1 : It sends a query to the local name server, cs.vu.nl. This query contains the domain
name sought, the type (A) and the class (IN).
Step 2 : The local name server has never had a query for this domain before and knows
nothing about it. It may ask a few other nearby name servers, but if none of them know, it
sends a UDP packet to the server for edu given in its database, edu-server.net.
Step 3 : It is unlikely that this server knows the address of india.cs.stes.edu and probably
does not know cs.stes.edu either, but it must know all of its own children, so it forwards the
request to the name server for stes.edu.
Step 4 : In turn, this one forwards the request to cs.stes.edu, which must have the
authoritative resource records.
Step 5 - 8 : Each request is from a client to a server, the resource record requested works its
way back.
Once these records get back to the cs.vu.nl name server, they will be entered into a cache
there, in case they are needed later.
Sometimes users require a service, but they are not concerned with which system entity
supplies that service. Attributes may then be used as values to be looked up.
Directory service is a service that stores collections of bindings between names and
attributes and that looks up entries that match attribute-based specifications. Sometimes
called yellow pages services or attribute-based name services. A directory service returns
the sets of attributes of any objects found to match some specified attributes.
Discovery Services
A discovery service is a directory service that registers the services provided in a
spontaneous networking environment. It provides an interface for automatically registering
and de-registering services, as well as an interface for clients to look up the services they
require.
Directory service is automatically updated as the network configuration changes and meets
the needs of clients in spontaneous networks. It also discovers services required by a client
(who may be mobile) within the current scope, for example, to find the most suitable
printing service for image files after arriving at a hotel.
Examples of discovery services : Jini discovery service, the 'service location protocol', the
'simple service discovery protocol', the 'secure discovery service'.
Example of discovery service : A printer may register its attributes with the discovery
service as follows :
'resourceClass = printer, type=laser, color=yes, resolution=600dpi, location=room101,
url=http://www.collegeNW.com/services/laserprinter'
4.4.2 LDAP
LDAP stands for Lightweight Directory Access Protocol. LDAP defines a standard method
for accessing and updating information in a directory. It has gained wide acceptance as the
directory access method of the Internet and is therefore also becoming strategic within
corporate intranets.
LDAP is based on X.500. It is a fast growing technology for accessing common directory
information. Fig. 4.4.1 shows LDAP uses X.500.
Directories are usually accessed using the client/server model of communication. LDAP
defines a message protocol used by directory clients and directory servers but does not
define a programming interface for the client.
X.500 organizes directory entries in a hierarchical name space capable of supporting large
amounts of information. It also defines powerful search capabilities to make retrieving
information easier. Because of its functionality and scalability, X.500 is often used together
with add-on modules for interoperation between incompatible directory services. It
specifies that communication between the directory client and the directory server uses the
Directory Access Protocol (DAP).
LDAP defines a communication protocol. Every directory needs a namespace. The LDAP
namespace is the system used to reference objects in an LDAP directory. Each object must
have a name.
Namespace hierarchy allows management control. DNS is by definition hierarchical in
nature. The LDAP name-space is hierarchical too. LDAP uses strings to represent data
rather than complicated structured syntaxes such as ASN.1
LDAP defines a set of server operations used to manipulate the data stored by the directory.
LDAP uses TCP/IP for its communications. For a client to be able to connect to an LDAP
directory, it must open a TCP/IP session with the LDAP server.
LDAP minimizes the overhead to establish a session allowing multiple operations from the
same client session. LDAP defines operations for accessing and modifying directory entries
such as :
1. Searching for entries meeting user-specified criteria.
2. Adding an entry.
3. Deleting an entry.
4. Modifying an entry.
5. Modifying the distinguished name or relative distinguished name of an entry (move).
6. Comparing an entry.
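For illustration, these operations map directly onto a client library such as Python's ldap3 (a sketch; the host, credentials and distinguished names are assumptions) :

from ldap3 import Server, Connection, MODIFY_REPLACE

server = Server("ldap://ldap.example.com")           # assumed host
conn = Connection(server, "cn=admin,dc=example,dc=com", "secret",
                  auto_bind=True)                    # opens the TCP session

# Search for entries meeting user-specified criteria.
conn.search("dc=example,dc=com", "(cn=J*)", attributes=["cn", "mail"])

# Add, modify and delete an entry.
conn.add("cn=Jane,dc=example,dc=com", "inetOrgPerson",
         {"sn": "Doe", "mail": "[email protected]"})
conn.modify("cn=Jane,dc=example,dc=com",
            {"mail": [(MODIFY_REPLACE, ["[email protected]"])]})
conn.delete("cn=Jane,dc=example,dc=com")
conn.unbind()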
A distributed file system enables programs to store and access remote files exactly as they
do local ones.
Two distributed file systems have been in widespread use for a decade or more :
a. Sun Network File System (NFS).
b. Andrew File System (AFS).
A file system is an abstraction that enables users to read, manipulate and organize data.
Typically the data is stored in units known as files, in a hierarchical tree where the nodes
are known as directories.
The file system enables a uniform view, independent of the underlying storage devices
which can range between anything from floppy drives to hard drives and flash memory
cards. Since file systems evolved from stand-alone computers the connection between the
logical file system and the storage device was typically a one-to-one mapping.
Even software RAID that is used to distribute the data on multiple storage devices is
typically implemented below the file system layer.
Distributed file system is a resource management component of a distributed operating
system. Distributed file system is a part of distributed system that provides a user with a
unified view of the files on the network. A machine that holds the shared files is called a
server, a machine that accesses the files is called a client.
The file systems in the 1970s were developed for centralized computer systems, where the
data was only accessed by one user at a time. When multiple users and processes were to
access files at the same time a notion of locking was introduced. There are two kinds of
locks, read and write.
Goals of distributed file systems are as follows :
1. Network transparency : Clients should be able to access remote files using the same
operations that apply to local files.
2. High availability : Users should have the same easy access to files, irrespective of their
physical location.
c. Naming transparency : The name of a file should give no hint as to where the file is
located.
d. Replication transparency : The clients do not need to know the existence or
locations of multiple file copies.
2. User mobility : A user should not be forced to work on a specific node but should have
the flexibility to work on different nodes at different times.
3. Performance : The performance of the file system is usually measured as the average
amount of time needed to satisfy client requests.
4. Scalability : A good distributed file system should be designed to easily cope with the
growth of nodes and users in the system.
5. High availability : DFS should continue to function even when partial failures occur
due to the failure of one or more components, such as a communication link failure, a
machine failure or a storage device crash.
6. High reliability : In a good distributed file system, the probability of loss of stored data
should be minimized as far as practicable.
7. Security : DFS should be secure so that its users can be confident of the privacy of their
data.
The directory service provides a mapping between text names for files and their UFIDs.
Clients may obtain the UFID of a file by quoting its text name to the directory service.
The directory service supports the functions needed to generate directories and to add new
files to directories.
3. Client module :
It runs on each computer and provides integrated service (flat file and directory) as a
single API to application programs. For example, in UNIX hosts, a client module
emulates the full set of UNIX file operations.
It holds information about the network locations of flat-file and directory server
processes; and achieves better performance through implementation of a cache of
recently used file blocks at the client.
Flat file service operations :
1. Read(FileId, i, n) -> Data — throws BadPosition unless 1 ≤ i ≤ Length(File) : Reads a
sequence of up to n items from a file starting at item i and returns it in Data.
2. Write(FileId, i, Data) — throws BadPosition unless 1 ≤ i ≤ Length(File) + 1 : Writes a
sequence of Data to a file, starting at item i, extending the file if necessary.
3. Create( ) -> FileId : Creates a new file of length 0 and delivers a UFID for it.
4. Delete(FileId) : Removes the file from the file store.
5. GetAttributes(FileId) -> Attr : Returns the file attributes for the file.
6. SetAttributes(FileId, Attr) : Sets the file attributes.
4. Efficiency : NFS should be good enough to satisfy users, but need not be as fast as a
local file system. Clients and servers should be able to recover easily from machine
crashes and network problems.
The Virtual File System (VFS) interface is implemented using a structure that contains the
operations that can be done on a file system.
Likewise, the vnode interface is a structure that contains the operations that can be done on
a node (file or directory) within a file system.
There is one VFS structure per mounted file system in the kernel and one vnode structure
for each active node. Using this abstract data type implementation allows the kernel to treat
all file systems and nodes in the same way without knowing which underlying file system
implementation it is using.
Each vnode contains a pointer to its parent VFS and a pointer to a mounted-on VFS. This
means that any node in a file system tree can be a mount point for another file system.
A root operation is provided in the VFS to return the root vnode of a mounted file system.
This is used by the pathname traversal routines in the kernel to bridge mount points.
The root operation is used instead of keeping a pointer so that the root vnode for each
mounted file system can be released.
Server Side
Because the NFS server is stateless, when servicing an NFS request it must commit any
modified data to stable storage before returning results.
The implication for UNIX based servers is that requests which modify the file system must
flush all modified data to disk before returning from the call.
For example, on a write request, not only the data block, but also any modified indirect
blocks and the block containing the inode must be flushed if they have been modified.
Client Side
The Sun implementation of the client side provides an interface to NFS which is transparent
to applications.
To make transparent access to remote files work we had to use a method of locating remote
files that does not change the structure of path names.
Transparent access to different types of file systems mounted on a single machine is
provided by a new file system interface in the kernel.
Each "filesystem type" supports two sets of operations : the Virtual Filesystem (VFS)
interface defines the procedures that operate on the filesystem as a whole; and the Virtual
Node (vnode) interface defines the procedures that operate on an individual file within that
filesystem type.
The ability of the client to simply retry the request is due to an important property of most
NFS requests: they are idempotent.
An operation is called idempotent when the effect of performing the operation multiple
times is equivalent to the effect of performing the operation a single time.
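For example (Python, illustrative) : a write at an explicit offset is idempotent, whereas an append is not, which is why NFS requests carry explicit positions :

def write_at(block, offset, data):
    # Idempotent : repeating the call leaves the same result.
    return block[:offset] + data + block[offset + len(data):]

def append(block, data):
    # Not idempotent : a retried append duplicates the data.
    return block + data

b = "AAAA"
assert write_at(write_at(b, 2, "XY"), 2, "XY") == write_at(b, 2, "XY")
assert append(append(b, "XY"), "XY") != append(b, "XY")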
Working :
When a user is accessing a file, the kernel determines whether the file is a local file or an
NFS file. The kernel passes all references to local files to the local file access module and
all references to the NFS files to the NFS client module.
The NFS client sends RPC requests to the NFS server through its TCP/IP module.
Traditionally NFS was used with UDP, but newer implementations can use TCP. The NFS
server receives the requests on port 2049.
Next, the NFS server passes the request through its local file access routines, which access
the file on server's local disk.
After the server gets the results back from the local file access routines, the NFS server
sends back the reply in the RPC reply format to the client.
While the NFS server is handling the client's request, the local file system needs some
amount of time to return the results to the server. During this time the server does not want
to block other incoming client requests.
To handle multiple client requests, NFS servers are multithreaded or there are multiple
servers running at the same time.
4.7.2 Communication
In NFS, all communication between a client and server proceeds along the open network
computing RPC protocol. ONC RPC is similar to other RPC systems.
Every NFS operation can be implemented as a single remote procedure call to a file server.
Up until NFS version 4, the client was made responsible for making the server's life as easy
as possible by keeping requests relatively simple.
For example, in order to read data from a file for the first time, a client normally first has to
look up the file handle using the lookup operation, after which it can issue a read request.
This approach requires two successive RPCs. In a wide-area system the drawback is that
the extra latency of a second RPC may lead to a performance degradation.
NFS version 4 supports compound procedures by which several RPCs can be grouped into
a single request. In the previous example, the client combines the lookup and read request
into a single RPC.
In the case of version 4, it is also necessary to open the file before reading can take place.
There are no transactional semantics associated with compound procedures.
The operations are simply handled in the order as requested. If there are concurrent
operations from other clients then no measures are taken to avoid conflicts.
The NFS naming model provides complete transparent access to a remote file system as
maintained by a server. This transparency is achieved by letting a client be able to mount a
remote file system into its own local file system.
Each client maintains a table which maps the remote file directories to servers.
Instead of mounting an entire file system, NFS allows clients to mount only part of a file
system. A server is said to export a directory when it makes that directory and its entries
available to clients.
The mount protocol is used to establish the initial logical connection between a server and a
client. A mount operation includes the name of the remote directory to be mounted and the
name of the server machine storing it.
The server maintains an export list which specifies local file system that it exports for
mounting along with the permitted machine names.
An NFS server can itself mount directories that are exported by other servers. However, it
is not allowed to export those directories to its own clients.
Instead, a client will have to explicitly mount such a directory from the server that
maintains it.
There is a problem with this model : deciding when a remote file system should be
mounted. To deal with this, NFS implements on-demand mounting of a remote file system,
handled by an automounter that runs as a separate process on the client's machine.
Fig. 4.7.2 shows simple automounter in NFS.
To access a file, a client must first look up its name in a naming service and obtain the
associated file handle. A file handle is a reference to a file within a file system.
It is independent of the name of the file it refers to. A file handle is created by the server
that is hosting the file system and is unique with respect to all file systems exported by the
server.
It is created when the file is created. The client is kept ignorant of the content of a file
handle. In version 4, file handles can have a variable length up to 128 bytes.
The automounter was added to the UNIX implementation of NFS in order to mount a
remote directory dynamically whenever an 'empty' mount point is referenced by a client.
Automounter has a table of mount points with a reference to one or more NFS servers listed
against each. It sends a probe message to each candidate server and then uses the mount
service to mount the file system at the first server to respond.
Automounter keeps the mount table small. Automounter provides a simple form of
replication for read-only file systems.
An NFS file has a number of associated attributes. With NFS version 4, the set of file
attributes has been split into a set of mandatory attributes that every implementation must
support (type, size, change, FSID), a set of recommended attributes that should be
preferably supported, and an additional set of named attributes.
Named attributes are actually not part of the NFS protocol, but are encoded as an array of
(attribute, value)-pairs in which an attribute is represented as a string, and its value as an
un-interpreted sequence of bytes. They are stored along with the file (or directory) and NFS
provides operations to read and write attribute values.
The mount protocol is used to establish the initial logical connection between a server and a
client. A mount operation includes the name of the remote directory to be mounted and the
name of the server machine storing it.
The server maintains an export list which specifies local file system that it exports for
mounting along with the permitted machine names.
UNIX uses /etc/exports for this purpose. Since the list has a maximum length, NFS is
limited in scalability. Any directory within an exported file system can be mounted
remotely on a machine. When the server receives a mount request, it returns a file handle to
the client.
File handle is basically a data-structure of length 32 bytes. It serves as the key for further
access to files within the mounted system.
In UNIX terms, the file handle consists of a file system identifier that is stored in the
superblock, and an inode number to identify the exact mounted directory within the
exported file system.
In NFS, one new field, called the generation number, is added to the inode. A mount can be
of three types :
1. Soft mount : A time bound is there.
2. Hard mount : No time bound.
3. Automount : Mount operation done on demand.
In addition, write operations can be carried out in the cache as well. When the client closes
the file, NFS requires that if modifications have taken place, the cached data must be
flushed back to the server. This approach corresponds to implementing session semantics.
Once a file has been cached, a client can keep its data in the cache even after closing the
file. Also, several clients on the same machine can share a single cache.
Blocks that are read from an NFS server are kept in a disk cache. As blocks of a file are
read, they are added to the cache for this file. Once the file is complete, it is marked as
persistent and can survive client crashes.
This is possible because once the whole file is cached kernel data structures are no longer
necessary to gain information about which blocks of the file are present.
If the cache becomes full a process runs through the cache and removes persistent objects
with preference for least recently used objects.
Partially cached files are not eligible for cleaning as the additional complexity of updating
the kernel data structures associated with these files during cache cleaning is tedious at best.
As this mechanism favours files that are complete in the cache, a background process runs
to collect uncached blocks of partially cached files.
The client also supports an RPC lookup cache that holds recently requested information
about file and directory attributes. These attribute requests actually contribute a large
amount of the RPC traffic associated with NFS.
This cache is however limited in its usefulness as the cache must expire after a time in the
order of 100 ms to maintain NFS semantics.
If the cache is held for longer then the client may no longer hold an accurate view of the
attributes held on the server and there are no conflict resolution procedures in the NFS
protocol to handle such a situation.
Finally the client supports asynchronous writing of files. Buffering of writes of a file to a
server avoids traffic in the situation where a file is modified many times in succession.
This is of limited importance in the HTTP server application, but becomes much more
significant when serving files to a group of workstations.
Disadvantages
1. Network slower than local disk
2. Network or server may fail even when client OK
3. Complexity, security issues
Andrew scenario
User process issues an open and there is not a current copy of the file in the client cache.
The client sends a request to the server for the whole file and stores it as a local file in a
local file system (client cache).
An open is then performed on the local file.
Subsequent operations work on the local file.
When the client issues a close, the entire file is written back, but the copy is still kept on the
client machine.
Implementation of AFS
The key software components in AFS are :
1. Vice : The server side process that resides on top of the UNIX kernel, providing shared
file services to each client. Collection of servers is referred to as vice.
2. Venus : The client-side cache manager, which runs on each client workstation and acts
as an interface between the application programs and Vice.
Fig. 4.8.1 shows distribution of processes in the Andrew File System. All the files in AFS
are distributed among the servers. The set of files in one server is referred to as a volume.
In case a request cannot be satisfied from this set of files, the vice server informs the client
where it can find the required file.
Venus interacts with the kernel's Virtual File System (VFS), which provides the abstraction
of a common file system at each client and is responsible for all distributed file operations.
The
files available to user processes running on clients are either local or shared. Local files are
handled as normal UNIX files. They are stored on a client disk and are available only to
local user processes. Shared files are stored on servers, and copies of them are cached on
the local disks of clients.
The client-side component of AFS is the cache manager. The responsibilities of the cache
manager include retrieving files from servers, maintaining a local file cache, translating file
requests into remote procedure calls, and storing callbacks.
The cache manager redirects all read and write calls to the cached copy. When the client
closes the file, the cache manager flushes the changes to the server.
When the cache manager fetches the file from the server, the server also supplies a callback
associated with the data. The callback is a promise that the data is valid. If another client
modifies the file and writes the changes back to the server, the server notifies all clients
holding callbacks for the file. This is called breaking the callback.
The basic file operations :
1. Open a file : Venus traps application generated file open system calls, and checks
whether it can be serviced locally before requesting Vice for it. It then returns a file
descriptor to the calling application. Vice, along with a copy of the file, transfers a
callback promise, when Venus requests for a file.
2. Read and Write : Reads/Writes are done from/to the cached copy.
3. Close a file : Venus traps file close system calls and closes the cached copy of the file.
If the file had been updated, it informs the Vice server which then replaces its copy with
the updated one, as well as issues callbacks to all clients holding callback promises on
this file. On receiving a callback, the client discards its copy, and works on this fresh
copy.
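The callback-promise bookkeeping on the server side can be sketched as follows (Python, illustrative; network delivery is reduced to method calls) :

class ViceServer:
    def __init__(self):
        self.files = {}                    # name -> contents
        self.callbacks = {}                # name -> set of client objects

    def fetch(self, client, name):
        # Hand out the data together with a callback promise.
        self.callbacks.setdefault(name, set()).add(client)
        return self.files[name]

    def store(self, client, name, data):
        self.files[name] = data
        # Break the callback at every other client holding a promise.
        for c in self.callbacks.get(name, set()) - {client}:
            c.break_callback(name)
        self.callbacks[name] = {client}

class VenusClient:
    def __init__(self):
        self.cache = {}

    def break_callback(self, name):
        self.cache.pop(name, None)         # discard the stale cached copy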
The server wishes to maintain its state at all times, so that no information is lost due to
crashes. This is ensured by Vice, which writes the state to disk. When the server comes up
again, it informs the other servers about its crash, so that information about updates may be
passed to it. The callback mechanism implies a stateful server.
System call interception in AFS
Fig. 4.8.2 shows system call interception in AFS.
Venus intercepts two system calls sent to the OS : open ( ) and close ( ). On an open ( )
request, it investigates the filename to determine whether it lies in AFS space.
If the filename does not lie in AFS space, then the file is a file on the local hard drive, and
Venus simply passes the system call on to the regular open( ) system call handler to handle
as normal. But if it lies in AFS space, then Venus has some work to do.
Other Issues of AFS :
AFS presents a location-transparent UNIX file name space to client, using a set of trusted
servers. Directories are cached in their entirety, while files are cached in 64 KB chunks. All
updates to a file are propagated to its server upon close. Directory modifications are
propagated immediately.
Backup, disk quota enforcement, and most other administrative operations in AFS operate
on volumes. AFS uses ACLs and the granularity of protection is an entire directory.
b home-based approaches
d all of these
c non-consistent d consistent
Q.6 Which one of the following hides the location where in the network the file is stored ?
a Transparent distributed file system
b local name
Q.9 The NFS client and server modules communicate using ________.
a remote method invocation b remote procedure calls
If a file system has been replicated it may be possible to continue working after one
replica crashes by simply switching to one of the other replicas.
Also, by maintaining multiple copies, it becomes possible to provide better protection
against corrupted data.
For example, imagine there are three copies of a file and every read and write operation
is performed on each copy.
The system can then be made safe against a single failing write operation by taking the
value that is returned by at least two copies as the correct one.
2. Replication for performance
Fig. 5.2.1
In this model, writes must occur in the same order on all copies; reads however can be
interleaved on each system, as convenient.
A DSM system is said to be sequentially consistent if for any execution there is some
interleaving of the series of operations issued by all the processes that satisfies the
following two criteria :
1. SC1 : The interleaved sequence of operations is such that if a read operation R(x)a
occurs in the sequence, then either the last write operation that occurs before it in the
interleaved sequence is W(x)a, or no write operation occurs before it and a is the initial
value of x.
2. SC2 : The order of operations in the interleaving is consistent with the program order in
which each individual client executed them.
The result of the execution of a parallel program is the same as if the program were
executed on a single processor in some sequential order :
P: write x; write y; read x;
Q: read y; write x; read x;
Some legitimate sequential orders :
P write x; P write y; P read x; Q read y; Q write x; Q read x;
P write x; Q read y; P write y; Q write x; P read x; Q read x;
Q read y; Q write x; P write x; P write y; P read x; Q read x
All processors see the same sequence of memory references :
a. Suppose P : write x; and Q : write x; execute concurrently.
b. If one process sees P's write first and then Q's, then every process sees the same order.
If write requests are processed exclusively, sequential consistency can be achieved.
A sequential consistency memory model provides one - copy/single - copy semantics
because all the processes sharing a memory location always see exactly the same
contents stored in it.
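As a sketch of how a small execution could be tested for sequential consistency, the following brute-force Python checker enumerates interleavings that respect program order (SC2) and verifies that every read returns the latest preceding write (SC1). The operation encoding and the initial value 0 are our assumptions, not part of any standard API :

    from itertools import permutations

    def is_sequentially_consistent(histories):
        """histories: dict process -> list of ops like ("P", "write", "x", 1)."""
        ops = [op for h in histories.values() for op in h]
        for order in permutations(ops):
            # SC2: the interleaving must respect each process's program order.
            if any(list(filter(lambda o: o[0] == p, order)) != h
                   for p, h in histories.items()):
                continue
            # SC1: every read returns the most recent preceding write (or initial 0).
            mem, legal = {}, True
            for proc, kind, var, val in order:
                if kind == "write":
                    mem[var] = val
                elif mem.get(var, 0) != val:
                    legal = False
                    break
            if legal:
                return True
        return False

The checker is exponential in the number of operations, so it is only practical for the small examples used in this section.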
5.2.3 Linearizability
The result of any execution is the same as if the operations by all processes on the data
were executed in some total order.
The operations of each individual process appear in this sequence in the order in which
they actually happened in real time :
a. The servers' view is brought in to define the ordering of concurrent events.
b. The real times at which activities actually happened are defined by the actions
performed on the servers; they are defined by the actual enqueuing time of each request.
c. Non-overlapping requests have to follow the order of the requests' enqueuing times.
d. Overlapping requests, whose enqueuing times appear in different orders on different
servers, may be ordered arbitrarily, but the result must still be sequentially consistent.
Fig. 5.2.2
Fig. 5.2.3
Fig. 5.2.4
P3 and P4 observe the writes by P1 and P2 in different orders, although W(x)1 and W(x)2
are potentially causally related.
The PRAM consistency model is simple and easy to implement, and also has good performance.
PRAM consistency can be implemented by simply sequencing the write operations
performed at each node independently of the write operations performed on other nodes.
Eventual consistency requires only that updates are guaranteed to propagate to all replicas.
Eventually consistent data stores therefore work fine as long as clients always access the
same replica.
Client-centric consistency provides consistency guarantees for a single client with respect to
the data accessed by that client.
What happens when different replicas are accessed ?
Example : Consider a distributed database to which you have access through your
notebook. Assume your notebook acts as a front end to the database. At location A you
access the database doing reads and updates. At location B you continue your work, but
unless you access the same server as the one at location A, you may detect inconsistencies,
because :
1. Your updates at A may not have yet been propagated to B
2. You may be reading newer entries than the ones available at A
3. Your updates at B may eventually conflict with those at A
Fig. 5.3.1 shows distributed database for mobile user.
For the mobile user example, eventually consistent data stores will not work properly.
Client-centric consistency provides guarantees for a single client concerning the consistency
of access to a data store by that client. No guarantees are given concerning concurrent
accesses by different clients.
Example : Automatically reading your personal calendar updates from different servers.
Monotonic Reads guarantees that the user sees all updates, no matter from which server the
automatic reading takes place.
Example : Reading (not modifying) incoming mail while you are on the move.
Each time you connect to a different e-mail server, that server fetches (at least) all the
updates from the server you previously visited.
Example : The read operations performed by a single process P1 at two different local
copies of the same data store.
The vertical axis shows the two different local copies of the data store, which we call
Location1 and Location2.
The horizontal axis shows time. Operations carried out by the single process P1, shown in
boldface, are connected by a dashed line representing the order in which they are carried out.
Fig. 5.3.2 shows monotonic read operation.
Process P1 first performs a read operation on X at Location1, returning the value of X1.
This value results from the write operations in Write (X1) performed at Location1. Later,
P1 performs a read operation on X at Location2, shown as Read (X2).
To guarantee monotonic-read consistency, all operations in Write (X1) should have been
propagated to Location2 before the second read operation takes place.
Fig. 5.3.2 (b) : A data store that does not provide monotonic reads
This depicts a situation in which monotonic-read consistency is not guaranteed. After
process P1 has read X1 at Location1, it later performs the operation Read(X2) at Location2.
However, only the write operations in Write(X2) have been performed at Location2; no
guarantee is given that this set also contains all operations contained in Write(X1).
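One hedged way to enforce monotonic reads is for each client to remember the set of writes it has observed, and for a replica to serve a read only after it has applied at least that set. The Replica and Client classes below are illustrative sketches, not drawn from any particular system :

    class Replica:
        def __init__(self):
            self.applied = set()        # ids of writes applied at this replica
            self.store = {}

        def pull(self, write_ids):
            # Stand-in: fetch and apply the given writes from peer replicas.
            self.applied |= write_ids

        def read(self, key, client_seen):
            # Serve the read only if this replica has seen every write the
            # client has already observed; otherwise sync first.
            missing = client_seen - self.applied
            if missing:
                self.pull(missing)
            return self.store.get(key), self.applied

    class Client:
        def __init__(self):
            self.seen = set()           # write ids this client has read from

        def read(self, replica, key):
            value, applied = replica.read(key, self.seen)
            self.seen |= applied        # later reads must reflect at least these
            return value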
A write operation on a copy of item x is performed only if that copy has been brought up to
date by means of any preceding write operation, which may have taken place on other
copies of x. If need be, the new write must wait for old ones to finish.
Example : Updating a program at server S2, and ensuring that all components on which
compilation and linking depends, are also placed at S2.
Example : Maintaining versions of replicated files in the correct order everywhere.
The write operations performed by a single process P at two different local copies of the
same data store.
This resembles PRAM, but here we are considering consistency only for a single process
(client) instead of for a collection of concurrent processes.
Fig. 5.3.3 shows monotonic - write consistent data store and data store that does not provide
monotonic-write consistency.
Fig. 5.3.3 (b) : Store that does not provide monotonic - write consistency
Write(X1) has not been propagated to Location2
Read-your-writes consistency guarantees that the effect of a write operation by a process on
data item x will always be seen by a successive read operation on x by the same process.
Fig. 5.3.4 (b) : A data store that does not provide read-your-writes consistency
All of those writes can take a long time. Using a non-blocking write protocol to handle
the updates can lead to fault tolerance problems.
As the primary is in control, all writes can be sent to each backup replica in the same
order, making it easy to implement sequential consistency.
2. Local - Write Protocols
It is a fully migrating approach. A single copy of the data item is still maintained.
Upon a write, the data item gets transferred to the replica that is writing; the status of
primary for a data item is transferable.
Process : Whenever a process wants to update data item x, it locates the primary copy of
x, and moves it to its own location.
Example : Fig. 5.5.2 shows local write protocol.
Primary-based local - write protocol in which a single copy is migrated between
processes (prior to the read/write).
Requests are processed by all replica managers (RMs) independently. The client interface
compares all replies received and can tolerate N failures out of 2N + 1 replicas, i.e.,
consensus is reached when N + 1 identical responses are received. This model can also
tolerate Byzantine failures.
Fig. 5.5.3 shows active replication.
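The client-side voting step can be sketched as follows, assuming 2N + 1 replica managers and at most N faulty replies. The function name voted_result is ours, used purely for illustration :

    from collections import Counter

    def voted_result(replies, n_faults):
        """replies: iterable of results from the 2N+1 RMs (arriving over time)."""
        counts = Counter()
        for r in replies:
            counts[r] += 1
            if counts[r] >= n_faults + 1:   # N+1 identical responses = consensus
                return r
        raise RuntimeError("no value reached N+1 matching replies")

    # Example: with N = 1 (three RMs), two matching replies decide the result.
    print(voted_result([42, 41, 42], n_faults=1))   # -> 42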
Web caching refers to the temporary storage of web content somewhere between web
servers and clients in order to satisfy future requests from the nearby location. Fig. 5.6.1
shows proxy web cache.
Co-operative caching
In co-operative caching mechanisms, a group of caches work together by collectively
pooling their memory resources to provide a larger proxy cache. These co-operating caches
can also be centrally managed by a server.
To search in the co-operative cache, the proxy forwards the requested URL to a mapping
server. The use of a central mapping service distinguishes the CRISP cache from other
co-operative Internet caches.
Multiple caches in a network often coordinate and share resources in this way in order to
serve each other's requests. When a cache does not have the requested data object, it can
forward the request to a nearby cooperating cache that can serve the object faster than the
origin server.
Cooperative caching is typically implemented across caches within an organization such as
a large enterprise, ISP, or a Content Delivery Network (CDN).
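A request path through such a cooperating group might look like the following sketch. The peer.lookup and fetch_origin calls are assumed interfaces for illustration, not a real proxy API :

    def get(url, local_cache, peers, fetch_origin):
        if url in local_cache:                      # local hit
            return local_cache[url]
        for peer in peers:                          # nearby cooperating caches
            obj = peer.lookup(url)                  # assumed peer-cache interface
            if obj is not None:
                local_cache[url] = obj
                return obj
        obj = fetch_origin(url)                     # slowest path: origin server
        local_cache[url] = obj
        return obj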
CDN providers use caching and replica servers located in different geographical locations
to replicate content. CDN cache servers are also called edge servers or surrogates. Taken
together, the edge servers of a CDN are called a Web cluster.
CDNs distribute content to the edge servers in such a way that all of them share the same
content and URL. Client requests are redirected to the nearby optimal edge server and it
delivers requested content to the end users. Thus, transparency for users is achieved.
Akamai is one of the largest CDNs currently deployed, with tens of thousands of replica
servers placed all over the Internet. To a large extent, Akamai uses well-known technology
to replicate content, notably a combination of DNS based redirection and proxy-caching
techniques.
There are essentially three different kinds of aspects related to replication in Web hosting
systems :
1) Metric estimation
2) Adaptation triggering
3) Taking appropriate measures :
A. Replica placement decisions
B. Consistency enforcement
C. Client-request routing
Metric estimation
1. Latency : The time measured for an action; fetching a document is an example.
2. Spatial metrics : These consist of measuring the distance between nodes in terms of
network-level routing hops.
3. Consistency metrics : These tell the user to what extent a replica is deviating from its
master copy.
4. Financial metrics : These are closely related to the actual infrastructure of the Internet.
For example, most commercial CDNs place servers at the edge of the Internet, meaning that
they hire capacity from ISPs directly servicing end users.
Review Questions
Q.4 __________ redundancy adds extra equipment or processes so that the system can
tolerate the loss or malfunctioning of some components.
a Physical b Time c Information d Sequential
Q.5 If local states jointly do not form a distributed snapshot, further rollback is necessary.
This process of a cascaded rollback may lead to what is called the __________.
a checkpointing b recovery line
d All of these
Q.8 __________ protocols assume that a failure can occur after any non-deterministic
event in the computation.
a Optimistic logging b Pessimistic logging
Q.9 If no updates take place for a long time, all replicas will gradually become consistent.
This form of consistency is called __________ consistency.
a sequential b strict c weak d eventual
Q.10 In a push based approach, also referred to as __________ protocols, updates are
propagated to other replicas without those replicas even asking for the updates.
a client-based b server-based
Q.11 __________ approaches are often used between permanent and server-initiated
replicas, but can also be used to push updates to client caches.
a Push-based b Pull-based
c Client-based d Server-based
6 Fault Tolerance
Syllabus
Contents
Fig. 6.1.1
Requirements :
1. Availability : It is defined as the property that a system is ready to be used
immediately. It is the fraction of the time that a system meets its specification, i.e.
the probability that the system is operational at a given time t.
2. Reliability : It refers to the property that a system can run continuously without failure.
Typically used to describe systems that cannot be repaired or where the continuous
operation of the system is critical.
3. Safety : It refers to the requirement that when a system temporarily fails to operate
correctly, nothing catastrophic happens.
4. Maintainability : It refers to how easily a failed system can be repaired.
3. Timing failures : Applicable only to synchronous distributed systems where time limits
may not be met.
Fault models : Following are the fault models
o Omission faults
o Arbitrary faults
o Timing faults
Faults can occur both in processes and communication channels. The reason can be both
software and hardware faults.
Fault models are needed in order to build systems with predictable behaviour in case of
faults.
Of course, such a system will function according to the predictions, only as long as the
real faults behave as defined by the “fault model”.
1. Omission failures
2. Arbitrary failures :
Arbitrary process failure : Arbitrarily omits intended processing steps or takes unintended
processing steps.
Arbitrary channel failures : Messages may be corrupted, duplicated, delivered out of
order, incur extremely large delays; or non - existent messages may be delivered.
Above two are Byzantine failures, e.g., due to hackers, man-in-the-middle attacks, viruses,
worms, etc.
A variety of Byzantine fault-tolerant protocols have been designed in literature.
Arbitrary failures in processes cannot be detected by seeing whether the process responds to
invocations, because it might arbitrarily omit to reply.
Communication channels also suffer from arbitrary failures. For example : message
contents can be corrupted, a duplicate message can be sent, or a message can be lost on its
way.
Omission and arbitrary failures are as follows :

Sr. No.  Class of failure          Affects   Description
1.       Fail-stop or Crash-stop   Process   Process halts and remains halted. Other
                                             processes may detect this state.
2.       Omission                  Channel   A message inserted in an outgoing message
                                             buffer never arrives at the other end's
                                             incoming message buffer.
3.       Send-omission             Process   A process completes a send, but the message
                                             is not put in its outgoing message buffer.
Key property : When a message is sent, all members of the group must receive it. So, if one
fails, the others can take over for it.
Groups could be dynamic. We need mechanisms to manage groups and membership
(e.g., join, leave, be part of two groups)
Flat groups versus hierarchical groups
Fig. 6.2.1 shows communication in a flat group and communication in a simple hierarchical
group.
Fig. 6.2.1
1. Communication in a flat group :
All the processes are equal and decisions are made collectively.
There is no single point-of-failure, however decision making is complicated as
consensus is required.
Good for fault tolerance as information exchange immediately occurs with all group
members. May impose overhead as control is completely distributed, and voting needs
to be carried out.
Harder to implement.
2. Communication in a simple hierarchical group :
One of the processes is elected to be the coordinator, which selects another process (a
worker) to perform the operation.
Not really fault tolerant or scalable
However, easier to implement
But one or more of the generals may be treacherous, i.e. faulty.
If the commander is treacherous, he proposes attacking to one general and retreating to
another.
If a lieutenant is treacherous, he tells one of his peers that the commander told him to attack
and another that they are to retreat.
The source processor broadcasts its value to the others. A solution must meet the following objectives :
Agreement : All non-faulty processors agree on the same value.
Validity : If source is nonfaulty, then the common agreed value must be the value supplied
by the source processor.
“If source is faulty then all non - faulty processors can agree on any common value”.
“Value agreed upon by faulty processors is irrelevant”
Fig. 6.2.2 shows Byzantine agreement.
No solution for three processes can handle a single traitor. In a system with m faulty
processes, agreement can be achieved only if 2m + 1 processes (more than two-thirds of the
total) are functioning correctly.
A failure detector is an object or piece of code in a process that detects failures of other
processes. A failure detector is not always accurate; such a detector is categorized as an
unreliable failure detector.
An unreliable failure detector outputs one of two values : Unsuspected or Suspected.
Unsuspected : Failure is unlikely; for example, the failure detector has recently received
communication from the unsuspected peer. This may be inaccurate.
Suspected : An indication that the peer process has failed; for example, no message has been
received in quite some time. This too may be inaccurate : the peer process may not have
failed, but the communication link may be down, or the peer process may be much slower
than expected.
A simple algorithm
If we assume that all messages are delivered within some bound, say D seconds. Then we
can implement a simple failure detector as :
Every process p sends a "p is still alive" message to all failure detector processes,
periodically, once every T seconds.
If a failure detector process does not receive a message from process q within T + D
seconds of the previous one then it marks q as "Suspected".
If we choose our bound D too high, then a failed process will often still be marked as
"Unsuspected" for some time. A synchronous system has a known bound on the message
delivery time and the clock drift and hence can implement a reliable failure detector. An
asynchronous system could instead give one of three answers, "Unsuspected", "Suspected"
or "Failed", by choosing two different values of D.
In fact, if we have a known distribution of message transmission times, we could instead
respond to queries about process p with the probability that p has failed. For example : if
you know that 90 % of messages arrive within 2 seconds and it has been two seconds since
your last expected message, you still cannot conclude that there is a 90 % chance that
process p has failed; the message may simply be late.
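The simple detector described above translates almost directly into code. In this sketch, T + D is the deadline after which a silent process is suspected; the class and method names are illustrative, not from any standard library :

    import time

    class FailureDetector:
        def __init__(self, T, D):
            self.deadline = T + D
            self.last_heard = {}                 # process id -> last heartbeat time

        def on_heartbeat(self, proc):
            # Called whenever a "p is still alive" message arrives.
            self.last_heard[proc] = time.monotonic()

        def status(self, proc):
            last = self.last_heard.get(proc)
            if last is None or time.monotonic() - last > self.deadline:
                return "Suspected"               # may be inaccurate: slow link, not crash
            return "Unsuspected"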
a. Server Crashes
The client cannot tell if the crash occurred before or after the request was carried out.
Three possible semantics :
1. At-least-once : keep trying until a reply is received.
2. At-most-once : give up immediately and report back failure.
3. Exactly-once : desirable but not achievable.
b. Lost Request / Reply Messages
The client waits for a reply message and resends the request upon timeout.
Problem : Upon timeout, the client cannot tell whether the request was lost or the reply
was lost.
Client can safely resend the request for idempotent operations
An idempotent operation is an operation that can be safely repeated
E.g., reading the first line of a file is idempotent, transferring money is not
For non-idempotent operations, client can add sequence numbers to requests so that the
server can distinguish a retransmitted request from an original request
The server needs to keep track of the most recently received sequence number from each
client. It will not carry out a retransmitted request, but will still send a reply to the client.
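The duplicate-filtering rule just described can be sketched as follows; DedupServer and handler are illustrative names, not a real RPC framework :

    class DedupServer:
        def __init__(self, handler):
            self.handler = handler               # executes a request, returns a reply
            self.last = {}                       # client id -> (seq, reply)

        def handle(self, client, seq, request):
            prev = self.last.get(client)
            if prev is not None and seq <= prev[0]:
                return prev[1]                   # duplicate: reply without re-executing
            reply = self.handler(request)        # fresh request: carry it out once
            self.last[client] = (seq, reply)
            return reply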
c. Client Crashes after Sending a Request
Each multicast message is stored locally in a history buffer at the sender. Assuming the
receivers are known to the sender, the sender simply keeps the message in its history buffer
until each receiver has returned an acknowledgment.
If a receiver detects it is missing a message, it may return a negative acknowledgment,
requesting the sender for a retransmission
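A sender-side sketch of this history-buffer scheme, with acknowledgments and negative acknowledgments, might look like the following; the send callback and the class name are our assumptions :

    class MulticastSender:
        def __init__(self, receivers, send):
            self.receivers = set(receivers)
            self.send = send                     # send(receiver, seq, msg)
            self.history = {}                    # seq -> (msg, receivers yet to ack)
            self.seq = 0

        def multicast(self, msg):
            self.seq += 1
            self.history[self.seq] = (msg, set(self.receivers))
            for r in self.receivers:
                self.send(r, self.seq, msg)

        def on_ack(self, receiver, seq):
            msg, pending = self.history[seq]
            pending.discard(receiver)
            if not pending:                      # all receivers acknowledged:
                del self.history[seq]            # message may leave the history buffer

        def on_nack(self, receiver, seq):
            self.send(receiver, seq, self.history[seq][0])   # retransmit on request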
Another important design decision in group communication is the ordering of messages
sent to a group. Roughly speaking, there are four possible orderings: no ordering, FIFO
ordering, causal ordering, and total ordering.
Two message sending events are said to be causally related if they are correlated by the
happened before relation.
Fig. 6.4.4 shows the causal ordering.
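Causal ordering is commonly enforced with vector clocks : a message is delivered only when every message that causally precedes it has already been delivered. The sketch below assumes each broadcast carries the sender's vector timestamp; the class and method names are illustrative :

    class CausalReceiver:
        def __init__(self, n):
            self.vc = [0] * n                    # messages delivered per sender
            self.pending = []

        def deliverable(self, sender, ts):
            # ts[sender] must be the next message from that sender, and we must
            # already have everything the sender had seen from everyone else.
            return (ts[sender] == self.vc[sender] + 1 and
                    all(ts[k] <= self.vc[k] for k in range(len(ts)) if k != sender))

        def receive(self, sender, ts, msg):
            self.pending.append((sender, ts, msg))
            progressed = True
            while progressed:                    # drain everything now deliverable
                progressed = False
                for item in list(self.pending):
                    s, t, m = item
                    if self.deliverable(s, t):
                        self.vc[s] += 1
                        self.pending.remove(item)
                        print("deliver:", m)     # application-level delivery
                        progressed = True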
Some applications perform operations on multiple databases. For example : Transfer funds
between two bank accounts or debiting one account and crediting another.
We would like a guarantee that either all the databases get updated, or none does.
Distributed commit problem : Operation is committed when all participants can perform it.
Once a commit decision is reached, this requirement holds even if some participants fail
and later recover.
Commit protocols are used to ensure atomicity across sites.
A transaction which executes at multiple sites must either be committed at all the sites or
aborted at all the sites; it is not acceptable to have a transaction committed at one site
and aborted at another.
The two-phase commit (2PC) protocol is widely used.
The three-phase commit (3PC) protocol is more complicated and more expensive, but
avoids some drawbacks of two-phase commit protocol. This protocol is not used in
practice.
Transactions behave as one operation :
1. Atomicity : All-or-none, if transaction failed then no changes apply to the database
2. Consistency : There is no violation of the database integrity constraints
3. Isolation : Partial results are hidden (due to incomplete transactions)
4. Durability : The effects of transactions that were committed are permanent.
The objective of the two-phase commit is to ensure that each node commits its part of the
transaction; otherwise, the transaction is aborted. If one of the nodes fails to commit, the
information necessary to recover the database is in the transaction log, and the database can
be recovered with the DO-UNDO-REDO protocol.
Time-out actions in the two-phase commit
Time-outs are used to avoid blocking forever when a process crashes or a message is lost.
Fig. 6.5.1 shows the communication in the two-phase commit protocol.
1. To deal with server crashes : Each participant saves tentative updates into permanent
storage right before replying yes/no in the first phase, so that they are retrievable after
crash recovery.
2. To deal with loss of a canCommit ? message : The participant may decide to abort
unilaterally after a timeout.
3. To deal with loss of a Yes/No vote : The coordinator aborts the transaction after a
timeout. It must announce doAbort to those who sent in their votes.
4. To deal with loss of a doCommit message : The participant may wait for a timeout and
send a getDecision request; it cannot abort after having voted Yes but before receiving
doCommit/doAbort.
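Putting the two phases and the timeout rules together, the coordinator side might be sketched as follows. The Participant methods (can_commit, do_commit, do_abort) are assumed for illustration and are not a standard API :

    def two_phase_commit(participants, timeout):
        # Phase 1: collect votes; a vote that is lost or late counts as "no",
        # so the coordinator aborts after the timeout (rule 3 above).
        try:
            votes = [p.can_commit(timeout=timeout) for p in participants]
        except TimeoutError:
            votes = [False]
        decision = all(votes)

        # Phase 2: announce the decision; a participant that voted yes and misses
        # the message must ask again via getDecision (rule 4 above).
        for p in participants:
            if decision:
                p.do_commit()
            else:
                p.do_abort()
        return decision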
Advantages of two phase commit
1. It ensures atomicity even in the presence of deferred constraints.
2. It ensures independent recovery of all sites.
3. Since it takes place in two phases, it can handle network failures and disconnections,
and still assure atomicity in their presence.
Disadvantages of two phase commit
1. Involves a great deal of message complexity.
2. Greater communication overheads as compared to simple optimistic protocols.
3. Blocking of site nodes in case of failure of coordinator.
4. Multiple forced writes of the log, which increase latency.
5. Its performance is again a trade-off, especially for short-lived transactions such as
internet applications.
Review Question
1. Explain the requirements of the atomic commitment problem. How can the atomic
commit protocol be implemented using two-phase commit ? SPPU : Dec. - 18, End sem, Marks 8
Recovery refers to restoring a system to its normal operational state. Once a failure has
occurred, it is essential that the process where the failure happened can recover to a correct
state. Fundamental to fault tolerance is the recovery from an error.
Resources are allocated to executing processes in a computer. For example : A process has
memory allocated to it and a process may have locked shared resources, such as files and
memory.
a. System failure : System does not meet requirements, i.e. does not perform its services
as specified.
b. Erroneous system state : State which could lead to a system failure by a sequence of
valid state transitions.
c. Error : The part of the system state which differs from its intended value.
d. Fault : Anomalous physical condition, e.g. design errors, manufacturing problems,
damage, external disturbances.
A failure occurs when an actual running system deviates from this specified behavior. The
cause of a failure is called an error. An error represents an invalid system state, one that is
not allowed by the system behavior specification.
The error itself is the result of a defect in the system or fault. In other words, a fault is the
root cause of a failure. That means that an error is merely the symptom of a fault. A fault
may not necessarily result in an error, but the same fault may result in multiple errors.
Similarly, a single error may lead to multiple failures.
To ensure correctness, recovery mechanisms must be in place to ensure transaction
atomicity and durability even in the midst of failures.
Distributed recovery is more complicated than centralized database recovery because
failures can occur at the communication links or a remote site. Ideally, a recovery system
should be simple, incur tolerable overhead, maintain system consistency, provide partial
operability and avoid global rollback.
Reliability refers to the probability that the system under consideration does not experience
any failures in a given time period. Availability refers to the probability that the system can
continue its normal execution according to the specification at a given point in time in spite
of failures.
2. System failure :
Behavior : The processor fails to execute.
A system failure occurs when the processor fails to execute. It is caused by software
errors or hardware faults, i.e. CPU/memory/bus failure.
Recovery : The system is stopped and restarted in a correct state.
Assumption : Fail-stop processors, i.e. the system stops execution and its internal state is lost.
3. Secondary storage failure :
A secondary storage failure is said to have occurred when the stored data cannot be
accessed. This failure is usually caused by parity error, head crash or dust particles
settled on the medium.
Behavior : Stored data cannot be accessed.
Errors causing failure : Parity error, head crash, etc.
Recovery/Design strategies : Reconstruct content from archive plus log of activities and
design mirrored disk system.
A system failure can further be classified as follows.
1. An amnesia failure 2. A partial amnesia failure
3. A pause failure 4. A halting failure
4. Communication medium failure :
Behavior : A site cannot communicate with another operational site.
A communication medium failure occurs when a site cannot communicate with another
operational site in the network. It is usually caused by the failure of the switching nodes
and/or the links of the communicating system.
Errors/Faults : Failure of switching nodes or communication links.
Recovery/Design strategies : Reroute, error-resistant communication protocols.
3. Checking must be done periodically to see whether the failed site has recovered.
4. After the failed site restarts, it must initiate a recovery procedure to abort all partial
transactions that were active at the time of failure.
Fig. 6.6.2
6.6.4 Checkpoint
Checkpointing : The process of writing the current committed values of a server’s object
to a new recovery file, together with transaction status entries and intentions lists of
transactions that have not yet been fully resolved.
Checkpoint : The information stored by the checkpointing process. It is a point of
synchronization between database and log file. All buffers are force-written to secondary
storage.
Establish a set of local checkpoints (one for each process in the set) such that no
information flow takes place (i.e., no orphan messages) during the interval spanned by the
checkpoints.
A strongly consistent set of checkpoints (recovery line) corresponds to a strongly consistent
global state.
There is one recovery point for each process in the set. During the interval spanned by the
checkpoints, there is no information flow between any pair of processes in the set, nor
between a process in the set and any process outside the set.
A consistent set of checkpoints corresponds to a consistent global state.
No local checkpoint includes an effect whose cause would be undone due to the rollback of
another process.
Consistent set of checkpoints
Similar to the consistent global state.
Each message that is received in a checkpoint (state) should also be recorded as sent in
another checkpoint (state).
Suppose that Y fails after receiving message 'm'. If Y restarts from checkpoint, message 'm'
is lost due to rollback.
Checkpoint notation :
Each node maintains :
1. Monotonically increasing counter with which each message from that node is labelled.
2. Records of the last message from and the first message to all other nodes.
Fig. 6.6.5
Note : "sl" denotes a "smallest label" that is < any other label, and "ll" denotes a "largest
label" that is > any other label.
Each checkpoint on a data item is assigned a unique sequence number. A data item is
checkpointed only after the state of the data item changes. That is, after a data item is
checkpointed, it is not checkpointed again until at least one other transaction has
accessed and changed the data item.
Let T = {Ti | 1 ≤ i ≤ m} be a set of transactions that access the database system.
Each regular transaction is a partially ordered set of read and/or write operations. A
checkpointing transaction consists of only one operation, an operation that is similar to a
write operation which requires mutually exclusive access to the data item.
Checkpointing in distributed systems requires that all processes (sites) that interact with
one another establish periodic checkpoints. All the sites save their local states :local
checkpoints. All the local checkpoints, one from each site, collectively form a global
checkpoint. The domino effect is caused by orphan messages, which in turn are caused
by rollbacks.
Simple method for taking a consistent set of checkpoints :
a. Every process takes a checkpoint after sending every message.
Phase One :
The initiating process Pi takes a tentative checkpoint and requests that all the processes take
tentative checkpoints.
Each process informs Pi whether it succeeded in taking a tentative checkpoint.
If Pi learns that all processes have taken tentative checkpoints, Pi decides that all
tentative checkpoints should be made permanent.
Otherwise, Pi decides that all tentative checkpoints should be discarded.
Phase Two
1. Pi propagates its decision to all processes.
2. On receiving the message from Pi, all processes act accordingly.
Between tentative checkpoint and commit/abort of checkpoint process must hold back
messages.
Does this guarantee we have a strongly consistent state ? Can you construct an example
that shows we can still have lost messages ?
Synchronous Checkpointing : Properties
5. Y takes a tentative checkpoint only if the last message received by X from Y was sent
after Y sent the first message after the last checkpoint (last_recv(x, y) >=
first_send(y, x)).
When a process takes a checkpoint, it will ask all other processes that sent messages to
the process to take checkpoints.
Synchronous Checkpointing Disadvantages
1. Additional messages must be exchanged to coordinate checkpointing.
2. Synchronization delays are introduced during normal operations.
3. No computational messages can be sent while the checkpointing algorithm is in
progress.
4. If failure rarely occurs between successive checkpoints, then the checkpoint algorithm
places an unnecessary extra load on the system, which can significantly affect
performance.
2. Rollback Recovery
Restore the system state to a consistent state after a failure with assumptions : Single
initiator, checkpoint and rollback recovery algorithms are not invoked concurrently.
Phase One :
Process Pi checks whether all processes are willing to restart from their previous
checkpoints.
A process may reply “no” if it is already participating in a checkpointing or recovering
process initiated by some other process.
If all processes are willing to restart from their previous checkpoints, Pi decides that they
should restart.
Otherwise, Pi decides that all the processes continue with their normal activities.
Phase Two :
Pi propagates its decision to all processes.
On receiving Pi’s decision, the processes act accordingly.
Optimization
A minimum number of processes roll back
Y will restart from its permanent checkpoint only if X is rolling back to a state where
the sending of one or more messages from X to Y is being undone.
Fig. 6.6.7 shows the unnecessary rollback.
Fig. 6.6.8 shows a system consisting of three processes, where horizontal lines extending
toward the right-hand side represent the execution of each process, and arrows between
processes represent messages.
An orphan process is a process that survives the crash of another process, but whose state
is inconsistent with the crashed process after its recovery.
Three processes X, Y and Z exchange information. Information exchange is shown by
arrows (→), and the symbol "[" marks a recovery point to which a process can be rolled back
in the event of a failure.
1. Case 1 : Failure of X after x3 : no impact on Y or Z.
Lost Messages
Regenerating lost messages on recovery :
1. If implemented on unreliable communication channels, the application is responsible
2. If implemented on reliable communication channels, the recovery algorithm is
responsible.
Fig. 6.6.10 shows lost messages due to roll back recovery.
Q.2 A problem with the __________ commit protocol is that when the coordinator has
crashed, participants may not be able to reach a final decision.
a three-phase b two-phase c checkpoint d none
Q.3 __________ redundancy adds extra bits to allow recovery from garbled bits.
a Physical b Time c Information d All of these
Q.6 __________ redundancy adds extra equipment or processes so that the system can
tolerate the loss or malfunctioning of some components.
a Physical b Time c Information d Sequential
Q.7 If local states jointly do not form a distributed snapshot, further rollback is necessary.
This process of a cascaded rollback may lead to what is called the __________
a checkpointing b recovery line
Q.8 As processes take local checkpoints independent of each other, this method is also
referred to as __________.
a coordinated checkpointing b independent checkpointing
d All of these
Q.11 __________ protocols assume that a failure can occur after any non-deterministic
event in the computation.
a Optimistic logging b Pessimistic logging
c uncoordinated d none
Q.13 The most recent consistent global checkpoint is termed as the __________.
a domino effect b recovery line
Q.14 The checkpoints that a process takes independently are __________ checkpoints
while those that a process is forced to take are called forced checkpoints.
a global b communication-induced
c local d uncoordinated
Solved Model Question Paper - 1
Time : 2 Hours] [Maximum Marks : 70
N.B : i. Attempt Q.1 or Q.2, Q.3 or Q.4, Q.5 or Q.6, Q.7 or Q.8.
ii. Neat diagrams must be drawn wherever necessary.
iii. Figures to the right side indicate full marks.
iv. Assume suitable data, if necessary.
Q.1 a) Explain Berkeley Algorithm. [Refer section 3.2.1] [4]
b) Explain happened before relationship in a distributed system for logical clock.
[Refer section 3.3] [6]
c) What is election algorithm ? Explain Bully Algorithm.
[Refer sections 3.5 and 3.5.1] [8]
OR
Q.2 a) Explain Maekawa's Voting Algorithm. [Refer section 3.4.7] [4]
b) Explain the following terms :
i) Drift rate ii) Clock skew iii) Coordinated universal time. [Refer section 3.1] [6]
c) Discuss central server algorithm. Explain performance metrics for mutual exclusion
algorithms. [Refer section 3.4] [8]
Q.3 a) Explain desirable features of a good naming system. [Refer section 4.1.1.1] [4]
b) How does a resolver look up a remote name ? [Refer section 4.3.3] [6]
c) What are distributed hash tables ? Explain Chord with finger table.
[Refer section 4.2.3] [7]
OR
Q.4 a) What is LDAP ? Why use LDAP ? [Refer section 4.4.2] [3]
b) Explain distributed file system requirements. [Refer section 4.5.2] [6]
c) What is NFS ? List goals of NFS design. Draw and explain NFS architecture.
[Refer section 4.7] [8]
Q.5 a) What is replica management ? [Refer section 5.4] [4]