[Figure 1: Transis performance: messages per second and utilization as a function of message size (bytes).]
a process can join groups and multicast messages to groups. Using Transis, messages can be addressed to the entire process group by specifying the group name (a character string selected by the user). The group membership changes when a new process joins or leaves the group, a processor containing processes belonging to the group crashes, or a network partition or re-merge occurs. Processes belonging to the group receive a configuration change notification when such an event occurs.

Transis incorporates sophisticated algorithms for membership and reliable ordered delivery of messages that tolerate message omission, processor crashes and recoveries, and network partitions and re-merges. High performance is achieved by utilizing non-reliable broadcast or multicast where possible (such as on local area networks). Transis performance can be seen in Figure 1.

The Transis application programming interface (API) provides a connection-oriented service that, in principle, extends a point-to-point service such as TCP/IP to a reliable multicast service. The API contains entries that allow a process to connect to Transis, to join and leave process groups, to multicast messages to process groups, to receive messages and to disconnect.

Transis is implemented as a daemon. The Transis daemon handles the physical multicast communication. It keeps track of the processes residing in its computer which participate in group communication, and also keeps track of the computer's membership (i.e. connectivity). The benefits of this structure are significant. The main advantages in our context are:

- Message ordering and reliability are maintained at the level of the daemons and not on a per-group basis. Therefore, the number of groups in the system has almost no influence on system performance.

- Flow control is maintained at the level of the daemons rather than at the level of the individual process group. This leads to better overall performance.

- Implementing an open group semantics is easy (i.e. processes that are not members of a group can multicast to that group).
A process is linked with a library that connects it to the Transis daemon. When the process connects to Transis, an inter-process communication handle (similar to a socket handle) is created. A process can maintain multiple connections to Transis. A process may voluntarily join specific process groups on a specific connection. A message which is received can be a regular message sent by a process, or a membership notification created by Transis regarding a membership change of one of the groups to which this process belongs. Transis service semantics is described in [2, 9].
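To make this programming model concrete, the following C sketch shows how a client of a group communication service of this kind might be structured. The function names (gc_connect, gc_join, gc_multicast, gc_receive, gc_disconnect) and the message layout are hypothetical placeholders declared as extern stubs, not the actual Transis API; they only mirror the entries described above, with regular messages and membership notifications delivered on the same connection.

    /* Hypothetical client of a Transis-like group communication service. */
    #include <stdio.h>

    typedef enum { GC_REGULAR, GC_MEMBERSHIP } gc_msg_kind;

    typedef struct {
        gc_msg_kind kind;        /* regular message or membership notification */
        char        group[32];   /* group the message was addressed to         */
        char        body[1024];  /* payload or encoded membership view         */
        int         body_len;
    } gc_message;

    /* Assumed library entry points; the real Transis signatures may differ. */
    extern int  gc_connect(const char *daemon_name);      /* returns a handle */
    extern int  gc_join(int handle, const char *group);
    extern int  gc_multicast(int handle, const char *group,
                             const void *buf, int len);
    extern int  gc_receive(int handle, gc_message *out);  /* blocks           */
    extern void gc_disconnect(int handle);

    int main(void)
    {
        int handle = gc_connect("transis");   /* one of possibly many connections */
        gc_join(handle, "Cluster");           /* membership notifications follow  */

        const char hello[] = "hello group";
        gc_multicast(handle, "Cluster", hello, (int)sizeof(hello));

        for (;;) {
            gc_message m;
            if (gc_receive(handle, &m) < 0)
                break;
            if (m.kind == GC_MEMBERSHIP)
                printf("view change in group %s\n", m.group);
            else
                printf("message from group %s: %.*s\n",
                       m.group, m.body_len, m.body);
        }
        gc_disconnect(handle);
        return 0;
    }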
Transis has been operational for almost three years now. It is used by students of the Distributed Systems course at The Hebrew University and by the members of The High Availability Lab. Several projects were implemented on top of Transis, among them a highly available mail system, two types of replication servers and several graphical demonstration programs.

Ongoing work on the Transis project focuses, among other things, on security and authentication of users, which is important for useful distributed system management tools.

4 The Architecture

The architecture is composed of two layers as depicted in Figure 2. The bottom layer is Transis, our group communication toolkit, which provides reliable multicast and membership services. The top layer is composed of a management server and a monitor. Although we use Transis as our group communication layer, other existing toolkits such as Totem [4], Horus [13] or Newtop [7] could have been used.

The management server provides two classes of services: long-term services and short-term ones. Long-term services provide consistent semantics across partitions and over time. They are used for replication of network tables (maps) such as the password database, which are maintained on secondary storage. These services implement an efficient replica control protocol that applies changes on a per-update basis.
[Figure 2: The architecture: an external application on top of Transis at each node.]
Short-term services are reliable as long as the network is not partitioned and the management server does not crash. In case of a network partition or a server crash, the monitor and the management servers receive a notification from Transis. The application may be informed and may take whatever steps necessary. Short-term services include simultaneous task execution and software installation.

The monitor provides a user interface to the services of the management server. The monitor is a process which may run on any of the nodes that run Transis. Several monitors may run simultaneously in the network.

The management server runs as a daemon on each of the participating nodes. It is an event-driven program. Events can be generated by the monitor, another server or Transis.

Each server maintains a vector of contexts, with one entry for each monitor it is currently interacting with. Each entry contains (among other things) the current working directory of the server as set by the corresponding monitor.

The long-term services are a non-intervening extension of the current standard Unix NIS. Since the hosts NIS map repositories retain their original format, applications (e.g. gethostbyname) that use RPC to retrieve information from them are not changed. The service quality is improved because the replication scheme implemented by the management server guarantees consistency and is much more efficient compared to the ad-hoc solution provided by NIS.

The management server API contains the following entries (a sketch of one possible request encoding follows the list):

- Status. Return the status of the server and its host machine.

- Chdir. Change the server's working directory which corresponds to the requesting monitor.

- Simex. Execute a command simultaneously (more or less) on a number of specified hosts. The command is executed by each of the relevant servers relative to the working directory that corresponds to the initiating monitor.

- Siminst. Install a software package on a number of specified hosts. The installation is performed relative to the working directory that corresponds to the initiating monitor.

- Update-map. Update a map while preserving consistency between replicas.

- Query-map. Retrieve information from a map.

- Exit. Terminate the management server process.
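As a rough illustration of the entries above, and of the platform-independent representation discussed next, a request to the management server could be encoded along the following lines. The type names and fields are illustrative assumptions, not the prototype's actual format.

    /* Illustrative, platform-independent encoding of a management request. */
    typedef enum {
        MS_STATUS,      /* report server and host status         */
        MS_CHDIR,       /* set the per-monitor working directory */
        MS_SIMEX,       /* execute a command simultaneously      */
        MS_SIMINST,     /* install a software package            */
        MS_UPDATE_MAP,  /* consistent update of a replicated map */
        MS_QUERY_MAP,   /* read from a replicated map            */
        MS_EXIT         /* terminate the management server       */
    } ms_request_type;

    typedef struct {
        ms_request_type type;
        char monitor_group[64];  /* private group of the initiating monitor,
                                    used for sending back ACKs and results  */
        char argument[1024];     /* directory, command line, map entry, ... */
    } ms_request;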
In practice, sites may be heterogeneous both in terms of software (e.g. operating system) and hardware. We make use of a generic platform-independent representation for management commands and for the reports of their execution. This representation is the only format used for communication between the protocol entities. The Representation Converter (see Figure 2) is responsible for converting this generic representation into a platform-specific form. This architecture enables the support of new platforms with relative ease.

A prototype of the presented architecture was implemented on top of Transis and was tested in a cluster of Unix workstations. The code, developed in the C programming language, spans approximately 6500 lines. The table management protocol (the more sophisticated part) constitutes about half of the code.

5 Simultaneous Execution

The system manager may frequently need to invoke an identical management command on several machines. Potentially, the machines may be of different types. The activation of a particular daemon or script on several machines, or the shutdown operation of several machines, are good examples. Another example is the simultaneous monitoring of CPU load, memory usage and other relevant system parameters on all or part of the machines in a cluster.
Figure 3(a): The Monitor

    Initially:
        connect to Transis;
        join private group;
        join group Cluster;

    while (true) {
        m = receive();
        switch (m.type)
            case command from a user:
                NR = M;
                multicast(command, Cluster);
                while (NR ≠ ∅)
                    m = receive();
                    switch (m.type)
                        case view change message:
                            NR = NR \ (M \ m.M);
                            M = m.M;
                        case result of execution from a server:
                            NR = NR \ {server};
                            return the result;
            case view change message:
                M = m.M;
    }

    command can be one of the following: Chdir, Status, Simex or Exit.

Figure 3(b): The Management Server

    Initially:
        connect to Transis;
        join private group;
        join group Cluster;

    while (true) {
        m = receive();
        switch (m.type)
            case Chdir(dir) from the monitor M:
                contexts[M].working_dir = dir;
                send ACK to M's private group;
            case Status from the monitor M:
                get status of my machine;
                convert status to a system-independent form;
                send status to M's private group;
            case Simex(command) from the monitor M:
                convert command to a system-specific form;
                chdir(contexts[M].working_dir);
                result = execute(command);
                convert result to a system-independent form;
                send result to M's private group;
            case Exit:
                terminate my process;
    }
Figure 3(a) and Figure 3(b) present the pseudo-code of the relevant parts of the monitor and the management server respectively. The monitor maintains two sets: M and NR. M is the most recent membership of the group Cluster as reported by Transis. NR is the set of the currently connected management servers which have not yet reported the outcome of a command execution to the monitor.
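The only subtle step in the monitor's loop of Figure 3(a) is maintaining NR across view changes: a server that disappears from the membership will never report, so it must be dropped from NR as well. The following minimal C sketch of this bookkeeping uses small bit sets over server identifiers; this representation is an illustrative choice, not the prototype's data structure.

    /* Membership (M) and not-yet-reported (NR) sets as bit masks over server ids.
       Server ids are assumed to be smaller than the number of bits in a long.   */
    typedef unsigned long server_set;

    server_set M  = 0;   /* most recent membership of group Cluster */
    server_set NR = 0;   /* servers that have not reported yet      */

    /* A command is multicast: every currently connected server owes a result. */
    void on_command_sent(void) { NR = M; }

    /* A server reported its result. */
    void on_result(int server_id) { NR &= ~(1UL << server_id); }

    /* Transis delivered a new view new_M: servers in M but not in new_M are gone,
       i.e. NR = NR \ (M \ new_M), and then M = new_M.                            */
    void on_view_change(server_set new_M)
    {
        server_set departed = M & ~new_M;
        NR &= ~departed;
        M = new_M;
    }

    /* The command is complete (all results in, or reporters departed) when NR == 0. */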
It is easy to see how other tasks are integrated with the simultaneous execution task to form our proposed architecture.
6 Software Installation

Software installation and version upgrade constitute one of the most time-consuming system management tasks. In large heterogeneous sites which comprise tens or even hundreds of machines, there are often subgroups of computers with identical (or similar) architecture running copies of the same application software and operating system. Presently, it is a common practice to perform installation or upgrade by repeating the same procedure at all locations in the subgroup separately. Installation or upgrade procedures include the transfer of the packages, the execution of installation utilities and the update of relevant configuration files. Traditionally, all the above-mentioned operations are performed using the TCP/IP protocol. This approach is wasteful in terms of both bandwidth and time, due to the point-to-point nature of TCP/IP. In addition, repeating the same procedure many times is prone to human errors resulting in inconsistent installations.

In contrast, we use Transis to disseminate the relevant files to the members of the subgroup efficiently. We use the technique presented in Section 5 to execute the installation commands simultaneously at all the involved locations. Each command is submitted only once, reducing the possibility of human errors. Using the process group paradigm, the system administrator can dynamically organize hosts with the same installation requirements into a single multicast group.

Our installation protocol proceeds in the following steps. First, the monitor multicasts a message advertising the installation of a package P, the set Rp of its installation requirements (e.g. disk space, available memory, operating system version, etc.), the installation multicast group Gp and the target list Tp. Upon reception of this message, the management server joins Gp if the system which it controls conforms to Rp and belongs to Tp. When all the management servers from Tp have either joined Gp or reported that they will not participate in the installation, the monitor begins multicasting the files comprising the package P to the group Gp. Finally, the status of the installation at every management server is reported to the monitor. The Transis membership service helps detect hosts which may not have completed the installation due to a network partition or host crash.

The same protocol may later be repeated for a more restricted multicast group G′ ⊆ Gp.
The monitor questions the members of G′ about the missing files prior to the redistribution, and only the needed files are multicast to G′ in order to save bandwidth and time.
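To make the first step of the protocol concrete, the advertisement multicast by the monitor and the server-side decision to join the installation group could look roughly as follows. The field names, sizes and the local probe functions are assumptions for illustration only.

    #include <stdbool.h>
    #include <string.h>

    /* Advertisement for installing package P: requirements Rp, group Gp, targets Tp. */
    typedef struct {
        char package[64];        /* P                                  */
        long min_disk_kb;        /* part of Rp                         */
        long min_memory_kb;      /* part of Rp                         */
        char required_os[32];    /* part of Rp                         */
        char install_group[64];  /* Gp: multicast group for the files  */
        char targets[32][64];    /* Tp: names of the intended hosts    */
        int  num_targets;
    } install_advert;

    /* Assumed local probes; their implementation is host specific. */
    extern long free_disk_kb(void);
    extern long free_memory_kb(void);
    extern const char *local_os(void);
    extern const char *local_host(void);

    /* A management server joins Gp only if it is a target and meets Rp. */
    bool should_join(const install_advert *ad)
    {
        bool targeted = false;
        for (int i = 0; i < ad->num_targets; i++)
            if (strcmp(ad->targets[i], local_host()) == 0)
                targeted = true;
        return targeted
            && free_disk_kb()   >= ad->min_disk_kb
            && free_memory_kb() >= ad->min_memory_kb
            && strcmp(local_os(), ad->required_os) == 0;
    }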
7 Table Management

This section presents the protocol for efficient and consistent management of the replicated network tables, each of which represents a service. Servers which share replicas of the same table form the same service group (SG). A service group consists of an administratively assigned primary server and a number of secondary ones. For the sake of simplicity we will consider a single SG in the following discussion.

The primary server enforces a single total order on all the update messages inside the SG. This is achieved by forwarding each new update request from a client to the SG's primary. The primary creates an update message from the request, assigns it a unique sequence number, and multicasts this update message to the SG. Each secondary server applies the update messages to the SG's table in the order consistent with the primary's. This guarantees that all the servers in the same network component remain in a consistent state. If the network partitions, at most one component (the one that includes the primary) can perform new updates. Therefore, conflicting updates are never possible.

When a membership change (network partition or merge, or server crash) is reported by the group communication layer, the connected servers exchange information and converge to the most updated consistent state known to any of them. Note that this happens even if the primary is not a member of the current membership. The information exchange is done in two stages. In the first stage, the servers exchange state messages containing a vector representing their knowledge about the last update known to each server. In the second stage, the most updated server multicasts the updates that are missed by any member of the currently connected group. (If the primary server is present in a component, it will be the one performing the retransmission; otherwise, one of the most updated secondary servers is deterministically chosen.)

Each server logs all the update messages from the primary on non-volatile storage. This log is used for restoring a server's state when the server recovers from a crash. A server discards an update from the log when it learns that all the other servers have applied this update to their table (and hence, no server will need to recover that update in the future).

Data Structures

Each management server S ∈ SG maintains the following data structures:

- my_id: a unique identifier of S.

- p_id: the identifier of SG's primary server.

- MQ: a list of the updates received by S. MQ is retained on non-volatile storage.

- Vec: a vector of sequence numbers containing one entry for each of the SG's members. If Vec[i] = n then S knows that server i has all the updates up to n. Initially, all Vec's entries are 0. Vec is retained on non-volatile storage.

- SGT: the Transis group name of SG.

- Memb: the current membership of SGT as reported by Transis. This is a structure which contains a unique identifier of the membership (memb_id) and a set of currently connected servers (set).

- ARU ("all-received-up-to"): a sequence number such that S knows that all the updates with sequence numbers no greater than ARU were received and applied to the table by all the members of SGT. Note that ARU = min_{1≤i≤|Vec|}(Vec[i]).

- min_sn, max_sn: the minimal and maximal sequence numbers of update messages that need to be retransmitted upon a membership change.

- Memb_counter: a variable that counts the State messages during the information exchange upon a membership change.

Message Types

- Req: a new request to perform an update to the table. This request is sent by a client to one of the servers. The update operation is stored in the action field of this message.

- Upd: an update message multicast by SG's primary to SGT. This message carries a unique sequence number in the sn field in addition to the fields of a Req message.

- M: a membership change notification delivered by Transis. This message contains the same two fields as the Memb structure.

- State: a state message which carries the Vec and the identifier of the sender. This message is stamped with the membership identifier of the membership it was sent in.

- StateP: similar to the State message; it is used for garbage collection when the membership contains all the members of SG.

- Qry: a query message from a client.

In addition, a type field is included with each message. A sketch of one possible concrete layout of these structures and messages follows.
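The data structures and message types above can be summarized as the following C declarations. The concrete sizes and layout are illustrative assumptions; the prototype's actual definitions may differ.

    #define MAX_SERVERS 32

    typedef struct {            /* Memb: current membership of SGT             */
        long memb_id;           /* unique identifier of the membership         */
        int  set[MAX_SERVERS];  /* ids of the currently connected servers      */
        int  set_size;
    } membership;

    typedef struct {            /* per-server replication state                */
        int        my_id, p_id;         /* own id and primary's id             */
        long       Vec[MAX_SERVERS];    /* Vec[i]: highest update known at i   */
        long       ARU;                 /* min over Vec, "all-received-up-to"  */
        long       min_sn, max_sn;      /* retransmission range                */
        int        Memb_counter;        /* State messages still expected       */
        membership Memb;
        /* MQ, the logged list of updates, is kept on non-volatile storage.    */
    } server_state;

    /* M is named M_MSG here to avoid clashing with the membership set M above. */
    typedef enum { REQ, UPD, M_MSG, STATE, STATEP, QRY } msg_type;

    typedef struct {
        msg_type type;
        int      sender;
        long     sn;                    /* sequence number (Upd only)          */
        long     memb_id;               /* stamping membership (State/StateP)  */
        long     Vec[MAX_SERVERS];      /* sender's Vec (State/StateP)         */
        char     action[256];           /* update operation (Req/Upd)          */
    } message;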
The Pseudo Code

The following subsections present the pseudo-code of the table management protocol.
Request from a client

The server which receives an update request from a client forwards it to the primary server. The primary server creates an update message from this request, applies it to the SG's table and multicasts it to the group. Procedure handle-request details these steps.

    handle-request(m)
    {
        if (my_id == p_id) then
            Vec[my_id] = Vec[my_id] + 1;
            m.sn = Vec[my_id];
            m.type = Upd;
            append m to MQ;
            sync MQ and Vec to disk;
            apply m.action to SG's table;
            multicast(m, SGT);
        else if (p_id ∈ Memb) then
            send(m, p_id);
    }

Update from a server

A secondary server which receives an update message in the correct order applies the update to the table and changes its data structures accordingly. Procedure handle-update details these steps.

    handle-update(m)
    {
        if (my_id ≠ p_id and m.sn == Vec[my_id] + 1) then
            Vec[my_id] = m.sn;
            append m to MQ;
            sync MQ and Vec to disk;
            apply m.action to SG's table;
        else
            discard m;
    }

Membership change notification from Transis

Upon a membership change, the connected servers exchange information in order to converge to the most updated state. Procedure handle-membership prepares the data structures for this recovery process and multicasts a State message. Note that the State message contains Vec, representing the local knowledge regarding other servers' states.

    handle-membership(m)
    {
        Memb.set = m.set;
        Memb.memb_id = m.memb_id;
        min_sn = max_sn = Vec[my_id];
        Memb_counter = |Memb|;
        create a State message m′;
        multicast(m′, SGT);
    }

State message from a server

When a valid State message is received, the server updates its knowledge regarding other servers' knowledge. After all the State messages have been received, the needed update messages are retransmitted by the most updated server. If the primary server is a member of the current membership, it is selected as the most updated server; otherwise the most updated secondary server with the smallest identifier is selected using the procedure most-updated-server. Procedure handle-state details these steps.

    handle-state(m)
    {
        if (m.memb_id ≠ Memb.memb_id) then
            return;
        Vec = max(Vec, m.Vec);
        if (m.Vec[m.sender] < min_sn) then
            min_sn = m.Vec[m.sender];
        if (m.Vec[m.sender] > max_sn) then
            max_sn = m.Vec[m.sender];
        Memb_counter = Memb_counter - 1;
        if (Memb_counter == 0) then
            if (most-updated-server()) then
                for each m′ ∈ MQ s.t. m′.sn > min_sn do
                    multicast(m′, SGT);
    }

The most-updated-server procedure presented below returns true if the invoking server is the most updated server with the minimal identifier, and false otherwise.

    boolean most-updated-server()
    {
        for each i ∈ Memb.set s.t. i < my_id do
            if (Vec[i] == max_sn) then
                return false;
        if (Vec[my_id] == max_sn) then
            return true;
        return false;
    }

Garbage collection

In order to discard updates which are no longer needed, procedure collect-garbage is called upon the reception of either a State message or a StateP message. The StateP message is sent periodically if the membership contains all the members of the SG. The reason for having the StateP message is to avoid maintaining large amounts of updates that are no longer needed because each member of the SG has already applied them.

    collect-garbage(m)
    {
        Vec = max(Vec, m.Vec);
        new_ARU = min_{1≤i≤|Vec|}(Vec[i]);
        if (new_ARU > ARU) then
            for each m′ ∈ MQ s.t. m′.sn ≤ new_ARU do
                remove m′ from MQ;
            ARU = new_ARU;
            sync MQ and Vec to disk;
    }
Events handling

The following is the main loop of the table management part of the management server.

    Initially:
        connect to Transis;
        join group SGT;
        initialize all the Vec entries to 0;
        bring in MQ and Vec (if present) from disk;
        ARU = min_{1≤i≤|Vec|}(Vec[i]);

    while (true) {
        m = receive();
        switch (m.type)
            case Req:
                handle-request(m);
            case Upd:
                handle-update(m);
            case Qry:
                retrieve an answer from the local table;
                send the answer to the client;
            case M:
                handle-membership(m);
            case State:
                handle-state(m);
                collect-garbage(m);
            case StateP:
                collect-garbage(m);
    }

8 Conclusion

We have presented an architecture that utilizes group communication to provide efficient and reliable distributed system management. The common management tasks of simultaneous execution, software installation and table management were addressed. The resulting services are convenient to use, consistent in the presence of failures, and complementary to the existing standard mechanisms.

References

[1] Y. Amir. The Spread toolkit, 1995. Private communication.

[2] Y. Amir. Replication Using Group Communication Over a Partitioned Network. PhD thesis, Institute of Computer Science, The Hebrew University of Jerusalem, Israel, 1995.

[3] Y. Amir, D. Dolev, S. Kramer, and D. Malki. Transis: A communication sub-system for high availability. In Proceedings of the 22nd Annual International Symposium on Fault-Tolerant Computing, pages 76-84, July 1992. The full version of this paper is available as TR CS91-13, Dept. of Comp. Sci., The Hebrew University of Jerusalem.

[4] Y. Amir, L. E. Moser, P. M. Melliar-Smith, D. A. Agarwal, and P. Ciarfella. The Totem single-ring ordering and membership protocol. ACM Transactions on Computer Systems, 13(4), November 1995.

[5] N. Amit, D. Ginat, S. Kipnis, and J. Mihaeli. Distributed SMIT: System management tool for large Unix environments. Research report, IBM Israel Science and Technology, 1995. In preparation.

[6] P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems, chapter 7. Addison Wesley, 1987.

[7] P. Ezhilchelvan, R. Macedo, and S. Shrivastava. Newtop: A fault-tolerant group communication protocol. In Proceedings of the 15th International Conference on Distributed Computing Systems, May 1995.

[8] N. Huleihel. Efficient ordering of messages in wide area networks. Master's thesis, Institute of Computer Science, The Hebrew University of Jerusalem, Israel, 1996.

[9] L. E. Moser, Y. Amir, P. M. Melliar-Smith, and D. A. Agarwal. Extended virtual synchrony. In Proceedings of the 14th International Conference on Distributed Computing Systems, pages 56-65, June 1994.

[10] H. Stern. Managing NFS and NIS, chapters 2, 3, 4. O'Reilly & Associates, Inc., first edition, June 1991.

[11] Tivoli Systems Inc. Multiplexed Distribution (MDist), November 1994. Available via anonymous ftp from ftp.tivoli.com /pub/info.

[12] Tivoli Systems Inc. TME 2.0: Technology Concepts and Facilities, 1994. Technology white paper discussing Tivoli 2.0 components and capabilities. Available via anonymous ftp from ftp.tivoli.com /pub/info.

[13] R. van Renesse, K. P. Birman, R. Friedman, M. Hayden, and D. Karr. A framework for protocol composition in Horus. In Proceedings of the ACM Symposium on Principles of Distributed Computing, August 1995.