Modeling and Analysis of Redundancy Management in Distributed Object-Oriented Systems by Using UML Statecharts
3 Modeling of redundancy structures

We are interested in the dependability model available in the early phases of the system design, when decisions can be made about the architecture of the system and the applied redundancy scheme (including the replication, recovery and repair strategy). Accordingly, we focus on the architecture of the system and abstract from the implementation details of replica consistency, checkpointing and recovery. The main source of information is the object model of the system (available as UML class, object and deployment diagrams) and the dynamic model of the core parts of the system (available as UML statechart diagrams).

The availability of the object group (presented in the previous section) is determined by several factors and parameters. In the following, we outline how they can be represented in (and then extracted from) the UML model.

- Reliability parameters of the replica objects, the hosts and the local fault detectors: The dynamic behavior of these objects is not analyzed. The designer assigns the failure rate and the average time required for recovery to these objects as UML tagged values.
- Reliability parameters of the global components of the infrastructure: These are modeled in the same way as the replica objects. The roles of the objects are identified by UML stereotypes. Active replication of these objects is taken into account by assigning to them a fault tree consisting of an AND gate (i.e. an infrastructure object fails if all of its replicas fail). Note that this replication style can be refined.
- Replication style: This is the responsibility of the gateway, which processes the request of the client and the response(s) of the server object(s). The statechart of the gateway determines the condition of system failure. Usually, this function can be represented by a fault tree.
- Repair and recovery strategy: This is the responsibility of the RM, which processes events from the fault notifier and from the factory, and sends messages to the factory and to the replica objects. The processing of the failure/repair events, defined by the statechart diagram of the RM, determines the repair and recovery strategy.

Accordingly, the most sophisticated model in the redundancy structure is that of the RM. The full modeling power of statecharts is required to describe the conditions and sequences of events/actions that determine when a failed object is recovered, how many replicas are maintained, under which condition an object is removed from the replica group, etc. Note that the statechart model of the RM describes only dependability-related behavior; no application-specific details are included. Thus, there is no need to filter out irrelevant states or transitions.
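This section introduces two kinds of reliability parameters. The first is the failure rate and average recovery time attached as tagged values. The second is the AND-gate fault tree for actively replicated infrastructure objects. Together they already allow a first, static estimate before any SRN is built.

The sketch below is illustrative only and rests on two assumptions:
- A component alternates between up periods with mean 1/(failure rate) and recovery periods with the given mean, so its long-run availability is MTTF/(MTTF + MTTR).
- Replicas fail independently, which is what the AND-gate evaluation requires.

The function names and the example values are hypothetical.

```python
from math import prod
from typing import Iterable


def replica_availability(failure_rate: float, mean_recovery_time: float) -> float:
    """Long-run availability of one component from its two tagged values.

    Assumes the component alternates between up periods with mean
    1 / failure_rate and recovery periods with mean mean_recovery_time,
    so that A = MTTF / (MTTF + MTTR).
    """
    if failure_rate < 0 or mean_recovery_time < 0:
        raise ValueError("failure rate and recovery time must be non-negative")
    if failure_rate == 0:
        return 1.0  # a component that never fails is always available
    mttf = 1.0 / failure_rate
    return mttf / (mttf + mean_recovery_time)


def and_gate_availability(replica_availabilities: Iterable[float]) -> float:
    """Availability of an actively replicated object modeled by an AND gate.

    The object fails only if all of its replicas have failed; replica
    failures are assumed to be statistically independent.
    """
    probability_all_down = prod(1.0 - a for a in replica_availabilities)
    return 1.0 - probability_all_down


# Hypothetical tagged values: failure rate 1e-3 per time unit, recovery time 10.
single = replica_availability(failure_rate=1e-3, mean_recovery_time=10.0)
triplicated = and_gate_availability([single] * 3)
```

With these values a single replica is available roughly 99% of the time, and three independent replicas behind an AND gate raise this to about 0.999999. The independence assumption is exactly what the later SRN analysis no longer needs, because there the RM's repair policy couples the replicas.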
Table 1. Events sent and received by the RM

Event           From (sender)    To (receiver)   Function
create          RM               factory         Create an object
initialized     factory          RM              Object is created and initialized
remove          RM               factory         Remove an object
removed         factory          RM              Object was removed
object fault    fault notifier   RM              Failure of an object
host fault      fault notifier   RM              Failure of a host
recover         RM               replica         Initiate recovery
recovered       fault notifier   RM              Object is recovered
unrecoverable   fault notifier   RM              Object cannot be recovered
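Table 1 fixes the interface of the RM. As an illustration only (this is not part of the described tool chain), the table can be encoded as data. Two things can then be derived mechanically from it: the Table 1 events that the RM's event queue must accept, and the events the RM emits as actions. The names TABLE_1, rm_inputs and rm_outputs are hypothetical.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    name: str
    sender: str
    receiver: str
    function: str


# Rows of Table 1, in table order.
TABLE_1 = [
    Event("create", "RM", "factory", "Create an object"),
    Event("initialized", "factory", "RM", "Object is created and initialized"),
    Event("remove", "RM", "factory", "Remove an object"),
    Event("removed", "factory", "RM", "Object was removed"),
    Event("object fault", "fault notifier", "RM", "Failure of an object"),
    Event("host fault", "fault notifier", "RM", "Failure of a host"),
    Event("recover", "RM", "replica", "Initiate recovery"),
    Event("recovered", "fault notifier", "RM", "Object is recovered"),
    Event("unrecoverable", "fault notifier", "RM", "Object cannot be recovered"),
]


def rm_inputs(table: List[Event]) -> List[str]:
    """Events the RM receives, i.e. those its event queue has to accept."""
    return [e.name for e in table if e.receiver == "RM"]


def rm_outputs(table: List[Event]) -> List[str]:
    """Events the RM generates as actions of its statechart transitions."""
    return [e.name for e in table if e.sender == "RM"]
```

Keeping the interface as data means that the trigger events of the RM statechart and the event types of its queue subnet can be checked against a single source.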
Without restricting the interfaces and the implementation of the system, we assume that the events presented in Table 1 are processed by the RM.

4 Dependability analysis by model transformation

4.1 Dependability sub-models

Our dependability model consists of several sub-models, as follows.

- The core part of the dependability model, which represents the replica management, is generated by an automatic model transformation from the statechart of the RM to a stochastic Petri net model.
- The replica objects and the global components of the infrastructure are represented by simplified models that are stored in a library of predefined sub-models. Note that the simplified models can be replaced by detailed ones when the dependability-related behavior of these components is fully described by statecharts (e.g. in the case of the fault notifier). In this case the automatic model transformation can be used as well.
- The connection among these sub-models, as defined in the UML object diagram, is provided by the event processing mechanism, i.e. the event queues and dispatchers belonging to the objects.

In the following, we outline the model transformation necessary to construct the core part of the dependability model.

4.2 From UML statecharts to stochastic Petri nets

Our analysis of the redundancy management is based on a transformation from UML statechart models to Petri nets with timing and stochastic extensions. Petri nets (PN) are a widely accepted formalism for modeling and analysis of distributed systems. For performance and dependability evaluation, extensions of PNs such as Generalized Stochastic Petri Nets [1] and Stochastic Reward Nets [12] offer not only a precise mathematical background but also sophisticated analysis tools. We chose the class of Stochastic Reward Nets (SRN). SRNs generalize classical PNs by rewards (various measures) and by assigning guards and firing-time distributions to transitions. Three SRN tools, SPNP [6], PANDA [2] and DEEM [5], were used in our analysis environment. Dependability measures can be specified by reward functions. In certain cases (e.g. with exponential transition firing times) an analytic solution is possible; otherwise simulation has to be performed. If a steady state exists, steady-state measures can be computed; otherwise transient analysis can be executed. Analyzing the probability of the states identified as representing erroneous behavior yields reliability characteristics (if no repair is modeled) or availability characteristics (if repair is modeled).

Correct SRN representation of the statechart with event processing and state hierarchy requires a thorough analysis of the semantics of both models. Our transformation was defined in a modular way, by introducing a set of SRN patterns. These patterns are assigned to specific constructs (like the event dispatcher) or concepts (like state hierarchy and synchronization) of the UML statechart formalism. In this way they help in decomposing the problem and also in proving the correctness of the proposed solutions (according to the informal requirements of the UML semantics as defined in the standard [14]). The source models of the transformation described in this paper are restricted to UML statecharts without history states. Actions are restricted to the generation of new events, and events do not have parameters.

According to the semantics of UML statecharts (presented informally in [14] and formalized in [9]), several specific concepts have to be taken into account. We discuss them in the following.
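The analysis options summarized in Section 4.2 can be made concrete on the smallest possible model: a single component that is either up or down. These options are an analytic solution for exponential firing times, transient versus steady-state measures, and reliability versus availability depending on whether repair is modeled. With exponential firing times such a net corresponds to a two-state continuous-time Markov chain (CTMC) with generator Q. Its transient state probabilities are p(t) = p(0)·exp(Qt).

The sketch below evaluates exp(Qt) by uniformization, one standard technique for transient CTMC analysis. No claim is made about which method SPNP, PANDA or DEEM actually use, and the rates are hypothetical. Removing the repair transition makes the down state absorbing. The probability of the up state then becomes the reliability R(t) rather than the availability A(t).

```python
import math


def transient_probabilities(Q, p0, t, tol=1e-12, max_terms=100_000):
    """Return p(t) = p0 * exp(Q t) for a small CTMC, computed by uniformization."""
    n = len(Q)
    q = max(-Q[i][i] for i in range(n))   # uniformization rate >= every exit rate
    if q == 0.0 or t == 0.0:
        return list(p0)
    if q * t > 700.0:
        raise ValueError("q*t too large for this naive sketch (exp underflow)")
    # Transition matrix of the uniformized discrete-time chain: P = I + Q / q
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)] for i in range(n)]
    term = list(p0)                  # p0 * P^k, starting with k = 0
    weight = math.exp(-q * t)        # Poisson probability of exactly k jumps
    result = [weight * x for x in term]
    covered, k = weight, 0
    while 1.0 - covered > tol and k < max_terms:
        k += 1
        term = [sum(term[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= q * t / k
        covered += weight
        result = [r + weight * x for r, x in zip(result, term)]
    return result


# Hypothetical rates per time unit: failure lam, repair mu. State 0 = up, 1 = down.
lam, mu = 1e-3, 1e-1
UP = 0
with_repair = [[-lam, lam], [mu, -mu]]        # repair returns the component to up
without_repair = [[-lam, lam], [0.0, 0.0]]    # down is absorbing: no repair

t = 500.0
availability = transient_probabilities(with_repair, [1.0, 0.0], t)[UP]
reliability = transient_probabilities(without_repair, [1.0, 0.0], t)[UP]
steady_state = mu / (lam + mu)                # limit of the availability as t grows
```

For this model the numbers can be checked against the closed forms A(t) = mu/(lam+mu) + lam/(lam+mu)·exp(-(lam+mu)t) and R(t) = exp(-lam·t). The naive Poisson weighting underflows when q·t is large, so the function rejects such inputs. Real tools handle that case more carefully.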
4.2.1 Event queue and dispatcher

The events arriving from the environment or from the state machine specified by the statechart itself are collected in the queue and dispatched by the dispatcher one at a time. Event queues provide the interfaces among state machines belonging to different objects. UML defines precisely neither the policy of the dispatcher nor the number and distribution of event queues. We have therefore defined patterns for several policies and leave it to the designer to specify the details in the UML model (by using constraints). If the events are selected non-deterministically, the queue can be implemented with SRNs quite easily. However, FIFO (First In, First Out) is a costly policy in terms of the size and state space of the SRN. Figure 2 presents a FIFO queue for two events, "up" and "down".

[Figure 2. SRN pattern of a FIFO queue of two events]

4.2.2 Hierarchy of states and transitions

One important feature of statecharts is the hierarchical structure of states. States can contain sub-states or concurrent sub-machines. Transitions of a statechart may have their source and target states at different levels of the state hierarchy. Due to concurrency, multiple transitions (triggered by the same event) may be enabled at the same time. Enabled transitions that have common state(s) to exit are in conflict. Some conflicts can be resolved by the priority relation: if a transition has a source state that is a substate of the source state of another, conflicting transition, then it has higher priority.

From the point of view of priority, enabled transitions can be represented in the form of a tree according to their source states in the state hierarchy. Transitions on different branches of this tree can fire independently, while the conflicts of transitions on the same path from the root to a leaf are resolved by the priority relation. Conflicts among transitions emanating from the same state or from different active states, over which the priority relation is not defined, are resolved non-deterministically.

Accordingly, the SRN representing the maximal selection of UML transitions triggered by the same event is a tree of interconnected sub-SRNs (each of them representing a single UML transition) with an additional control structure. This control structure consists of two chains of places. A token runs on one of the chains when the event is "not yet consumed" by the transitions on the given arc of the tree, and a token runs on the other chain when the event is "already consumed". A joining node of the tree merges the chains of the subtrees. All of the UML transitions in the subtree have higher priority than any transition along the common path of the tree (on the root side of the joining node). Therefore "event is unconsumed" applies to this common path if and only if the event was not consumed by any of the transitions of the subtree. "Event is consumed" applies to the common path when some of the transitions of the subtree have already fired (they carried the tokens over to the "consumed" chain) and the other transitions could not fire (they passed the tokens on along the chain). This construction ensures that once the token representing the event reaches the root of the tree, no more sub-SRNs corresponding to transitions of the statechart will fire, and the execution step is almost finished.

The relationship of timing and guard evaluation is not specified in standard UML. In our approach, a time delay is associated with UML transitions, assuming that this delay is the result of program code execution or communication delay. Accordingly, the guard expressions have to be evaluated before the firing of the (timed) transitions. Taking into account the needs of different applications, we implemented three possible semantics for timed and guarded UML transitions: (i) the selection of the transitions is irrespective of timing, (ii) the guard has to be true during the delay, else the
transition will be deselected, and (iii) the "fastest" enabled transition wins.

The corresponding SRN patterns represent the timed transitions of the statechart. The types and parameters of the timed SRN transitions correspond to those of the transformed statechart transitions. The timing policy (resampling, race with age/enabling memory) is defined by the designer and must be supported by the SRN tool used for the analysis.

4.2.4 Step semantics

The transitions of a UML statechart fire in steps, i.e. a stable state configuration is reached only when the maximal set of enabled transitions has fired. In contrast, an SRN reaches a stable state after each firing. The UML semantics requires the guards of the transitions to be evaluated at the beginning of a step, before any transition fires. Thus the guards refer to the consistent state configuration before the actual step. In SRNs, the guard of a transition is evaluated just before the given transition fires; the evaluation is not scheduled at the beginning of a "step", and the "results" are not stored. Accordingly, the last stable state configuration of the state machine must be recorded to ensure the correct evaluation of guards. To do that, the places representing the states of the statechart are duplicated. For each state there is one place containing a token if and only if the state was active just before the actual step, and another place containing a token if and only if the state will be active after the actual step. In this way the guards and the transitions changing the state refer to different places. This concept necessitates a synchronization of the duplicated places at the end of each step.

4.2.5 Composition of subnets

The SRN corresponding to a given UML statechart is composed of the subnets introduced in the previous subsections. The subnets are connected with each other using interface places.

The initial state of the SRN is defined as follows. If the event queue contains events in the initial state, these events are represented by the initial marking of the appropriate places of the event queue subnet. The initial state configuration is mapped to the SRN by inserting tokens into the corresponding place pairs.

5 An example

Our illustrative example is a model of the architecture of a distributed object-oriented system. The application (e.g. an e-commerce application or a hospital patient monitoring system) cannot tolerate long unavailability of the service provided by the system. To meet this requirement, active object-level replication is used. In the following presentation we focus on the model of the RM. The model is completed by simplified models for the infrastructure objects and the server replicas.

5.1 Model of the replication management

In our model the RM creates 2 replicas when the server object group is constructed. In case of a host failure, the replica deployed on that host is removed from the group and a new one is created on a different host. We do not focus on the repair of hosts; thus we assume here that the number of hosts is not limited. In case of an object failure, the RM initiates the recovery of the replica. If the recovery is not successful, the replica is removed. When the failure of a just-recovered replica is reported, the replica is removed without another recovery attempt. The replica is also removed when there are no other replicas working. The services of the replicated objects are available as long as there is at least one working replica.

[Figure 3. Statechart model of the redundancy manager. States: CREATING, WORKING (with sub-states OK and SUSP.), RECOVERING, REMOVING. Transition labels: -/create, initialized/-, host-fault/remove, obj-fault with [B.WORKING]/recover or [!B.WORKING]/remove, obj-fault/remove, T/-, recovered/-, unrecoverable/remove, removed/create]

The statechart model of the RM consists of 2 concurrent, identical sub-machines (A and B) supervising the 2 replicas. Figure 3 depicts the statechart of one sub-machine (A). (For simplicity, the figure omits a "time-out, retry on a different host" mechanism.)

When starting, the RM sends the event create to the factory of the chosen host. When it has received the message about successful construction (initialized), it considers the given replica working. In state Working two events trigger transitions. The event host-fault moves the component to the state Removing, sending the event remove to the factory of the given host. The event object-fault moves the component either to the same state (Removing), sending remove, or to the state Recovering, sending recover to the object replica. The choice depends on the state of the other replica (B). In the state Recovering, the local fault notifier of the host can report the successful recovery of the object, in which case the component moves back to a sub-state of its state Working.
The component leaves this sub-state when a timer has expired (the timer is a separate object). If another object-fault occurs before this time, the event remove is sent to the factory of the replica. The local fault notifier of the host may report the object as unrecoverable by sending the event unrecoverable to the RM. In this case remove is sent to the factory of the replica. In the state Removing, the RM waits for the event removed from the factory; after receiving it, the RM starts again by sending create.

The RM is assumed to have a FIFO event queue of length 6, capable of accepting 12 different events.

accepting events create and remove and sending events initialized and removed.

A host failure has an effect only if an object replica is deployed on that host.

The fault notifier collects information about the state of the object replica and the host with some delay, and forwards these events to the RM. Its SRN model was constructed from the UML statechart model by using the transformation.
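The RM behavior of Section 5.1 (Figure 3) can be prototyped as a plain simulator before any SRN is generated. This is a cheap way to check a statechart against the intended recovery policy. The sketch below is neither the paper's transformation nor an SRN. It only mirrors three ideas from Section 4.2:
- a bounded FIFO queue that dispatches one event at a time;
- two concurrent sub-machines, A and B, reacting to the same dispatched event;
- guards evaluated on the configuration recorded before the step, which is the role of the duplicated places in Section 4.2.4.

The sketch makes the following assumptions:
- Event names use underscores.
- The timer expiry T of Figure 3 is an event called timeout.
- An event arriving at a full queue is discarded.
- An event posted without a target is a hypothetical broadcast to both sub-machines.
- Only transitions described in the text are encoded. The omitted time-out/retry mechanism is left out.

All identifiers are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field

WORKING = {"OK", "SUSPICIOUS"}  # sub-states of Working in Figure 3

# (source state, event) -> ordered alternatives (guard, target state, action).
# A guard receives the other sub-machine's state; None means "no guard".
# The hierarchy is flattened: transitions leaving Working are copied to both
# sub-states, and SUSPICIOUS keeps its own obj_fault transition, which takes
# priority over the one of Working (Section 4.2.2).
TRANSITIONS = {
    ("CREATING", "initialized"): [(None, "OK", None)],
    ("OK", "host_fault"): [(None, "REMOVING", "remove")],
    ("SUSPICIOUS", "host_fault"): [(None, "REMOVING", "remove")],
    ("OK", "obj_fault"): [
        (lambda other: other in WORKING, "RECOVERING", "recover"),
        (lambda other: other not in WORKING, "REMOVING", "remove"),
    ],
    ("SUSPICIOUS", "obj_fault"): [(None, "REMOVING", "remove")],
    ("SUSPICIOUS", "timeout"): [(None, "OK", None)],
    ("RECOVERING", "recovered"): [(None, "SUSPICIOUS", None)],
    ("RECOVERING", "unrecoverable"): [(None, "REMOVING", "remove")],
    ("REMOVING", "removed"): [(None, "CREATING", "create")],
}


@dataclass
class RedundancyManager:
    queue_length: int = 6          # FIFO queue of length 6, as assumed in Section 5.1
    snapshot_guards: bool = True   # evaluate guards on the pre-step configuration
    state: dict = field(default_factory=lambda: {"A": "CREATING", "B": "CREATING"})
    queue: deque = field(default_factory=deque)
    # Actions sent so far; the initial -/create transition fires for both replicas.
    sent: list = field(default_factory=lambda: [("create", "A"), ("create", "B")])

    def post(self, event, target=None):
        """Append an event to the FIFO queue; drop it if the queue is full."""
        if len(self.queue) >= self.queue_length:
            return False
        self.queue.append((event, target))
        return True

    def dispatch(self):
        """Dispatch the oldest event and run one step of the sub-machines."""
        if not self.queue:
            return False
        event, target = self.queue.popleft()
        before = dict(self.state)  # last stable configuration
        for me in ([target] if target else ["A", "B"]):  # None: hypothetical broadcast
            other = "B" if me == "A" else "A"
            # Snapshot: read the other machine's state from before the step;
            # otherwise read its current (possibly just-updated) state.
            seen = before[other] if self.snapshot_guards else self.state[other]
            for guard, target_state, action in TRANSITIONS.get((before[me], event), []):
                if guard is None or guard(seen):
                    self.state[me] = target_state
                    if action:
                        self.sent.append((action, me))
                    break
        return True

    def available(self):
        """The service is up while at least one replica is in Working."""
        return any(s in WORKING for s in self.state.values())
```

The hypothetical broadcast shows why the step semantics matters. Suppose both replicas are in OK when an obj_fault reaches both sub-machines in one step. With snapshot_guards=True, each sub-machine sees the other in OK, so both move to RECOVERING. With snapshot_guards=False, sub-machine B already sees A in RECOVERING, so it takes the remove branch into REMOVING.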
[Figure 4: diagram with elements labeled Factory, Host, Object, recovery and Fault notifier]

5.2 Measurement results

are 24,406 transitions among them.

Transient analysis. The analysis answers the question of what the probability is of having at least one (or two) working object replicas. In the early design phase, timed SRN transitions with exponential distributions are usually used in the model, and the designer estimates the parameters of the distributions. This assumption enables an analytical solution of the model. Here we assumed the following parameters:
[Figure 5. Probability of having working replicas. Curves: "One selected replica working" and "Both replicas working"; x-axis: Time [time unit], 0 to 1000; y-axis: Probability of done [%], 94 to 99]

However, the analysis of the time delay during which an object is considered suspicious is trickier. Assuming exponential distributions for all timed activities, the analysis shows that the availability of the system is not sensitive to the parameter of this time delay (as neither the length of the interval nor the number of failures in this interval is bounded). Naturally, this result points to a weakness of the exponential timing assumption, not to an inappropriate fault handling policy. The analysis can be performed correctly by using SRN models with deterministic timed transitions.

6 Conclusion

We showed in this paper that complex, application-dependent replication strategies of distributed object-oriented systems can be analyzed automatically. The analysis can be performed in an early design phase, when the structure of the system and the behavior of the replication manager are defined. On the one hand, the hosts, infrastructure objects and server object replicas can be represented by simplified dependability sub-models (their detailed behavior need not be specified). On the other hand, the designer can use the full power of UML statecharts to describe the core part of the redundancy management, i.e. the behavior of the RM. The statechart of the RM is transformed into an SRN dependability model, which is completed by the other sub-models and analyzed by off-the-shelf tools. The optimal replication management can be selected by modeling alternative behaviors of the RM, executing the automatic model transformation and then the dependability analysis.

References

[1] M. Ajmone Marsan. Stochastic Petri nets: An elementary introduction. In G. Rozenberg, editor, Advances in Petri Nets, LNCS 424, pages 1–29. Springer Verlag, 1991.
[2] S. Allmaier and S. Dalibor. Panda - Petri net ANalysis and Design Assistant. In Tools Descriptions, 9th Int. Conf. on Modeling Techniques and Tools for Computer Performance Evaluation (Tools'97), St. Malo, France, 1997.
[3] A. Bondavalli, M. Dal Cin, D. Latella, and A. Pataricza. High-level Integrated Design Environment for Dependability (HIDE). In Proc. Fifth Int. Workshop on Object-Oriented Real-Time Dependable Systems (WORDS-99), Monterey, California, USA, November 18-20, 1999.
[4] A. Bondavalli, I. Majzik, and I. Mura. Automatic dependability analysis for supporting design decisions in UML. In Proc. HASE'99, Fourth IEEE Int. Symposium on High Assurance Systems Engineering, 1999.
[5] A. Bondavalli, I. Mura, S. Chiaradonna, R. Filippini, S. Poli, and F. Sandrini. DEEM: A tool for the dependability modeling and evaluation of multiple phased systems. In Proc. IEEE Int. Conf. on Dependable Systems and Networks (DSN), New York, June 26-28, 2000.
[6] G. Ciardo, J. Muppala, and K. S. Trivedi. SPNP - stochastic Petri net package. In Proc. IEEE 3rd Int. Workshop on Petri Nets and Performance Models (PNPM'89), pages 142–151, Kyoto, Japan, 1989.
[7] G. Huszerl and I. Majzik. Quantitative analysis of dependability critical systems based on UML statechart models. In Proc. HASE 2000, Fifth IEEE Int. Symposium on High Assurance Systems Engineering, 2000.
[8] Isis Distributed Systems Inc. and Iona Technologies Ltd. Orbix+Isis Programmer's Guide, 1995.
[9] D. Latella, I. Majzik, and M. Massink. Automatic verification of UML statechart diagrams using the SPIN model-checker. Formal Aspects of Computing, 11(6):637–664, 1999.
[10] S. Maffeis and D. C. Schmidt. Constructing reliable distributed communication systems with CORBA. IEEE Communications Magazine, 14(2), 1997.
[11] L. E. Moser, P. M. Melliar-Smith, P. Narasimhan, L. Tewksbury, and V. Kalogeraki. The Eternal system: An architecture for enterprise applications. In Proc. Int. Enterprise Distributed Object Computing Conf., pages 214–222, 1999.
[12] J. K. Muppala, G. Ciardo, and K. S. Trivedi. Stochastic reward nets for reliability prediction. Commun. in Reliability, Maintainability and Serviceability, 1(2):9–20, July 1994.
[13] Object Management Group. Fault tolerant CORBA specification v1.0, ptc/2000-04-04. http://www.omg.org/, 2000.
[14] OMG. UML Semantics, version 1.1. Object Management Group, September 1997.