Stefan Wallin
Alarm and Service Monitoring of Large-Scale Multi-Service Mobile Networks
Division of Computer Science and Electrical Engineering
Stefan Wallin
ICT
Dept. of Computer Science and Electrical Engineering
Luleå University of Technology
Luleå, Sweden
Supervisor:
Associate Professor Christer Åhlund and Dr. Evgeny Osipov
Print: Universitetstryckeriet, Luleå
ISSN: 1402-1757
ISBN 978-91-86233-34-1
Luleå
www.ltu.se
And here he remained in such terror as none but
he can know, trembling in every limb, and the cold sweat starting from every pore, when
suddenly there rose upon the night-wind the noise of distant shouting, and the roar of
voices mingled in alarm and wonder. Any sound of men in that lonely place, even though
it conveyed a real cause of alarm, was something to him. He regained his strength and en-
ergy at the prospect of personal danger; and springing to his feet, rushed into the open air.
This thesis proposes solutions for alarm and service monitoring in large-scale
multi-service mobile networks.
The work on alarms is based on statistical analysis of data collected from a real-world
alarm flow and an associated trouble ticket database containing the network administra-
tors’ expert knowledge. Using data from the trouble ticketing system as a reference, we
examine the relationship between the original alarm severity and the human perception
of the alarm priority. Using this knowledge, we suggest a neural network-based approach
for alarm prioritization. Tests using live data show that our prototype assigns the same
severity as a human expert in 50% of all cases, compared to 17% for a naïve approach.
In order to model and monitor the services, this thesis proposes a novel domain-
specific language called SALmon, which permits efficient representation of service mod-
els, along with a computational engine for evaluation of service models. We show that
the proposed system is a good match for real-world scenarios with special requirements
around service modeling.
Contents
Chapter 1 – Thesis Introduction
1.1 Introduction
1.2 Research Topics
1.3 Related Work
Chapter 2 – Towards Better Network Management Solutions
2.1 Introduction
2.2 The Chaotic Alarms
2.3 Service Modeling
2.4 Contributions
2.5 Future Work
Chapter 3 – Summary of Publications
3.1 Overview of Publications
Paper A – Rethinking Network Management Solutions
1 Introduction
2 Challenges
3 Ways to Improve
Paper B – Telecom Network and Service Management: an Operator Survey
1 Introduction
2 Method
3 Current Status of Network Management
4 OSS Motivation and Drivers
5 The Future of OSS
6 Standards and Research Efforts
7 Discussion
8 Conclusion
Paper C – Multipurpose Models for QoS Monitoring
1 Introduction
2 Overview
3 Reference Architecture for a QoS Monitoring Solution
4 Related Work
5 Conclusion and Future Work
This thesis is based on close to 20 years of industrial experience working with mobile
operators in providing network and service management solutions. Many of these solu-
tions have been unnecessarily complex and costly and have not supported service and
customer-oriented functions to the desired degree. While frustration is not always the
best starting point, I thought it was time to both summarise my experience and to
consider some necessary changes in the field.
In order to focus on relevant subjects, I have worked with colleagues, partners and
customers – service and equipment providers – to understand the challenges and changes
that are needed. As such, this is the correct place to thank all the various friends who have
helped me gather vital information on the subject. Also, I wish to thank the operator
that provided me with a large database of alarms and trouble tickets.
Thanks to Christer Åhlund, my supervisor, who lets me work in a chaotic and flexible
mode of operation. Also, I wish to thank the whole team at LTU in Skellefteå for a great
working atmosphere. I do much of my work with Viktor Leijon, whose ability to act both
as a devil's advocate and as a supportive research colleague has pushed the results to a
higher level.
I’d like to thank my partners at Data Ductus, Urban Lundmark, Stefan Lundström
and Lennart Wiklund for letting me spend time on this project.
David Partain has provided both substantial insights and language reviews; I owe
him several pints. Thanks to Gary Webster as well for help with language reviews.
My research is based on struggling with network management challenges in the indus-
try. There would have been no challenges if Ingemar Häggström had not kept throwing
assignments on my desk from his different positions at Ericsson, Hewlett-Packard and
Ulticom.
Finally, and most important of all, my complete gratitude and love to my family:
Inger, Axel and Ellen, my everything.
Part I
Chapter 1
Thesis Introduction
1.1 Introduction
Network service providers have rapidly transitioned from providing a few services over
fixed lines with limited competition to multi-faceted Internet-based services with intense
competition. Not only do operators provide different access networks such as xDSL, optical
fiber, and mobile broadband, they are also struggling with new consumer services like
IP telephony and mobile TV. Traditional network operators are extending their business
into several countries with local competition, at the same time as new players, such as
utility companies, are entering the service provider arena. In Paper A, we look at how the
business model for operators has changed drastically while the corresponding business
and network support systems have trouble dealing with the new needs.
Based on our industrial experience we asked major telecom service providers to tell us
about their network management problems and expectations for the future. According
to this survey, presented in Paper B, the following requirements will drive the solutions
in the coming years: Service management and quality, Customer satisfaction, and Cost
reduction. One operator summarized the current status of service management as
follows:
Services are not currently managed well in any suite of applications and re-
quire a tremendous amount of work to maintain.
One of the main themes of this thesis is service quality. This is obviously also a major
contributor to achieving customer satisfaction.
We also asked the operators to identify the most important research areas. Alarm
correlation, self-healing, and auto-configuration were identified as the most urgent re-
search topics. Out of these we picked alarm correlation but with a focus on filtering and
prioritization rather than traditional correlation. This means that we see other ways of
reducing the number of alarms and at the same time providing better alarm quality.
In many cases, alarm messages go unaltered, unfiltered, and uncorrelated from the
network elements to the network management system. This leads to a chaotic situation
which needs to be improved before we can move into service and customer management.
The current alarm situation was expressed in the following way by one of the operators:
If we could find ways to deliver only the relevant alarms, this will result in cost
reductions and better use of expert network administrators.
Aside from these prioritized items, the operators mentioned interface integration,
service quality and service modeling as relevant research areas. Interface integration will
be part of our future work, while service quality and modeling are our current focus.
• Detection tries to find irregularities in the data and seeks to explain the cause. A
representative telecom application is to detect churn and fraud.
[Figure: standards (TM Forum SID, DMTF CIM, w3.org SML, IETF PCIM) and research efforts (SLAng) related to service modeling and policies.]
trouble ticket systems. They are in general less powerful with respect to service modeling
and depend on an external service model database in order to create the service instances.
We focus on a small language, dedicated to service modeling and capable of handling
both the static and dynamic aspects of service models.
data and not services and service parameters. However, in the long run we expect to see
YANG attributes as inputs to service models.
CIM, SML, and similar efforts focus on static class diagrams to express the service
model. They are not intended for monitoring, but rather to help with the construction
and composition of services, and tend to lack formally defined semantics. Our solution
adds the capability to express the functional QoS calculations and time-series of QoS
data. While some effort has been put into graphical modeling tools, our experience
is that it is easy to expect too much from them for service modeling.
This can to some degree be compared to the partial failure of graphical CASE tools
relative to standard programming methods [23].
1. The interaction and mapping between the abstract service model and the corre-
sponding service implementation.
To some degree we are addressing the first two mapping problems by parameter calcula-
tions and explicit support for lower-level QoS inputs. We also support service chaining
by references between service objects.
While Garschhammer establishes requirements for a generic service model, Gopal [25]
identifies requirements for a service modeling language such as aggregation of components
and calculations, and list-valued attributes, which we can express in SALmon. Marilly
et al. [26] identify a set of main challenges that must be met before SLA management can
materialize. We have addressed three out of four of these challenges, namely: information
model, scalability, and end-to-end view. Räisänen elaborates on service management and
mobile networks in [27]. Special attention is given to the resource allocation mechanisms
needed to provide the required quality of service. The author also points to the need for
an integrated, aggregated, customer-focused SLA across services.
SLAng [28] is a language focused on defining formal SLAs in the context of server
applications such as web services. It uses an XML formalism for the SLAs. SLAng
identifies fundamental requirements needed in order to capture SLAs but differs from
our current effort in that it “focuses primarily on SLAs, not service models in general”.
SLAng is being further developed by UCL.
Chapter 2
Towards Better Network
Management Solutions
2.1 Introduction
In this section we summarize our work on alarm and service management. At this point
we want to emphasize the link between the two areas. A service model needs inputs about
the status of the service from various sources. One of the sources is alarms. However, in
order to use alarms to indicate the service state, we need to get the right alarms and the
right severities for these alarms, rather than the current unreliable alarm quality. This is
why the alarm improvements presented in Section 2.2 are a stepping stone towards using
alarms as input to service models as outlined in Section 2.3.
The section on alarms, Section 2.2, is based on two articles [29, 30] of which the latter
is included in this thesis as Paper D. Section 2.3 on service modeling is based on three
articles [31, 2, 32], of which two are included in this thesis as Papers C and E.
sheer number of alarms to be managed results in a huge cost for network service providers.
The problem has been around for decades without much improvement. Over the years
we have seen standards and industry organizations chasing the silver bullet protocol [34]
without much attention to defining the meaning of management information.
After considering various attempts to define what an alarm is, we have adopted the
following:
At the core of this definition is that every alarm needs attention, manual investigation,
and possibly action. This interpretation is close to the origins of the term "alarm",
which stems from Old French a l'arme, "to weapons", telling armed men to pick up their
weapons and get ready for action.
Level 1, Syntax and grammar: the protocol and information modeling language used to
define the alarm interface.

Level 2, Semantics: the meaning of the individual alarm notification itself.

Level 3, Pragmatics: what is the meaning and effect of the alarm when using contextual
information?
In the rest of this section we will use the above taxonomy to describe our findings
and recommendations.
We compared the neural network to competing approaches and found that our proto-
type assigns the same severity as a human expert in 50% of all cases, compared to 17%
for the severity supplied in the original alarm.
This solution has several benefits:
• Priorities are available immediately as the alarms arrive.
All of these different perspectives have led us to use a general definition of Quality of
Service:
This lets us cover all the aspects mentioned above. With that as a base we designed
a domain-specific service modeling language and monitoring engine called SALmon.
system.status@(NOW-1h)
The expression can be used both as a right-hand value and as a left-hand value, the latter
to change a value retrospectively. Intervals of a time-variable can also be retrieved by
specifying a time range.
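As a minimal sketch, such a range expression could look as follows (the exact range syntax is an assumption, not taken from the original listings):

system.status@(NOW-24h .. NOW)

which would return the sequence of status values recorded over the last 24 hours.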
SALmon can be used to model the different aspects of QoS, as expressed in Figure 2.3.
In the example in Listing 2.1, we present two classes that represent the actual delivered
quality of service for a VOIP service and the associated network access. The inputs,
lines 2-3, are collected from voice probes and decoders. Note that this kind of data
is typically a time-series, and therefore the time-index of a variable
like rFactor can be used in expressions. R-Factor [39] is derived from metrics such
as latency, jitter, and packet loss and assesses the quality-of-experience for VoIP calls.
Typical scores range from 50 (bad) to 90 (excellent). The anchor on line 4 is a reference
from the VOIP service access point to the associated network access point. Lines 5-8
aggregate inputs from the network instance. The network service access point in line 10
follows the same pattern. It also defines some properties, which are static attributes
assigned at instantiation. In this example they indicate the type of network and the
DiffServ class.

Listing 2.1 (fragment, lines 10-16):

10  class NetworkSAP
      properties name, type, DS_Class;
12    input in_loss, in_jitter, in_delay, in_bandwidth;
      loss = in_loss;
14    jitter = in_jitter;
      delay = in_delay;
16    bandwidth = in_bandwidth;
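The first part of Listing 2.1 defines the VoipSAP class. Based on the description in the text (inputs from voice probes and decoders on lines 2-3, an anchor to the network access point on line 4, and aggregation of network inputs on lines 5-8), a sketch of what it could look like is:

 1  class VoipSAP
 2    input rFactor, dropRatio,
 3          codec;
 4    anchor NetworkSAP networkSAP;
 5    loss = networkSAP.loss;
 6    jitter = networkSAP.jitter;
 7    delay = networkSAP.delay;
 8    bandwidth = networkSAP.bandwidth;

Apart from rFactor, the input names in this sketch (dropRatio, codec) are placeholders, and the exact anchor syntax is an assumption.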
Listing 2.2 (fragment, line 6):

 6    rFactorStatus = linearThreshold voipSAP.rFactor 0 minRFactor;
The corresponding required quality of service is modeled in Listing 2.2. In this case,
properties represent the service level requirements, for example the required R-Factor.
From a modeling point of view, a service level is associated with the corresponding service
instance as shown in line 4. The rFactorStatus calculation in line 6 shows an example
of a QoS calculation in line with the Bloemer definition of quality of service in that it
evaluates the difference between delivered and required QoS (as a percentage).
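One plausible reading of this calculation (the exact semantics of linearThreshold are an assumption here) is a linear mapping of the delivered value onto the required range, roughly

rFactorStatus = 100 * (voipSAP.rFactor - 0) / (minRFactor - 0)

so that a delivered R-Factor equal to the required minimum yields 100%, and lower values reduce the status proportionally.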
We have implemented the SALmon language runtime and interpreter using the Java
J2SE framework [40], the ANTLR parser generator [41], and Berkeley DB [42].
SALmon is different from many of the upcoming products targeting the problem do-
main. We believe that a small and efficient domain-specific language will be successful.
With a dedicated language you can precisely express what you need and update your
models with a short turn-around time. While graphical approaches may seem attractive at
the outset, they often run into obstacles when confronted with the full complexity of the
operator's reality. We have verified SALmon against scenarios from our industrial experience [32],
and we see that the basic requirements for a service modeling language are expressed in
a straightforward way. Our approach also eases the transition from current monitoring
solutions, where most of the integration is performed using modern general-purpose languages
like Python. A SALmon-based approach will attract skilled integrators and give them a
tool where they can rapidly change and develop models.
2.4 Contributions
The contributions of this thesis include:
• A survey that covers 15 major telecom operators world-wide, including more than
100 million customers/subscribers. This survey forms a basis for deciding which
research topics warrant the most attention.
• We examine the basic alarm characteristics using real alarms from a large mobile
operator. The analysis identifies which alarm types are candidates for filtering and
automation, which would result in substantial savings for managing the network.
• Further, this thesis presents a neural network solution that automatically assigns
priorities to alarms in real time based on the knowledge of expert network adminis-
trators. The training data is automatically gathered using data-mining techniques
from the trouble-ticket database.
• Service management will be a strategic challenge for network operators in the com-
ing years. We have designed and implemented a prototype of an efficient domain-
specific language dedicated to service modeling and monitoring. We have shown
that the capabilities of the language cover relevant functional requirements and
large scale models.
This paper is the starting point for our research and this thesis. It summarizes the
problems we have observed, challenges and prioritized areas for improvement, and it is
primarily based on our industrial experience. The basic ideas for service modeling and for
knowledge management technologies to prioritize alarms were first outlined in this paper. I
wrote most parts of the paper.
[Figure: overview of the publications. The problem statement builds on my industrial background (Paper A, Rethinking Network Management Solutions) and a structured survey (Paper B, Telecom Network and Service Management: an Operator Survey). The solutions cover knowledge management for alarms (Paper D, Statistical Analysis and Prioritization of Telecom Alarms using Neural Networks) and service monitoring and modeling (Paper C, Multipurpose Models for QoS Monitoring; Paper E, SALmon - A Service Modeling Language and Monitoring Engine), with a service broker as a SALmon application.]
This paper describes the background for a service modeling and monitoring language.
The first sketches of the language are outlined at the end. In order to generalize the QoS
Year 2008
Telecom Service Providers are faced with an overwhelming flow of alarms. Network ad-
ministrators need to judge which alarms to resolve in order to maintain service quality.
We have prototyped a solution that uses neural networks to assign alarm priority. The
neural network learns from network administrators by using the manually assigned pri-
orities in trouble-tickets. Real alarms and trouble-tickets from a large operator are used.
I created the idea and wrote most of the paper. Landén implemented the solution as
his BSc thesis work.
Year 2009
This is an extended journal version of the above paper including an additional statistical
evaluation. I worked on a more extensive description of the alarm and data-mining
handling process. Leijon added the statistical evaluation.
Year 2008
This paper describes our service modeling and monitoring language after Ehnmark im-
plemented the first version as his MSc Thesis work.
Year 2008
The research group “Mobile Systems” at Luleå University in Skellefteå led by Christer
Åhlund works on methods for access network selection. This paper shows how SALmon
could be used in that context as the basis of a Policy Based architecture including service
and SLA monitoring. I wrote all sections related to SALmon and worked with Åhlund on
the overview sections. The final sections on mobility management are the work of Åhlund,
Andersson and Brännström.
From my research point of view, this paper picks up on an idea from Paper A in
regards to a service broker. It also illustrates how SALmon can be used in an overall
solution.
References
[2] C. Åhlund, S. Wallin, K. Andersson, and R. Brännström, “A service level model and
Internet mobility monitor,” Telecommunication Systems, vol. 37, no. 1, pp. 49–70,
2008.
[4] R. D. Gardner and D. A. Harle, “Methods and systems for alarm correlation,” in
Global Telecommunications Conference (GLOBECOM’96), vol. 1, 1996.
[8] I. Bose and R. K. Mahapatra, “Business data mining, a machine learning perspec-
tive,” Information & Management, vol. 39, no. 3, pp. 211–225, 2001.
[10] R. D. Gardner and D. A. Harle, “Alarm correlation and network fault resolution
using the Kohonen self organising map,” in Global Telecommunications Conference
(GLOBECOM’97), vol. 3, 1997.
[13] D. Levy and R. Chillarege, “Early Warning of Failures through Alarm Analysis-A
Case Study in Telecom Voice Mail Systems,” in Proceedings of the 14th International
Symposium on Software Reliability Engineering, p. 271, IEEE Computer Society
Washington, DC, USA, 2003.
[15] H. Wietgrefe, K.-D. Tuchs, K. Jobmann, G. Carls, P. Fröhlich, W. Nejdl, and S. Ste-
infeld, “Using neural networks for alarm correlation in cellular phone networks,” in
International Workshop on Applications of Neural Networks to Telecommunications
(IWANNT), 1997.
[18] Distributed Management Task Force, “CIM Specification.” Version 2.15.0, 2007.
[25] R. Gopal, “Unifying network configuration and service assurance with a service mod-
eling language,” Network Operations and Management Symposium, 2002. NOMS
2002. 2002 IEEE/IFIP, pp. 711–725, 2002.
[42] M. Olson, K. Bostic, and M. Seltzer, “Berkeley DB,” in Proceedings of the FREENIX
Track: 1999 USENIX Annual Technical Conference, pp. 183–192, 1999.
[44] S. Wallin and V. Leijon, "Telecom Network and Service Management: an Operator
Survey," in 12th IFIP/IEEE International Conference on Management of Multime-
dia and Mobile Networks and Services, 2009, submitted.
Part II
Paper A
Rethinking Network Management
Solutions
Authors:
Stefan Wallin and Viktor Leijon
© 2006, IEEE Computer Society
Rethinking Network Management Solutions
Abstract
This paper looks at network management from an overall perspective. We try to explain
what the current problems at big telecom operators are, and how things are changing
in the operators' environments. Finally, we present the major steps that need to be taken,
as seen by the authors. The problem statement has been worked out with input from
people in charge of large telecom network management centers. In order to cope with
current problems and improve quality and effectiveness, the major steps forward are:
(1) service-centric management, (2) dynamic management, (3) knowledge management,
(4) automation and correlation, and finally (5) managing network management interfaces.
The paper does not elaborate on any of the items in detail; rather, it presents an
outline of where to go.
1 Introduction
As with any technology, it’s important to focus management solutions on the users,
even when the users are those providing a service. In that broader context, network
management has three types of users: network operators, who must earn money from their
services; network service users (business and consumer), who pay for using services; and
network administrators, who staff the network operations center. All three user types
benefit from a well-thought-out management solution: operators increase their profits,
service users get better service, and administrators streamline their workload.
In short, the right network management solutions empower network operators to
provide new services, maintain service quality, and manage billing and usage [1].
By its nature, network management is a hierarchical, centralized function that puts
the operator in control; therefore it makes sense to provide a centralized network man-
agement solution. Operators are under pressure to reduce network operating costs and
provide new services at an increasing speed. These two requirements highlight the need
for an effective, automated, network-management solution.
To explore such a solution, we interviewed people in charge of large telecom network
management centers and identified six challenges facing big telecom operators:
Constant changes: New or upgraded devices and new services launch frequently.
Complex services structure: Services are vital for business and customer interaction,
but they are not really managed.
Cuts in operations costs: A small team must run a large, multifaceted network.
Difficult interface integration: Diverse equipment and support systems make man-
aging interface integration a challenge.
We then considered strategies for tackling each of these challenges and determined
several best practices.
2 Challenges
2.1 Excessive Alarms
The bulk of network administrators’ daily work involves alarms. Unfortunately, the large
number of alarms indicates that the systems produce many irrelevant and noncorrelated
alarms, making it hard to understand the true state of problems in the network.
Today’s alarms are more or less raw alarms from the different equipment and vendor-
specific management systems. Operators must establish an efficient organization to han-
dle the alarms, a process that typically follows three steps:
1. The first-line organization performs three tasks: check for alarms that indicate the
same problem, group the alarms and attach them to a trouble ticket, and distribute
problem information to affected parties, such as SLA customers and customer care.
2. If it’s a simple problem, the first line resolves it and closes the ticket.
3. If it’s a complex problem, the first line dispatches it to the second- and third-line
organizations. This might involve equipment vendors or operator staff in the field
who might perform onsite management, card replacement, and so on.
The ever-increasing number of systems and services increases the number of alarms.
Still, operators can’t afford to employ additional people to handle the alarm lists, and
automatic solutions are limited.
Automatic trouble ticketing, for example, manages the workflow from problem identi-
fication to problem solution, but its usefulness doesn’t extend to prioritizing the alarm’s
importance. Such knowledge is critical because an alarm’s context determines if it af-
fects services, customer SLAs, and the affected equipment’s state. The resource emitting
the alarm typically doesn’t know the context, so the network-management system must
supply it through alarm filtering and correlation.
Alarm-correlation projects are complex and not particularly successful. First, alarm
quality is insufficient. The information carried in the alarm messages is not good enough
Networks change. Network elements are upgraded, new services launch, and customers
come and go. These daily changes are a challenge for operators and network management
solutions.
Few operators have a fully controlled or automated process for handling these changes.
Moreover, the network organizations are introducing critical equipment into the network
without informing the network administrators. Surprises occur in the monitoring activi-
ties when unknown alarms and equipment suddenly appear. SLAs and business-critical
services are sold to enterprise customers but without corresponding support in the man-
agement solution to actually monitor the specific SLA or customer.
The dynamic nature of networks and services puts increasing focus on change man-
agement. The expected time for changes has dropped from months to hours. We see
operators and organizations realizing this and trying to reuse the change management
process from the Information Technology Infrastructure Library framework [3], a set of
best practices drawn from public and private sectors worldwide. Change management’s
goal is to ensure that standardized methods and procedures are used to efficiently and
promptly handle all changes, minimizing their impact on service quality and consequently
improving the organization's daily operations.
Topology management: network topology, service topology, and the mapping between
these.
Service management: formal but dynamic management of services, SLAs, and cus-
tomers, across all processes and systems.
Service centric integration and modeling: use of service types and instances as keys
in information systems, customer care systems, fault management systems, and so
on.
3 Ways to Improve
Given these problems and changes in the environment that will affect network manage-
ment in the future, we believe the next generation of network management solutions must
be based on principles different from the current solutions.
[Figure: example alarm scenario with events 1 and 2 at a customer site, assigned severities (Low, Medium) by systems A, B, and C.]
Language static class diagrams. We do not see UML classes as strong enough in express-
ing such items as semantics and interfaces. Also, CIM's model complexity and size have
exploded. Modeling every aspect as classes yields a huge model that will not cope with
the changing and dynamic nature of future services and networks.
The IETF made an attempt to take SNMP one step forward with the SMIng data
definition language [5], which makes SNMP MIBs capable of holding objects, structured
data types, and so on. SMIng has a pragmatic approach and would probably make a
significant difference in the short term, although it has not yet left any footprint in the
industry. Attempts from the IETF's Network Management Research Group have
the big advantage of being down to earth and well engineered. In the long run, however,
something more powerful is needed.
From a solution point of view, service-centric network components are emerging, for
instance Cramer or Granite. These product examples are signs that the field is moving
in the right direction. Topology must be a core component, however, and the current
solutions and tools do not handle the dynamic nature of the topology changes. Typical
implementations use an export, clean, merge, and load process to create an overall topol-
ogy database. To be fair, the fault is not with the topology tools themselves but with
the poor equipment interfaces [6].
Users will pay the broker, who will pay the operator. This business model will put even
more emphasis on how service providers express service capabilities and features.
• Find ways that will let operators integrate equipment more smoothly into their
overall management solution.
• Use dynamic approaches in interface technologies. Minimize the need for external
data.
• Filter and correlate alarms before sending them. Send problem-oriented alarm
states pinpointing the affected service rather than low-level symptoms.
We also see a strong need for improvements in modeling formalisms to express service
models and more dynamic semantic interface definitions. An even more important issue
is the quality of the models themselves, irrespective of the modeling formalism.
In many ways, network management problems have changed little since 1988, when
SNMP was introduced. There is still no sense of how to model management information
and no greater insight into which information is truly valuable to a management appli-
cation. Progress requires investigating fundamental modeling questions: What charac-
terizes a good model? Given a bunch of such models, what are the common structures,
design patterns, ways of thinking, aggregation models, and so on? And given common
denominators of good models, the problem becomes how to construct tools that let de-
velopers build such models easily. Is it even possible to develop a structured theory of
network management that truly starts small and builds on real-world knowledge?
Telecom network management solutions need to shift perspectives from one of net-
work element management to service management. Operators need a service view of their
network, with automatic service-impact correlation. This requires some major changes
in the underlying solutions: equipment vendors must improve the supplied management
interfaces and network management solutions must implement a higher degree of au-
tomation and correlation with a service focus. One obstacle is the lack of models and
formalisms to describe topology and service structures. We’re currently working to define
a formal service modeling approach to enable the service layer.
References
[1] TeleManagement Forum, “Business process framework (eTOM).” GB921, version 7.3,
July 2008.
[3] itSMF, “Service Operation.” Office of Government Commerce, ITIL Version 3 Pub-
lications, 2007.
[4] Distributed Management Task Force, “CIM Specification.” Version 2.15.0, 2007.
[6] R. State, O. Festor, and E. Nataf, “Managing Highly Dynamic Services Using Ex-
tended Temporal Network Information Models,” Journal of Network and Systems
Management, vol. 10, no. 2, pp. 195–209, 2002.
Authors:
Stefan Wallin and Viktor Leijon
Telecom Network and Service Management: an
Operator Survey
Abstract
1 Introduction
Network management research covers a wide range of different topics, and it is hard for
the individual researchers to prioritize between them. One factor to take into account is
the requirements emanating from the telecom industry.
In order to get an objective view of what the industry considers important we have
surveyed fifteen different companies, to gather their opinions on the current state of
network management systems, as well as their expectations on the future.
As far as we can tell, there have been no previous surveys of this type for telecom-
munications network management. The process control and power industry areas seem
to have a higher degree of industry feedback to the research [1, 2, 3], probably because
of the human safety risks involved.
The results of this survey have strategic value both for researchers and solution vendors.
It identifies areas where there is a strong need for further research and points to what
changes are needed in order to stay competitive.
The contributions of this paper are:
• We present survey results from fifteen different companies, with a total of over
100 million customers, covering the current state (Section 3) and most important
change drivers (Section 4).
• The respondents were then asked about their view on the future of OSS systems
(Section 5) and what they expected from the OSS research community (Section 7.1).
• We conclude with a discussion of the focus areas identified in the survey: service
topology and alarm quality in Section 7.2.
2 Method
We distributed the survey questions by e-mail to 20 operators of different sizes and on
different continents. The individuals were selected based on their roles as network man-
agement architects or managers. 15 out of 20 operators answered. The respondents are a
mix of fixed, broadband and mobile operators with a total customer base of over 100 mil-
lion subscribers, see Table 2.1. The operators were classified by number of subscribers into
the categories [< 10 M, < 20 M, < 100 M] to avoid giving out identifying information.
Some clarifying questions were sent over e-mail and a draft version of this paper was
sent to all operators that provided answers. All questions except one were essay questions
to avoid limiting the answers by our pre-conceived ideas. We have aggregated similar
answers into groups, often using eTOM processes as targets. The number of answers for
each alternative will not add up to the exact number of responding operators, since each
response typically mentioned several alternatives.
Increased focus on services and service quality was identified as the most important
factor behind changes in the OSS. In order to understand this subject better we asked
the operators to elaborate on how they viewed service management. One of the operators
summarized it in the following way:
Services are not currently managed well in any suite of applications and re-
quires a tremendous amount of work to maintain.
The competitive market is pushing operators to offer more and more innovative ser-
vices, including SLAs, which require the OSS solution to measure the service quality.
One operator described their experience with the two major alternatives for service
monitoring: either using probes or mapping existing events and alarms to a service model.
The latter approach failed since there was no good way to describe which alarms were
really critical. They made the decision to use only probing for future services, stressing
that future services will have service probing from the service deployment phase.
Some of the operators stressed the importance of standards for service models. The
problem with models is that services are rapidly changing, therefore requiring a large
amount of customization work. One operator expressed reservations about how detailed
services can be:
Time and money will not be available to [develop] sophisticated approaches
over a long period. Customers will have to accept limited quality assurance
and quality documentation. Service levels will always be high level, if [they
exist] at all.
Another operator commented on how the use of service models is evolving:
Service models are becoming more and more important: currently [they are]
not implemented in core processes but used as means to semi-document and
analyze when evaluating impact of faults, new services, etc.
As indicated by Figure 1, cost reduction is clearly another key factor. We asked the
operators to further break down the cost drivers and the results are shown in Figure 2.
The first two items can be considered two sides of the same coin: Integration costs
are high in OSS due to the introduction of new technologies and services. When a new
type of network element is deployed it needs to be integrated into the OSS solution and
while most solutions are based on well-established products, there is still a high degree
of customization needed to adapt the tools to user needs and processes.
In order to get a unified view of the solution, the resource-oriented interfaces need to
be mapped into an overall service model. Operators are struggling with this challenge,
the "information model development" in Figure 2. Finally, the OSS itself is expensive
due to the number of components that are needed. Even in the case where an OSS is
built using only one vendor, it is still made up of a portfolio of modules which add up to
a relatively costly software solution.
Returning to the change drivers (Figure 1) the next items are network growth and
increased focus on customer satisfaction. While network growth is inherent in network
management, customer satisfaction has not historically been one of the primary goals for OSS
solutions.
To put the true business role of the OSS solution in focus, we asked if it was seen as
a competitive tool or not. The answers were divided into two general streams:
• Yes. Motivations are the desire to decrease time-to-repair, decrease OPEX, and im-
prove customer satisfaction by quicker response to new requirements and customer
complaints. Two thirds of the responses fall into this category.
• No. These operators felt that the OSS will be outsourced. The outsourcing scenario was
partly motivated by internal failures and a desire to give the problem to someone
else.
Managing services must be the focus of the future development, while pushing
network management into a supporting role, [...] service models [should be]
6.1 Standards
The attitude towards standards was not very enthusiastic: "They are too complicated
and are actually adding to the cost of ownership"; in this case the 3GPP Alarm Interface
was the main source of concern. Another operator had a similar distrust of standards:
“[In] alarm integration to the OSS, most of the vendors do not follow any one. We are
pushing [our] internal standard to have useful alarm information for the end users”. Some
operators mentioned SNMP as a working protocol that is easy to integrate; however, the
lack of standard OSS interface MIBs is a problem, and the vendor MIBs vary in quality.
As important areas for future standardization efforts they mentioned “interfaces, data
and semantic models, standardization of procedures” and “well defined top level com-
mon framework and common languages”. We see from these comments that the current
practice of using different protocols for different interfaces and having weak formal data
models is a problem for OSS integrations. There is no accepted overall common frame-
work which would enable unified naming of resources.
Surprisingly, none of the operators mentioned OSS/J [4]. On the other hand several
operators considered the eTOM and ITIL [5] process standards to have real practical
value. They used these process frameworks to structure the work and make it more
efficient.
7 Discussion
7.1 OSS research
Historically, a lot of research has gone into alarm correlation. Looking at the current
correlation rates, it is not clear how successful these projects have been; many correlation
projects face challenges in capturing operator knowledge and transforming it
into rules. Other methods for finding rules based on rule discovery, knowledge manage-
ment, data-mining and self-learning networks are interesting, but seem to require further
investigation to be of practical use.
Having service models in place would be a key to unlocking many other functions.
The current industry practice of large UML models with loose structure does not seem
able to cope with the requirements; more modular and semantically rich ways of doing
service modeling are required.
Another basic area is that of formal interface definitions. The current integration costs
cannot be justified when we consider the fundamentally simple nature of the information
that flows over the interfaces; integrating alarms should not be a complex and costly task.
Semantically richer interface definitions would probably improve the situation, but this
requires a focus on the semantics of the model and not only the syntax, protocol and
software architecture.
Boutaba and Xiao [6] point to the following major enablers for future OSS systems:
• Policy-based network management [7], [8]
• Distributed computing [9], [10]
• Mobile agents [11], [12]
• Web techniques [13]
• Java [14]
These are topics that we see in many network management efforts. However, we see
little correlation between these and the topics identified in this survey. Furthermore, this partly
illustrates the research community's focus on software architectures rather than management.
operators to maintain a centralized view of the topology which will be the future OSS
foundation. Note that this improvement has to come without significant additional inte-
gration costs or complexity.
Moving forward in the direction identified in this survey will fail if there is no way to
share the information model between OSS systems. The feedback we get from operators
who apply SID [15] is that it works well as a design base for the OSS system but not
as a service model to maintain the dynamic business services. Service modeling must
be dynamic and have well-defined semantics; the current practice of static models and
informal documents does not cope with a changing environment.
While OSS solutions have primarily been network oriented, they now need to change
focus to customer care, since operators see a huge possibility to reduce costs through
automated customer care. The number of employees in mobile network customer care
greatly outnumbers the OSS staff; therefore, solutions that help automate customer care
activities will be a priority in the coming years.
Operators will look for automated provisioning solutions including self-configuration
of the network elements which helps avoid tedious parameter setting. The move from
fixed-line services to broadband consumer services stresses this further since customers
need to be able to buy, configure and troubleshoot their services with minimal support.
While we see these changes coming we need to realize that the integration of Telecom,
IT and IP has not yet happened. Some of the operators have a somewhat naïve vision
of a future when network administrators will only look at service views and SLA status.
None of our respondents reported positive results regarding the deployment of stan-
dards. Over the years we have seen great efforts to move from one protocol to another,
from OSI-based solutions to CORBA and now Web Services. This journey is based on
a desire to find a technology solution to an information problem; unfortunately, OSS
integration standards like OSS/J have not yet proven their cost-effectiveness.
Alarm quality and alarm correlation are still underdeveloped areas. Although research
and industry initiatives go back decades [16], the current alarm flow at a standard network
operations centre is fairly basic and often of low quality.
We did not get any real numbers on filtering and correlation rates, but the informal
indications pointed to a very low success rate, which is consistent with what is reported
by Stanton [17]. In many cases alarm messages go untransformed, unfiltered, and un-
correlated from the network elements to the network management system, which leads
to a chaotic situation that needs to be cleaned up before we can move into service and
customer management. A representative answer was:
Some operators chose to completely ignore alarm correlation; they considered it too
expensive and complex to get good results. These respondents instead pointed to probing,
statistics, and performance-based solutions to get an overall picture rather than trying
to automate root-cause analysis. It was also stressed that advanced alarm correlation
projects are in many cases signs of bad alarm quality from the low-level systems.
Finally, we let an operator conclude this survey by pointing to ongoing challenging
OSS improvements:
8 Conclusion
We hope that this survey can form the basis for prioritizing among research topics.
The most important conclusion is probably that there is great potential to further
network management research by working more closely with service providers. There is a gap
between the current research efforts, which typically focus on new software architectures
and protocols, and the telecom companies, which have other priorities.
It is worth noting that after decades of research, alarm correlation is still the most
prioritized research area. This can partially be interpreted as a failure, since no solution
seems to be ready. Instead we see a new set of research challenges emerging, connected
to self-healing, service activation and provisioning.
If research is to support the future focus areas for service providers, we need to find
solutions for service and quality management. Another observation is the failure of alarms
as an indicator of service status, where we see a trend towards probe-based solutions.
The operators gave a clear message on their desire to move from network and resource
management towards customer and service management solutions. This comes as no
surprise, as the trend has been clear for some time, but the path there needs attention.
A new brand of OSS solutions, based on the service life-cycle rather than separate
OSS components for different processes, is needed.
References
[1] M. Bransby and J. Jenkinson, “The Management of Alarm Systems,” HSE Contract
Research Report, vol. 166, 1998.
[2] N. Stanton, Human Factors in Alarm Design. Taylor & Francis, 1994.
Authors:
Stefan Wallin and Viktor Leijon
© 2007, IEEE
Multipurpose Models for QoS Monitoring
Abstract
Telecom operators face an increasing need for service quality management to cope with
competition and complex service portfolios in the mobile sector. Improvements in this
area can lead to significant market benefits for operators in highly competitive markets.
We propose an architecture for a service monitoring tool, including a time aware formal
language for model specification. Using these models allows for increased predictability
and flexibility in a constantly changing environment.
1 Introduction
Service operators face many new challenges in network management [1]. Among the
most important trends is the increasing move towards service centric management. It is
necessary for an operator to deliver predictable Quality of Service.
There are currently few useful solutions for dealing with QoS for a large number of
service types, service instances and users.
[Figure 1: The overall QoS problem: a QoS service model relates network QoS and service QoS to the service delivery organisation and its customer-facing processes.]
Figure 1 illustrates the overall QoS problem. To get a good general measure of QoS,
multiple parameters need to be taken into account. It is fairly easy to have a state for a
single service at a given time for one customer. But a large telecom operator needs to have
an overall picture with support for different views. Different service views can include
geographical, customer-based, and SLA-oriented views. With such service views an operator can
connect technical service quality with business goals.
We are not trying to define individual QoS parameters for specific network technolo-
gies or services. Neither are we trying to define “good service quality”. These areas
are already well covered by individual standards, research and products. Instead, we are
trying to provide a general capability for building large-scale, multi-purpose service models.
This paper defines the following components for a service management system:
• Using this definition in Section 2.2 we examine what kinds of scenarios a service
monitoring tool must be able to handle.
• The outline for the architecture of a service monitoring system is given in Section 3.
2 Overview
2.1 Quality Of Service
There are several different interpretations of what Quality of Service means, each with
different scopes.
In the IETF the term Quality Of Service typically refers to technologies to achieve
“good quality”. QoS in this sense is a means to prioritize network traffic, manage band-
width and congestion, and to help ensure that the highest priority data gets through the
network as quickly as possible. The IETF uses the following definition of QoS [2]: “A set
of service requirements to be met by the network while transporting a flow”. Important
solutions from this domain are IntServ [3], DiffServ [4], MPLS [5] and Policy solutions
[6]. These are all important QoS solutions to achieve good service quality with Internet
protocols. Similar solutions exist for other domains like fixed networks, 2G/3G networks
[7] and specific services such as VoIP, customer care processes, etc. These QoS techniques
are aimed at controlling and monitoring QoS in an objective way.
The ITU uses a more end-user focused definition of the term [8]: “the collective effect
of service performance which determines the degree of satisfaction of a user of the service”.
This definition goes one step further in that it includes the subjective/perceived
measurements of an end-user. Casas [9] presents a method to measure perceived Quality
Of Service and the relationship between subjective and objective measurements.
The relationship between service quality and customer loyalty in general is discussed
by Bloemer [8]. He also enhances the quality of service definition into a third level:
SLAs: service providers want to sell SLAs where they promise a certain service quality.
These promises should be expressed in a formal way that can be measured. SLAs
often include several kinds of QoS parameters, ranging from technical parameters
like jitter and delay to more indirect parameters like customer care or process
metrics.
Service quality: service providers are often overwhelmed with individual QoS param-
eters, but these are not integrated into an overall service view. It is hard to get
information about current, past and future service quality.
Dynamic networks: services are carried over different networks like xDSL, WLAN, 3G.
In order to support seamless roaming and still keep the perceived service quality
we need calculations which span several different domains.
QoS definition framework: new applications and services are constantly being de-
fined. Each of these new services will have its own specific QoS parameters. There
is a need for a common understanding of what these parameters are and what
they mean.
For a more in-depth coverage of the motivations behind an overall solution for monitoring
QoS see for instance Espvik [9] or Räisänen [10].
We conclude that the definition of QoS must be neutral and generic. It must encompass
statements such as:
– What is the quality of the Swedish GSM network?
– What was the quality of the Swedish GSM network last Christmas Eve?
– What is the quality of all the services provided to customer B?
– How did a failure on a specific base station affect the GSM service in Stockholm at 10.00 AM yesterday?
Considering these requirements has led us to adopt the following simple definition:
These functions and the interpretation of the parameters are part of a specific model,
and formulated by experts on the particular service. In some sense the function should
represent the “degree of conformance of the service delivered to a user by a provider in
accordance with an agreement between them” [11].
link, for example, what would the effect on customer satisfaction be? What would the
financial effect be like?
A modeling language with sufficient power to express the complex models needed for
the use-cases. This language must be able both to express the way that parameters
are computed from each other and to express the structure of the service model.
An analytic engine which can execute the modeling language and compute the values
of all the parameters. This engine needs to have the appropriate interfaces, and be
parallelizable so that it can be implemented in a scalable and fault-tolerant fashion.
Information visualization interfaces to extract and present the relevant data
from the analytic engine. This can be accomplished by a combination of integration
with report generators and a general, data driven interface.
It is imperative that the system has enough power to express the models, and that it
is simple enough that integrating it into the support systems of an operator is feasible.
This forces us to make a design trade-off between power and flexibility. There are already languages and systems for creating models, such as Modelica [14], which is targeted at modeling physical systems. Our scope is different, but we note that, just as in models of the physical world, the concept of time will be very important in our system.
class Cell:
    input errCount
    errRate = (errCount@NOW - errCount@(NOW-10m))/10
    linkErrors = sum l.errors (l in link)

class CellSL:
    properties maxOwnErrors, maxLinkErrors
    status = worst errStatus linkStatus
    errStatus = linearThreshold cell.errRate 0 maxOwnErrors
    linkStatus = simpleThreshold cell.linkErrors maxLinkErrors
• Due to the nature of service modeling, the programming language must be able to
treat time as an integral part of the syntax: all variables are seen as arrays, indexed
by a time stamp.
• It is possible to use the time-index syntax to retrospectively change the value of
variables.
• List comprehension and an extensive set of built-in functions provide the power
needed to express complex models.
To make the language more concrete we present a simplified example taken from a
model of a cellular network. The first class in Figure 2, Cell, defines which properties we associate with a cell and that it has a single measurement input (errCount). The definitions state how the parameters errRate and linkErrors should be computed from
other parameters. Note the @ sign, which is used to indicate access to a time indexed
value. The second definition, CellSL, defines a service level - a promise on the behavior
of the underlying component - which encompasses the cell. It uses the parameters from
the cell to form a view of the component. In effect it says: “If we apply a service level to this cell, this is the status it would have.”
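As a complement to the example in Figure 2, the following minimal Python sketch illustrates the semantics it relies on: a time-indexed error counter, a rate computed over a 10-minute window, and a service-level status derived from thresholds. The helper names and the exact threshold semantics are assumptions made for illustration; they are not part of the SALmon definition.

from bisect import bisect_right

def value_at(series, t):
    # Latest value in a time-ordered list of (timestamp, value) pairs at or before t.
    i = bisect_right([ts for ts, _ in series], t)
    return series[i - 1][1] if i else 0

def err_rate(err_count_series, now, window_min=10):
    # Mirrors (errCount@NOW - errCount@(NOW-10m))/10; timestamps assumed to be in seconds.
    return (value_at(err_count_series, now)
            - value_at(err_count_series, now - window_min * 60)) / window_min

def linear_threshold(value, low, high):
    # Map a value onto 0..1 between the low and high thresholds (assumed semantics).
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)

def cell_sl_status(err_count_series, link_errors, now, max_own_errors, max_link_errors):
    # 'worst' of the own-error status and the link-error status, as in CellSL.
    err_status = linear_threshold(err_rate(err_count_series, now), 0, max_own_errors)
    link_status = 1.0 if link_errors > max_link_errors else 0.0  # simpleThreshold
    return max(err_status, link_status)  # worst = highest degradation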
Note that the service level is parametric with regard to the number of errors allowed.
We provide the specialization BronzeCellSL, which gives specific values.
Finally we define the relationships between cells and links, and between cell service levels and cells. Defining a relationship establishes implicit attributes: Class1 <=>* Class2 gives an implicit class2 attribute in Class1 (list-valued) and a class1 attribute in Class2.
Instantiation of the classes is separated from their definition. The example in Figure 3 shows the naïve case where objects are created one by one.
4 Related Work
Previous efforts on QoS monitoring have focused primarily on frameworks, which we explore in Section 4.1.1. There has also been some work on modeling, which we present in Section 4.2.
Control frameworks that actually try to manage resources in order to achieve a QoS level.
End-to-end QoS monitoring: trying to measure and monitor the overall quality of service.
Flow [21], Managed Objects SLM [22]. These products are quite successful in collecting
events and measurements and monitoring SLAs. Most of the tools have weaknesses in the service modeling area: they use various UML flavors or simple object modeling techniques which allow for very little static analysis, for instance for determining dependency graphs to facilitate lazy computation of parameters. Our work aims to
improve service modeling and computational aspects such as time based calculations,
and maintaining the state of a massive number of services and users.
References
[1] S. Wallin and V. Leijon, “Rethinking network management solutions,” IT Profes-
sional, vol. 8, no. 6, pp. 19–23, 2006.
[4] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, “An Architecture
for Differentiated Service.” RFC 2475 (Informational), Dec. 1998. Updated by RFC
3260.
[7] 3GPP, “Quality Of Service (QoS) Concept and Architecture.” 3GPP TS 23.107,
version 5.13, 2004.
[8] J. Bloemer, K. de Ruyter, and M. Wetzels, “Linking perceived service quality and
service loyalty: a multi-dimensional perspective,” European Journal of Marketing,
vol. 33, no. 11, pp. 1082–1106, 1999.
Authors:
Stefan Wallin, Viktor Leijon, and Leif Landen
© 2009, Springer.
Statistical Analysis and Prioritization of Telecom
Alarms Using Neural Networks
Abstract
Telecom Service Providers are faced with an overwhelming flow of alarms, which
makes good alarm classification and prioritization very important.
This paper first provides statistical analysis of data collected from a real-world alarm
flow and then presents a quantitative characterization of the alarm situation. Using data
from the trouble ticketing system as a reference, we examine the relationship between
the original alarm severity and the human perception of them.
Using this knowledge of alarm flow properties and trouble ticketing information, we
suggest a neural network-based approach for alarm classification. Tests using live data
show that our prototype assigns the same severity as a human expert in 50% of all cases,
compared to 17% for a naı̈ve approach.
1 Introduction
A medium-sized telecom network operations center receives several hundred thousand
alarms per day. This volume of alarms creates severe challenges for the operations staff.
Fundamental questions that need answers in order to improve the state of affairs are:
While extensive research efforts are focused on alarm correlation [1], the target for the
work presented in this paper is filtering and prioritization of alarms.
Although all alarm systems support advanced filtering mechanisms, the problem is
defining the filtering rules. Being able to filter out a high percentage of alarms would
increase efficiency of the network management center since network administrators would
only have to work with relevant problems.
Because there is such a high volume of alarms and tickets this kind of filtering and
prioritization is of vital importance if operators are to determine which alarms are most
critical to resolve [2]. Today, prioritization of alarms and trouble tickets is largely per-
formed manually by network administrators who use a combination of their experience
and support systems such as inventory and SLA management systems to determine the
priority of an alarm. This manual process makes the organization dependent on a few
individual experts [3]. Furthermore, the priority information is typically only available
in the trouble ticket system and not in the alarm system.
Two hypotheses are studied in this paper: that statistic analysis techniques can be
used to find alarm filtering strategies and that a learning neural network could suggest
relevant priorities by capturing network administrators’ knowledge.
We start by describing the inner workings of a telecom alarm flow (Section 2) and then explain how our data was extracted from the database (Section 3).
This paper takes four steps towards automatic alarm prioritization:
• We present some statistical properties of a real world alarm flow taken from a
mobile service provider (Section 4).
• We show important properties, such as that 11% of all alarms belong to easily identifiable classes which never give rise to operator actions, and that over 82% belong to classes where fewer than one alarm in a thousand generates an action.
• Using statistical analysis we show that the neural network performs significantly
better than a naı̈ve but realistic alternative.
• Associate the alarms with a trouble ticket to manage the problem resolution process.
Alarms are refined and distributed from the detection point in individual network
elements, such as base stations, via subnetwork managers up to the overall integrated
network management systems. Various interface technologies and models for alarms are
used across these interfaces. X.733 [4] is the de facto standard for alarm interfaces and
all later standard efforts are based on X.733 to some degree. It contains basic definitions
of parameters in alarm notifications.
The 3GPP Alarm IRP [5] defines alarms using a state focused definition. Furthermore,
it models operator actions that can change the state of an alarm. The state where the
alarm is acknowledged and cleared is the final state, and the life cycle of the alarm ends.
An X.733 compliant alarm has the following main attributes.
Managed Object: identification of the faulty resource. It points not only to the net-
work element but also down to the individual logical or physical component.
Event Type: category of the alarm (communications, quality of service, processing er-
ror, equipment alarm, environmental alarm).
Probable Cause: cause of the alarm, where the values are defined in various standards.
Event Time: the time of the last event referring to this alarm.
For the purpose of this study, we have defined a Finite State Machine for alarms as
shown in Figure 1. It is a simplification and abstraction of major standards and typical
telecom management systems.
From the resource point of view, the main events are new and clear, which moves
the alarm into the active or cleared state. Note, however, that the cleared state does
not imply that the network administrator considers the problem solved. This is managed
by the trouble ticket process. In order to manage the problem, the user acknowledges
and associates a ticket with the alarm. We will refer to this as “handling” the alarm.
A trouble ticket contains information such as priority, affected services, and responsible
work group. The mobile operator we studied used a priority in the trouble ticket system
ranging from 1 to 6. Priority 1 is the most urgent and indicates a problem that needs
to be resolved within hours, whereas priority 6 has no deadline. When the problem is
solved, the administrator closes the trouble ticket. The life cycle of an alarm ends when a
user decides that the alarm needs no further attention. This is indicated with end in the
above state diagram. In many cases, this is automatically performed when the associated
trouble ticket is closed. There is a short cut to bypass the trouble ticket process in order
to end alarms that do not represent real problems.
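To make the life cycle concrete, the following Python sketch captures the states and events described above (new, clear, handle, end). It is a deliberately simplified illustration of the state machine in Figure 1, not the operator's actual implementation.

class Alarm:
    # Minimal sketch of the alarm life cycle: active -> cleared/handled -> ended.
    def __init__(self):
        self.state = "active"  # entered on the 'new' event from the resource
        self.ticket = None

    def clear(self):
        # The resource reports the condition gone; this does not mean the problem is solved.
        if self.state == "active":
            self.state = "cleared"

    def handle(self, ticket_id):
        # The user acknowledges the alarm and associates a trouble ticket ("handling").
        self.ticket = ticket_id
        self.state = "handled"

    def end(self):
        # The life cycle ends: the ticket is closed, or the short cut is used for non-problems.
        self.state = "ended"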
Whether an alarm notification should be considered a new or changed alarm is a
topic of its own. According to X.733, a notification with the same managed object, event
type, probable cause and specific problem is considered to change an existing alarm. The
last three parameters identify the type of alarm, and we will refer to the triple <Event type, Probable cause, Specific problem> as the “alarm type”. We will refer to alarms
with the same managed object and alarm type as “associated alarms”, and they will be
grouped together under one main alarm, the changed alarm in the state diagram.
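A hedged sketch of how this grouping could be realized in practice: notifications are keyed on the managed object plus the alarm-type triple, so that later notifications with the same key update one main alarm instead of creating a new one. The field names are illustrative assumptions.

from collections import defaultdict

def alarm_key(notification):
    # X.733-style identity: managed object plus the alarm-type triple.
    return (notification["managed_object"], notification["event_type"],
            notification["probable_cause"], notification["specific_problem"])

main_alarms = {}                      # key -> main alarm record
associated_count = defaultdict(int)   # key -> number of associated notifications

def receive(notification):
    key = alarm_key(notification)
    associated_count[key] += 1
    if key in main_alarms:
        # Same managed object and alarm type: change the existing (main) alarm.
        main_alarms[key]["event_time"] = notification["event_time"]
        main_alarms[key]["severity"] = notification.get("severity")
    else:
        main_alarms[key] = dict(notification)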
The X.733 definition of alarm type has created various vendor-specific mechanisms
since the parameters are static and defined by standards. In real life, a vendor needs to
be able to add new alarm types in a deployed system. The approaches to circumvent this differ between vendors: some use their own non-standardized probable cause values, and some use free-text specific problem fields. A second problem is how to identify the managed object; different protocols and vendors use different naming schemes. The
actual resolution varies from vendor to vendor. These two fundamental problems create
major challenges for alarm systems and alarm analysis.
In the described alarm flow, network administrators need to answer two important
questions: Do I need to handle this alarm? What is its priority? The operator we studied
primarily uses the event time, managed object, and alarm type attributes in combination
with their own experience and lookups in support systems to judge the alarm relevance.
2. {A1, TT2}: Alarms and corresponding tickets (3,583,112 main alarms, 16,150,788 associated alarms, and 90,091 trouble tickets)
The alarms were from approximately 50 different types of equipment from many
different vendors.
The records in the trouble ticket database contain the most important fields from the
alarms such as managed object, specific problem and additional text. In this way the
trouble ticket database alone is sufficient to perform analysis and neural network training
on alarms with associated tickets. This is the case for {TT1}, which was used for training
and testing of the priority network, as described in Section 5.
In order to perform full statistical analysis of the alarms as well as to train the neural
network to judge if the alarm should be handled or not, we extracted a second data set
{A1, TT2}.
Severity        Frequency
Indeterminate        0.1%
Critical            17.5%
Major               22.8%
Minor                5.0%
Warning             54.6%
This section studies the relationship between priorities and severities, and refers to data
from data set TT2 (see Section 2).
The number of alarms associated with a trouble ticket varies from 1 to 1161, an
average of 5.2 with a standard deviation of 17. This is an indication that the relationship
between alarms and tickets is complex. In an ideal world, alarms should indicate a
problem and not individual symptoms. If this were true, the fan-out between tickets and
alarms would have been much lower.
In Figure 4 we can see that the spread of alarm severities associated with a single trouble ticket is low. If, for example, a trouble ticket is associated with both a major (2) and a warning (4) alarm, the difference is two steps, as illustrated in Figure 4. The associated alarms have the same severity in more than 50% of the cases.
Discussions with the network administrators led to the hypothesis that if we looked
at the maximum alarm severity associated with the trouble ticket, we would get good
correlation. Figure 5 shows the analysis of this assumption. It clearly illustrates that we
do not have a mapping between alarm severity and corresponding priority. For example,
we see that priority 4 is distributed across all severities, and it is largest for the warning and critical severities. For further discussion on how badly severities and priorities correlate, see Section 5.4.
After studying the weak correlation between alarm severity and ticket priority, we
looked for another correlation: the alarm type versus priority. We observe a strong
correlation between some alarm types and their priority, but for other alarm types there
is no correlation at all. Thus there is no direct, naïve algorithm that maps alarm information to a priority.
• Maintenance of service models: formal models of network and service topology have
to be maintained, which is complex and costly. Also, the change rate of network
topology and service structures is challenging to handle.
• Maintenance of impact rules: correlation rules are typically expressed using Rete-
based expert systems [9], which require extensive programming.
• Capturing operators' knowledge: to write the rules for the expert system, the developers need input from the network administrators. However, experienced operators are often critical resources in the organization and cannot be allocated time to formalize rules.
The goal of our work is to set priorities in alarms at the time of reception by using
neural networks. We let the neural network learn from the experienced network administrators rather than building complex correlation rules and service models. As stated by [10]:
The alarm management process can be viewed as a two-step procedure; see Figure 6. First, is the alarm important enough to create a trouble ticket? Second, if we create the trouble ticket, what is its priority? We focus on the latter question in this paper.
Deciding whether an alarm deserves to be handled at all seems to be a more complicated question; based on some preliminary tests, information outside of the raw alarms is probably used in that decision process.
We have integrated a neural network architecture into an alarm and trouble ticket
system. The neural network uses the manually assigned trouble ticket priorities and asso-
ciated alarms as learning data. When the alarm system receives an alarm, it interrogates
the trained neural network which generates a suggested priority to be put into the alarm
information. The priority indicates if the alarm should be handled or not and, if so, the
suggested priority.
The A1 data set was used to train the network to judge if a trouble ticket should be
created. This works since the alarm database contains a field for each alarm telling if it
has a trouble ticket or not. The T T1 data set was used to train the network to assign
priorities to the alarms. The trouble ticket database contains all relevant attributes of
the associated alarms.
We used the following alarm fields to construct the input to the neural network:
• Associated alarms count: the number of alarm notifications referring to the same
alarm.
The selection of the above alarm attributes was based on discussions with network
administrators to find the most significant attributes used for manual correlation and
prioritization. Additional text and Specific problem are encoded using the soundex
algorithm to remove the influence of numbers and convert the strings to equal length.
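As an illustration, a simplified Soundex variant in Python could look as follows. Digits and punctuation are stripped first so that, for example, numbered ports do not yield distinct codes, and the result is padded or truncated to a fixed length. The exact preprocessing used in the prototype is not specified here, so treat this as an assumption.

SOUNDEX_CODES = {c: d for d, letters in
                 {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                  "4": "L", "5": "MN", "6": "R"}.items()
                 for c in letters}

def soundex(text, length=4):
    # Simplified Soundex: keep the first letter, encode the rest as digits, fixed length.
    letters = [c for c in text.upper() if c.isalpha()]
    if not letters:
        return "0" * length
    code, prev = letters[0], SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        prev = digit
    return (code + "0" * length)[:length]

# Example: soundex("LINK FAILURE 4711") == soundex("LINK FAILURE 4712")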
We used an open source neural network engine named libF2N2 [11]. The libF2N2 library uses the linear activation function
f(x) = x
for the input layer and the logistic function
f(x) = 1/(1 + e^(-x))
for all successive layers. The neural network uses back-propagation [12] as the learning mechanism. It only supports iterative back-propagation, not batch back-propagation.
Two variables play a special role during learning: the learning rate and the momentum
of the neural network. Learning rate indicates what portion of the error should be
considered when updating the weights of the network. Momentum indicates how much of the previous update should be reused in the current change. Momentum is an effective way to move a network towards a good generalization, but it can also be a problem if the momentum moves us away from the optimum weights.
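The roles of the two parameters can be summarized by the standard weight-update rule for back-propagation with momentum. The snippet below is a generic sketch of that rule, not libF2N2's actual code.

def update_weight(weight, gradient, prev_delta, learning_rate=0.03, momentum=0.2):
    # learning_rate scales the error gradient; momentum reuses part of the previous update.
    delta = -learning_rate * gradient + momentum * prev_delta
    return weight + delta, delta  # new weight and the delta to reuse next step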
Test  Epochs  L  N    O  LR    M    Tr E    Te E
1     1200    3  200  6  0.01  0.3   3.1%   18.1%
2      680    3  100  6  0.03  0.3   2.4%   17.7%
3a     100    4   50  6  0.03  0.2   6.1%   16.0%
3b    1000    4   50  6  0.03  0.2   4.7%   16.9%
4      300    3   70  1  0.05  0.2  31.8%   12.8%
5     1000    2  100  6  0.05  0.2   3.1%   30.0%

(N = neurons, O = output neurons, LR = learning rate, M = momentum, Tr E = Training Error, Te E = Testing Error)
learn to prioritize. Adding more neurons does not make the neural network prioritize
better. Notice how the error drops when decreasing the number of neurons between Test
1 and Test 2. Although the Testing Error drops considerably when using a single output neuron (Test 4), the high Training Error makes us reluctant to use it.
In Test 3 we can see that we do not necessarily get a better result from longer training.
More work is needed to find a suitable criterion for when to stop training. There is no direct correlation between Training Error and Testing Error: the high Training Error in Test 4 is accompanied by a low Testing Error, and in Test 5 we have the opposite relation.
Figure 7 shows how mean square error descends during training for test 1. All tests,
except test 4, produced similar graphs.
5.4 Results
Having identified approach 3b, see Table 4.3, as the most promising one, we decided to
statistically evaluate the success of this algorithm. For all the confidence intervals below
we have used a standard t-test.
We compared the neural network to four competing approaches:
2. The trivial severity approach, simply scaling the severity into the priority by mul-
tiplying the severity by 1.25.
3. Always selecting the statically optimal priority optimizing for “least average error”,
computed from the reference. The statically optimal choice turned out to be 3.
4. Selecting the priority at random for each alarm, but using the same distribution of
priorities as the reference.
The average errors and variances for the methods, together with the 99% confidence
interval on their mean errors are given in Table 4.4.
We have plotted the relative frequency of the errors in Figure 8. A negative error
errs on the side of caution, judging an alarm to be more serious than it actually is. A
positive error on the other hand is an underestimation of the importance of the alarm.
For the purpose of this study we consider both types of error to be equally bad. We can
see that while the neural network is the only method centered around zero, the others
generally overestimate the seriousness of the alarms slightly.
Having seen what appears to be a distinctive advantage for the neural networks, we
apply t-tests to try to determine how the effect of the other methods compares to A. We
did this by comparing the error pairwise for each alarm in the test set and computing
the 99% confidence intervals. The results are presented in the final column of Table 4.4.
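The statistical comparison can be reproduced with a standard paired t-procedure: for each alarm, take the difference between the absolute error of a competing method and that of the neural network, and compute a 99% confidence interval for the mean difference. The sketch below uses SciPy and assumes the per-alarm errors are available as equal-length lists; it is not the authors' evaluation code.

import numpy as np
from scipy import stats

def paired_error_ci(errors_nn, errors_other, confidence=0.99):
    # Positive interval bounds indicate an advantage for the neural network.
    diff = np.abs(np.array(errors_other)) - np.abs(np.array(errors_nn))
    n = len(diff)
    mean, se = diff.mean(), diff.std(ddof=1) / np.sqrt(n)
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean - t_crit * se, mean + t_crit * se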
It is spectacular how well the “always pick priority 3”-method does. This is because
it guesses in the middle, so it is often wrong, but usually just a single step. This rule is,
of course, worthless from a prioritization point of view.
The comparisons are all to the advantage of the neural network. It has a statistically significant advantage over the other methods. This suggests a distinct advantage of using
neural networks for alarm prioritization. This becomes an even stronger conclusion when
one considers the distribution of errors in Figure 8.
6 Related Work
Data mining, or knowledge discovery in databases, is being used in different domains such
as finance, marketing, fraud detection, manufacturing and network management [13].
The major categories of machine learning used are rule induction, neural networks, case-
based reasoning, genetic algorithms, and inductive logic programming [6]. The following
problem types are typically addressed:
• Classification: the training data is used to find classes of the data set. New data
can then be analyzed and categorized. This is the main theme of our work, where
we are looking for ticket priorities based on the alarm data.
• Prediction: this focuses on finding possible current and future values such as finding
and forecasting faults in a telecommunication network. This is covered well by
related research efforts.
• Detection: focuses on finding irregularities in the data and seeks to explain the
cause. Within the telecom sector a common application is to detect churn and
fraud.
Bose et al. [6] performed a study to analyze the usage of data mining techniques
across domains and problem types. They found that 7% of the usage was in the telecom
sector and almost evenly spread among classification, prediction and detection.
Gardner et al. [15] illustrate the classification category. They use a self-organizing map, a Kohonen network [16], to categorize alarms. In contrast with conventional ANNs, a self-organizing map does not require a correct output as training data. The
primary application is analysis and classification of input where unknown data clusters
may exist. The network is in a sense self-learning. This is in contrast to our prioritization
scenario where we have a complete output definition.
Most research efforts related to alarm handling focus on correlation [17, 18, 19, 20, 10].
This mainly falls into the prediction and detection categories above. Alarm correlation
refers to the process of grouping alarms that have a reciprocal relationship [21]. The
aim is “the determination of the cause” [8]. Wietgrefe et al. [22] use neural networks to perform the correlation. Common to these efforts is that they look at the stream of alarms and try to find the root cause or the triggering alarm. For example, in the
Wietgrefe study, the learning process is fed with alarms as inputs and the triggering
alarm as output.
We are not trying to find the root cause of the alarms, nor to group them. Our problem is in some sense a simpler, but overlooked, one: to prioritize the individual alarms. The main input of our analysis is the manual alarm prioritization in
trouble tickets along with the alarms. Previous efforts have mostly focused on the alarm
databases themselves. The use of the trouble ticket database in the learning process
makes the solution adapt to expert knowledge. Training a network with root cause
alarms is a fundamental challenge since it is hard to find a true output set.
Sasisekharan et al. [23] combine statistical methods and machine learning techniques
to identify “patterns of chronic problems in telecom networks”. Network behavior, di-
agnostic data, and topology are used as input in the solution. This solution covers the
challenging aspect of problem prediction. It is more focused on data-mining techniques
and uses topology as input. It shows a strength in combining several approaches.
Levy and Chillarege [24] also combine data mining and machine learning in an implementation of an alarm system. They come to the same conclusion as we do regarding the Pareto
distribution of the alarms: “there is a lot of value in the ability to get an early warning
on the 10% of causes that create 90% of field failures”. This further reinforces the foundation for the work presented in this paper, where we focus on pinpointing the relevant alarms to be handled.
Klemettinen [25] presents a complementary solution to alarm correlation, using semi-automatic recognition of patterns in alarm databases. The output is a suggestion of rules that users can navigate and understand. This is in contrast with neural network solutions like ours, where there is no explanation to users of why the network suggests that an alarm should be handled, or why it suggests a particular priority.
Further, we have provided some characterization of the alarm flow, showing the long-tail behavior of alarm types. This opens an opportunity for further study of how to exploit this behavior from two sides: alarm filtering at the “non-important” end and alarm automation and enrichment at the “important” end.
The work presented in this paper was run using a historic database of alarms and
trouble tickets. The next step is to deploy the test configuration in a running system to
study how it adapts continuously. We will also apply the tests using data from a different
operator.
References
[1] M. Steinder and A. S. Sethi, “A survey of fault localization techniques in computer
networks,” Science of Computer Programming, vol. 53, no. 2, pp. 165–194, 2004.
[2] J. Wilkinson and D. Lucas, “Better alarm handling: a practical application of human factors,” Measurement and Control, vol. 35, no. 2, pp. 52–55, 2002.
[5] 3GPP, “3GPP TS 32.111: Alarm Integration Reference Point (IRP),” 2004.
[6] I. Bose and R. K. Mahapatra, “Business data mining, a machine learning perspec-
tive,” Information & Management, vol. 39, no. 3, pp. 211–225, 2001.
[7] S. Wallin and V. Leijon, “Multi-Purpose Models for QoS Monitoring,” in 21st Inter-
national Conference on Advanced Information Networking and Applications Work-
shops (AINAW’07), pp. 900–905, IEEE Computer Society, 2007.
[9] C. L. Forgy, “Rete: a fast algorithm for the many pattern/many object pattern
match problem,” IEEE Computer Society Reprint Collection, pp. 324–341, 1991.
[10] M. C. Penido G, Nogueira J.M, “An automatic fault diagnosis and correction sys-
tem for telecommunications management,” in Proceedings of the Sixth IFIP/IEEE
International Symposium on Integrated Network Management, pp. 777–791, 1999.
[15] R. D. Gardner and D. A. Harle, “Alarm correlation and network fault resolution
using the Kohonen self organising map,” in Global Telecommunications Conference
(GLOBECOM’97), vol. 3, 1997.
[19] G. Liu, A. K. Mok, and J. E. Yang, “Composite events for network event correlation,” in Proceedings of the Sixth IFIP/IEEE International Symposium on Integrated Network Management, pp. 247–260, 1999.
[21] R. D. Gardner and D. A. Harle, “Methods and systems for alarm correlation,” in
Global Telecommunications Conference (GLOBECOM’96), vol. 1, 1996.
[22] H. Wietgrefe, K.-D. Tuchs, K. Jobmann, G. Carls, P. Fröhlich, W. Nejdl, and S. Ste-
infeld, “Using neural networks for alarm correlation in cellular phone networks,” in
International Workshop on Applications of Neural Networks to Telecommunications
(IWANNT), 1997.
[24] D. Levy and R. Chillarege, “Early Warning of Failures through Alarm Analysis-A
Case Study in Telecom Voice Mail Systems,” in Proceedings of the 14th International
Symposium on Software Reliability Engineering, p. 271, IEEE Computer Society
Washington, DC, USA, 2003.
Authors:
Viktor Leijon, Stefan Wallin, and Johan Ehnmark
© 2008, IEEE
SALmon - A Service Modeling Language and
Monitoring Engine
Abstract
To be able to monitor complex services and examine their properties we need a mod-
eling language that can express them in an efficient manner. As telecom operators deploy
and sell increasingly complex services the need to monitor these services increases.
We propose a novel domain specific language called SALmon, which allows for efficient
representation of service models, together with a computational engine for evaluation of
service models. This working prototype allows us to perform experiments with full scale
service models, and proves to be a good trade-off between simplicity and expressive power.
1 Introduction
Operators want to manage services rather than the network resources which are used to
deliver the services. This change of focus is driven by several factors: increased competition, more complex service offerings, distribution of services, and a market for Service Level Agreements [1].
A result of this transition is an increasing need to predict, monitor and manage the
quality of the service that is delivered to the end users. However, the complexity of
understanding and modeling services is a serious obstacle.
We want to find a way to model Services, Service Level Agreements and the structure
underlying them.
Service modeling is intrinsically hard, since we need to express calculations, types and
dependencies. Current UML-based approaches tend to hide this without really providing
the expressive strength needed. On the other hand, using traditional object-oriented
programming languages gives the expressive strength but creates a gap between the
model and the domain experts. Time-dependent calculations are often complicated or
unnatural to express in these languages.
An implementation challenge is to manage the volume of service types and instances.
Service providers have large infrastructure and service portfolios. There can be several
million cells, edge devices, areas and customers.
Managing a large number of object instances with calculated Key Performance Indicators, including indicators which are calculated over time intervals, poses computational challenges that are not addressed in current solutions. Time is an inherent dimension in
service monitoring and SLA management for several reasons. We need to be able to manage late arrival of data: there may be delays between the collection of a key performance indicator and its introduction into the SLA system, for instance due to batching. The
actual time-stamp must be used in the overall calculation of status which may require
recalculation. Operators also want to make “time-journeys”, looking backwards and for-
wards to understand how service quality has developed. Furthermore, SLAs contain time
variables in the form of requirements on time-to-repair and availability measurements.
It is vital to be able to provide different views for different users. Naive attempts
to model services use a tree structure where Key Performance Indicators are aggregated
upwards in the tree. However, different roles in the organization require different types
of aggregation views: per customer, per site, per area, per service, and ad-hoc grouping
of service instances.
The main components of our solution are a dedicated service modeling language and
a run-time environment for calculating the service status. The language is an object-
oriented functional language tailored to the domain-specific requirements.
This paper makes the following main contributions towards a useful service monitoring
engine:
• We give an overview of the design considerations that went into SALmon, a novel
language for writing service descriptions (Section 2).
• Finally we examine a few typical scenarios and how they can be handled in our
system (Section 4).
Inputs define a time-indexed variable that is mapped to an external data source. Typical
external sources are probes, alarms, performance data and trouble-tickets.
Anchors label connections to other class instances and hence provide the basis for build-
ing structures from service objects.
The definition layer only defines the name of the anchor and its multiplicity, so that
an anchor is defined to have either exactly one anchored instance or zero or more.
Properties are values that can be left undefined in a class definition to yield an ab-
stract base class. Properties can be defined or redefined in sub classes to model
differentiated service levels.
Attributes define calculation rules for parameters in a strict purely functional language.
The calculation rules have knowledge of which instance of the attribute is being
evaluated, and can use that together with attributes, properties and other anchored
objects to calculate values.
Code reuse is facilitated through an inheritance system where subclasses can override and redefine attributes and properties.
The definition layer cannot create new objects, only define classes. The sources
for inputs are not defined here. Different systems can feed the same input and using
undefined inputs will result in undefined results.
2.3 Example
We illustrate our language with a simple model with service objects for a mobile network.
The purpose of the model is to provide SLAs for mobile voice services in dedicated areas.
[Figure: sketch of the sample model, with Area 1 anchored to Cell 1 and Cell 2]
The input data source is trouble tickets, which cover both technical problems derived from alarms and customer complaints. A sketch of the sample model is shown in Figure 2.3.
Four classes are defined in Figure 2. The first class, GSMCell, defines a single input:
the number of open trouble tickets associated with the cell. The boolean attribute ok is
true only when the number of open tickets is zero.
The second class, GSMArea, defines an anchor point for cells. It also defines another attribute ok, which depends on the ok attribute of all anchored cells. When the area is instantiated it is anchored to the cells that cover an important area for an SLA customer, such as an enterprise main office.
The third class, GSMService, aggregates areas into a general service perspective.
The fourth class, GSMServiceLevel, contains rules to calculate downtime and conformance to a service level agreement defined by properties. This represents the service sold to the end customer, and downtime is calculated as the total time with open tickets.
We define two different service levels in Figure 3, GSMServiceLevel1 and GSMServiceLevel2, which are subclasses of GSMServiceLevel in which the properties have been fixed.
We are now ready to show the instantiation of a small service in Figure 4. It builds
a service level service1 which monitors a GSMArea called area1 made up of two cells,
cell1 and cell2. This example corresponds to a customer who has bought a service
level agreement for their main office with a maximum outage of 24 hours per six month
period.
Service monitoring needs to be integrated with external tools such as alarm and
trouble ticket systems to notify operators about problems. This is handled by allowing
external systems to subscribe to attributes. This mechanism also allows us to separate
the presentation in the user interface from the design of the calculation engine.
class GSMCell
    input openTickets
    ok = (openTickets == 0)
end

class GSMArea
    anchor * cells
    ok = allTrue cells.ok
end

class GSMService
    anchor * areas
    ok = allTrue areas.ok
end

class GSMServiceLevel
    anchor service
    property OutageMeasurementPeriod
    property MaxOutageTime
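The downtime and conformance rules of GSMServiceLevel are only summarized in the text, so the following Python sketch shows one plausible reading: downtime is the total time within the measurement period during which the anchored service is not ok (i.e., has open tickets), and the SLA is met while the downtime stays below MaxOutageTime. The sampling-based calculation is an assumption made for illustration.

def downtime(ok_samples):
    # ok_samples: time-ordered list of (timestamp, ok_bool); sum the time spent not-ok.
    total = 0.0
    for (t0, ok), (t1, _) in zip(ok_samples, ok_samples[1:]):
        if not ok:
            total += t1 - t0
    return total

def sla_met(ok_samples, max_outage_time):
    # True while accumulated downtime in the measurement period is below MaxOutageTime.
    return downtime(ok_samples) <= max_outage_time

# Example: a service level allowing at most 24 hours of outage (timestamps in seconds).
# sla_met(samples, max_outage_time=24 * 3600)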
3 Prototype Implementation
We have implemented an early prototype of the SALmon language runtime and inter-
preter using the JavaTM J2SE Framework [2] and the ANTLR parser generator [3].
3.1 Classes
In the current implementation service models can be built from the basic building blocks:
3.2 Expressions
Inputs and attributes can both be seen as lists of time-stamped values. In this sub-
section we will refer to inputs and attributes as time variables viewed as lists of tuples
(V, T ) where V is the value and T is the time-stamp. With this view we abstract the fact
that the values of inputs are available as semi-static data from external sources while
attributes are calculated on demand by the runtime engine.
The expression for an attribute evaluates using the other available time variables,
namely
class Service
    anchor system
    // Request the status of the last day and return the worst one.
    dailyStatus = worstOf system.status@(NOW, NOW-1day)
end
A time variable can also be retrieved by specifying a time range. Examples of how time variables are evaluated are given in Figure 5.
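A minimal Python sketch of the two retrieval modes for a time variable, viewing it as a list of (V, T) tuples as above: a request at a single time-stamp and a request over a time range. The helper names are illustrative and assume the list is time-ordered.

def at(series, t):
    # Value at time t: the latest sample with a timestamp at or before t (None if none exists).
    candidates = [v for v, ts in series if ts <= t]
    return candidates[-1] if candidates else None

def between(series, t_start, t_end):
    # All values whose timestamps fall in the range; this is the list that, e.g., worstOf reduces.
    return [v for v, ts in series if t_start <= ts <= t_end]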
The need to handle lists of values arises as a consequence of two things: anchors
aggregating multiple sub-service instances and processing time-intervals of inputs or at-
tributes.
List comprehension is provided through the common higher-order list processing functions map, fold and filter:
map applies a unary function on all items in a list and returns a list of the result.
fold reduces a list into a single value by recursively applying a binary function on a list,
for example when summing a list of numbers.
filter takes a list and returns only the values accepted by a predicate or unary boolean
function.
The function arguments of higher order functions can be supplied either as an anony-
mous function or a named helper function. Helper functions depend only on their explicit
arguments, and as such can be considered as purely functional. All functions are call
by value. Basic operations for arithmetic, boolean logic and comparison are also imple-
mented.
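Built-ins used in the examples, such as allTrue and worstOf, can be understood in terms of these three functions. A hedged Python sketch, using functools.reduce for fold and assuming that larger status values mean worse degradation:

from functools import reduce

def all_true(values):
    # allTrue: fold boolean AND over a list, e.g. over cells.ok.
    return reduce(lambda a, b: a and b, values, True)

def worst_of(statuses):
    # worstOf: fold with a 'worse than' comparison (here simply the maximum).
    return reduce(max, statuses)

failed = list(filter(lambda ok: not ok, [True, False, True]))  # filter keeps the not-ok entries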
The default state of the runtime is a resting state. As a request for an attribute at a
certain time-stamp may depend on the calculation of other attributes, this will result in
what can be considered a directed graph of calculation units where non-connected units
can be calculated independently and hence in parallel.
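One way to realize such a calculation graph is on-demand evaluation with memoization: a request for an attribute recursively requests the attributes it depends on, and each (object, attribute, time-stamp) result is cached so that shared sub-results are computed once and independent subtrees could be evaluated in parallel. The sketch below illustrates the idea; it is not the SALmon runtime.

cache = {}

def evaluate(obj, attribute, t, rules, dependencies):
    # rules[(obj, attribute)] is a function of a dict of dependency values;
    # dependencies[(obj, attribute)] lists the (obj, attribute) pairs it needs.
    key = (obj, attribute, t)
    if key in cache:
        return cache[key]
    needed = {dep: evaluate(dep[0], dep[1], t, rules, dependencies)
              for dep in dependencies.get((obj, attribute), [])}
    cache[key] = rules[(obj, attribute)](needed)
    return cache[key]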
In the prototype all data mapped to inputs reside in a database. Attaining satisfactory
database performance is one of the main issues under investigation.
4 Scenarios
This section illustrates how to apply SALmon for fulfilling a few typical requirements on
Service Monitoring systems.
class SNNode
    anchor * children
    property propRule, calcRule
    property name
    ownStatus = OK
    status = snFunc propRule ownStatus children
5 Related Work
One of the most important sources for service and SLA modeling is the SLA handbook
from TM Forum [5]. It provides valuable insights into the problem domain but not to
the actual modeling itself.
TM Forum has also defined an accompanying service model, SID, the “Shared Information/Data Model” [6]. SID is comparatively high level and models entities in telecom operators’ processes. However, SID is being refined and moving closer to the resources
by incorporating CIM [7].
The Common Information Model, CIM, has an extensive and feature-rich model in-
cluding a modeling language MOF (Managed Object Format). Key strengths in CIM
are the modeling guidelines and patterns. However, CIM faces some major challenges
since the UML/XML approach tends to create unwieldy models. It is also aimed more
at instrumentation than end-to-end service modeling.
Some of the major players behind CIM are now working on the “Service Modeling
Language”, SML [8]. SML is used to model services and systems, including their struc-
ture, constraints, policies, and best practices. Each model in SML consists of two subsets
of documents: model definition documents and model instance documents. Constraints are expressed in two ways: XML Schema defines constraints on the structure and contents, whereas Schematron and XPath are used to define assertions on the contents.
An interesting feature is that SML addresses the problem of service instantiation by
providing XSLT discovery transforms to produce instances from different sources. Other
attempts exist to specify service models as component interaction with UML collabora-
tions [9]. This kind of service modeling serves purposes closer to the design of systems
than service models for QoS metrics.
A simple and pragmatic model for a general service model is given by Garschhammer
et al. [10]. This work serves as a guide for modeling and identifies several important
research areas.
SLAng [11] is a language focused on defining formal SLAs in the context of server
applications such as web services. It uses an XML formalism for the SLAs. SLAng
identifies fundamental requirements needed in order to capture SLAs but differs from
our current effort in that it “focuses primarily on SLAs, not service models in general”.
When it comes to programming languages with an inherent notion of time, Benveniste
et al. [12] give an overview of the synchronous languages. These languages have the
concepts of variable relationships and of computing values based on the previous value
of a variable. However, they have no notion of retaining values after the computation,
and they have a discretized notion of time.
Perhaps most closely related to SALmon is the notion of stream data managers [13],
which take a more database oriented approach to the problem. This makes their syntax
less suitable for service models, and means that they have a stricter view on time progress.
However, a lot of the underlying work may be reused in the current setting.
References
[1] S. Wallin and V. Leijon, “Multi-Purpose Models for QoS Monitoring,” in 21st Inter-
national Conference on Advanced Information Networking and Applications Work-
shops (AINAW’07), pp. 900–905, IEEE Computer Society, 2007.
[7] Distributed Management Task Force, “CIM Specification.” Version 2.15.0, 2007.
[9] R. Sanders, H. Castejon, F. Kraemer, and R. Bræk, “Using UML 2.0 collaborations
for compositional service specification,” ACM/IEEE 8th International Conference
on Model Driven Engineering Languages and Systems (MoDELS), 2005.
[11] D. Lamanna, J. Skene, and W. Emmerich, “SLAng: A Language for Defining Service
Level Agreements,” Proc. of the 9th IEEE Workshop on Future Trends in Distributed
Computing Systems-FTDCS, pp. 100–106, 2003.