KPN_Thesis_Final
MASTER OF SCIENCE
by
Thomas Broens
Graduation Committee
The internet and mobile phones have become essential in our lives. Without the internet and our smartphones, large parts of our society would stop functioning or even collapse completely. It is therefore of critical importance that the service is of the highest quality and that the delivery of service is not interrupted. Telecommunication providers are responsible for operating and maintaining the infrastructure necessary for delivering mobile and internet connections. It is not an easy market to survive in: competition is fierce and the clientele demands constant service of the highest quality. This is a major challenge for telecommunication providers. To deliver the best quality, maintenance is necessary, but maintenance is often the cause of service interruptions. Combined with the fact that technology is evolving at such a pace that upgrades come faster and faster, the demand for maintenance has never been higher. Efficiently organising these maintenance activities has therefore become a priority.
The largest telecommunication provider in the Netherlands is facing the same challenges and has therefore commissioned this research. Using KPN as a case study, the goal of this research is to expand the knowledge on efficiently organising maintenance activities in the telecommunication sector. In the specific case of KPN, efficiently organising maintenance activities means increasing the maintenance pace without increasing the risk of impacting clients. Currently, maintenance that can impact the clients of telecommunication providers is performed at times when the impact is lowest: during so-called maintenance windows. These windows are in the middle of the night, usually between 04:00 and 07:00 in the morning. Determining what type of maintenance is performed during these windows is fairly simple: if the maintenance activity has impacted a client in the last year, it is classified as a 'Normal' change, which means it must be performed during maintenance windows. But because the amount of maintenance is growing and it often impacts clients, the windows are getting crowded. This leads to an increase in costs and the threat of not reaching goals set by KPN. The goal of this research is to find a way to unburden the maintenance windows of less risky activities, in order to create room for high-risk maintenance activities that really need to be performed during maintenance windows. To achieve this goal, the following research question is formulated:
How can the use of a low-level/micro risk model legitimise the classification of
changes in the telecommunication sector, in order to make change planning more
efficient?
The research approach selected to create an artifact that can answer the research question is 'Design Science Research'. By structuring the research along six activities, the necessary objectives can be defined, the artifact can be developed, the artifact can be demonstrated
and the results evaluated. Using risk-based maintenance (RBM) methodology, which has proven effective in other comparable sectors, the objective is to create a risk model that analyses maintenance activities in more detail. This can then be used to reclassify maintenance activities in such a way that they do not need to be performed during maintenance windows. The type of risk model that fits the goals of this research is a bow-tie model. Based on historic data, an overview of threats, consequences and damage categories has been created for four types of maintenance activities, as well as the probabilities of these threats and consequences occurring. Using these probabilities, risk ratings have been calculated. The risk ratings can then be used to compare maintenance activities with each other and define classifications. The results of creating these bow-tie models and calculating the risk ratings have led to four recommendations for KPN.
The first recommendation is to start recording performance data on a micro level. Performance data refers to data that shows how changes are being performed; examples are reasons for failure and consequences of failure. Because this data is detailed and varies for every change type, it classifies as micro-level data. This is not yet happening at KPN, and the results of this research point out that recording and using micro data can help improve the success rate of maintenance activities, which in turn makes the change planning more efficient. The second recommendation is that, to be able to fully confirm that using micro risk models legitimises classifications of changes, a more in-depth evaluation is necessary. While it has been shown that, compared to the current method of classification, using a micro risk model provides more insight based on historic data, it has not yet been decisively shown that it can actually be used to achieve its initial goal: increasing the efficiency of the change planning. To prove it can actually increase the efficiency of change planning, the method must be tested on more types of changes. For that to be possible, more micro performance data needs to be recorded. The third recommendation is to automate the risk rating calculations. This can be achieved in several ways, such as by creating a piece of software. An automated process would make the method more applicable in the organisation. The last recommendation is to combine the results of this research (and any expansion on this research) with research in other domains. Looking at client management, contract management and process optimisation in order to improve change planning efficiency could provide the missing pieces of the puzzle to solve the problems KPN is facing.
Contents
2 Background 23
2.1 Risk management and maintenance in general . . . . . . . . . . . . . . . . . . . 23
2.2 Risk based maintenance methodology . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.1 RBM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Risk assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3 Design & Development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Basic bowtie diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.2 Detailed bowtie diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4 Risk management in the telecommunication industry . . . . . . . . . . . . . . . 31
3.2.6 Damage categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.7 Barriers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Demonstration 43
4.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.1 Bow tie details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1.2 Sensitive data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2 Bow-tie models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 1-10G model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.2 1G model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.3 NT model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.4 WAP model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3 Risk matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.1 The Z-score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.3.2 Colouring scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3.3 IT applications and mobile services risk matrix . . . . . . . . . . . . . . 49
4.3.4 Services: telephony, internet, iTV, Digitenne, Wholesale transport & access services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.3.5 Secondary services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.6 Reputation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.7 Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.8 Security . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3.9 Business impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.10 Services, telephony and internet for large businesses . . . . . . . . . . . 57
4.4 Risk rating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.5 Demonstration conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 Evaluation 62
5.1 Objective 1 of the risk model: Providing insight . . . . . . . . . . . . . . . . . . 62
5.1.1 1-10G migration comparison . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.2 1G migration comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.3 NT migration comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.1.4 WAP migration comparison . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.1.5 Conclusion - Objective 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 Objective 2 of the risk model: Numerical substantiation . . . . . . . . . . . . . 64
5.2.1 Risk rating comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2.2 Plan for further evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3 Objectives for KPN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.3.1 Objective 1: Increase the percentages of FTR performed changes . . . . 67
5.3.2 Objective 2: Minimise the risk of impact on the clients . . . . . . . . . . 67
5.3.3 Objective 3: Increase the maintenance pace . . . . . . . . . . . . . . . . 67
5.4 Evaluation conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6 Conclusions 68
6.1 Sub question 1: What risk model is appropriate to achieve the required results? 68
6.2 Sub question 2: What information is needed to create the risk model? . . . . . 68
6.3 Sub question 3: How does a micro risk model help classify changes differently? 69
6.4 Sub question 4: How does changing classification of changes make change plan-
ning more efficient? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.5 Sub question 5: Does the risk model legitimise changing the classification of
changes? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6.6 Main research question . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
B Threat percentages 85
B.1 Percentages 2018 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
B.2 Percentages 2019 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
E Representativity 98
List of Figures
E.1 Representativity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
List of Tables
Chapter 1
While a world where the internet and mobile phones were not considered basic necessities existed a mere 25 years ago, this would now be unimaginable (Castells, 2014; Sarwar & Soomro, 2013). As businesses, governments and all organisations that make day-to-day life as we know it possible could not function properly without the internet, our whole society would collapse if the internet were to disappear (Aceto, Botta, Marchetta, Persico, & Pescapé, 2018). But even 'normal' consumers would have a tough time adjusting to a life without the internet, especially considering the impact of smartphones, which made access to the internet possible for individuals even on the move. Communicating with others would be a lot slower, research and studying would take a lot more work, any act of administration would become more complicated, and the list goes on. Basically, life would become tougher to manage.
The infrastructure that makes this all possible is provided by telecommunication providers. These companies install and maintain the equipment necessary to connect users, corporate and regular, to the internet and to each other. Losing connection to the internet completely would be catastrophic, but even a loss of connection for a couple of hours is considered by many as potentially problematic. Maintenance of the network is therefore an essential task performed by telecommunication companies. This creates a challenge for providers. While activities such as updates, migrations and maintenance are important to prevent outages, performing these activities also causes disturbances in service provision. Combined with the fact that technology is developing at an ever-growing pace, this results in a greater need for maintenance activities to be performed. But the risks involved need to be managed carefully.
An issue telecommunication providers face because of this is how best to organise these activities. Most corporate clients have agreements in their contracts that prohibit work on the services they acquire outside of certain time windows. During those windows, providers are free to perform the activities they need to do, as long as the clients are warned in advance. These windows are limited, and with the amount of activities that need to be done now and in the future, they run the risk of becoming bottlenecks. Performing these activities outside of the agreed-upon windows could be a potential solution, but comes paired with risks. These risks need to be carefully managed, but should not restrict efficient organisation of maintenance activities. It is in light of this that Koninklijke PTT Nederland N.V. (KPN), the biggest communications service provider in the Netherlands, has an interest in optimising the organisation of these activities. The goal of this thesis is therefore to improve the efficiency of maintenance planning in the telecommunication sector, balancing the importance of the maintenance activities against the risks. The following sections will expand
on the factors that have led to this situation. Then the core concepts used during this research are defined, which will lead to the knowledge gap that forms the foundation of the research. This will serve as the basis for the justification of the research, the research questions, the methods and the tools.
There are different ways to maintain the current client base. The main goal of all these measures is to keep the client happy by increasing the quality of experience (QoE). In the article 'Using big data to improve customer experience and business performance', simple representations of the customer and communications service provider (CSP) lifecycles are given (Spiess et al., 2014). With a model proposed by Alcatel-Lucent that represents a comprehensive view of QoE, key drivers of customer satisfaction are identified across these lifecycles (Alcatel-Lucent, 2012).
It is in the consume phase of the customer lifecycle (Figure 1.1) where a large component of QoE is determined (Spiess et al., 2014). This is the "in-service" experience of the client, meaning the quality of the network, the quality of the service, the performance and the ease of operation. The next important part for QoE is located in the support phase. This is the "care" experience of the client. Care stands for the categories of problems that impact the customer
experience. The next component of the holistic view is "perception", which relates to the awareness, interact and reward phases of the lifecycle. A big part of this is the brand's image, the perception of the offered value, loyalty programmes and promotions. The last component is "ease", which touches on the agree/get phase (activation) and the pay phase.
Knowing this helps organisations in the telecommunication market decide where and what to focus on. As a large component of QoE is determined by the "in-service" experience of the client, many organisations put a lot of effort into improving the quality of their network and services. This means trying to have clients, both consumers and corporate clients, on the best possible networks, increasing the reliability, speed and quality of their connection, with the goal of having better-performing networks than the competition. The main difficulty is the fast pace at which technology is evolving, leading to an exponential increase in the volume, velocity and variety of data from both users and communication networks (Musolesi, 2014). This means that new and better technology becomes available so fast that organisations in the telecommunication sector are in constant need of upgrading their networks to handle all this data. A consequence is that migration, maintenance and upgrade activities are increasing. This was already starting to be a problem 10 to 15 years ago and is only becoming a bigger issue for the telecommunication market (Kamoun, 2005).
Another reason for wanting to migrate and update networks is the number of legacy systems that are still being used. Wanting the highest quality and speed is an important motivator, but decreasing the risk of network failure is just as important. Legacy systems run a higher risk of malfunctioning and potentially taking down large parts of the network. The problem of legacy systems overlaps with two other important reasons why telecommunication service providers put a lot of effort into network maintenance. It is important for the image of a telecommunication provider to have no failures: massive or regular interruptions of service reflect badly on the image of the provider, which makes attracting and retaining customers even more difficult. The second factor is the power consumption of legacy systems. Estimates say that ICT is responsible for two to four percent of worldwide carbon emissions (Lubritto et al., 2011; Vereecken et al., 2011). New technology is more energy efficient. With attention on the environment having massively increased, many providers try to minimise the power consumption of their network. This also has a big impact on the image of the provider.
High costs and never-ending maintenance on the network, in a market where it is already difficult to thrive, are not the only concerns for telecom providers. Another issue is the disruption of service delivery caused by the maintenance performed on the network. Any form of maintenance, upgrade or migration from an old network to a new network affects clients by creating short periods of downtime. This can become a longer period when any of these activities fail and cause errors.
It is for this reason that many providers in the Netherlands make use of what is called a maintenance or service window. These windows are time slots in which maintenance work can be performed with minimal impact on users, and they differ per provider. Koninklijke PTT Nederland N.V. (KPN), the biggest communications service provider in the Netherlands, has defined its time slots as either 05:00 - 07:00 or 00:00 - 07:00, depending on the type of maintenance. The idea is that between those hours, a minimal number of users are actively using the network, so a disturbance in service delivery or an all-out failure
because of maintenance work will have the least impact. While this is a good idea, several problems have come up for KPN. First of all, it is difficult to find enough mechanics who are willing to work those hours, which means not all work planned during these windows can be performed. In addition, the costs of night work are much higher than those of work done during the daytime. Another issue is that the time periods are fairly limited. Combined with the increasing amount of work, these maintenance windows are becoming bottlenecks for certain goals to be reached. KPN is therefore looking for a way to decrease the amount of work being done during those windows.
Translating this to the telecommunication sector and its infrastructure results in the following. Preventive maintenance would be the migration of old networks to newer networks. These types of migrations are done to improve performance by utilising newer technology and to prevent older technology from failing. Corrective maintenance is replacing equipment that has already malfunctioned; this happens, for example, when older technology has not been replaced on time. Predictive and reinforcement maintenance are not yet used in the telecommunication sector.
In order to go into more detail on these types of maintenance activities, information about how this is done by telecommunication providers is necessary. But no such method, standardisation or information is available. Therefore we can only rely on informal talks at KPN, who commissioned this research. One reason for the lack of this information that is often mentioned during such talks is the competitiveness of the telecommunication market: being able to perform maintenance and migration activities more efficiently is believed to increase the competitiveness of a telecommunication provider. The consequence of companies refraining from sharing this information is that there is no research available. This means that there are only two valid sources of information: KPN itself and other industries where research about maintenance activities is more readily available.
maintenance and risks. The first concept, maintenance windows, is essential for understanding what is being researched and why. To be able to understand the delimitation of the research, understanding the KPN network is important. The last concept that is important to this research is the current policy and process regarding the maintenance planning.
Maintenance windows are used when the work has an impact on the client, when there are risks of higher impact than expected (time-wise and/or in extent of impact), and when communication with the client is needed so they can prepare for possible disruptions. To minimise impact on the client side, these windows are planned at night. The expectation is that clients are less active (or not active at all) in the middle of the night and can therefore better absorb a disruption of the service. It is important to note that no research has been done by KPN, internally or externally with clients, to determine the ideal time to perform these activities. The most common maintenance window at KPN is between 04:00 and 07:00, but depending on the type of maintenance and agreements with clients, windows can vary. The hour at which this kind of work is performed has an impact on mechanic availability and hourly rates, as night shifts are more expensive. Another important limitation related to maintenance windows is that, because of having to work during certain time frames, the amount of work that can be planned is restricted. This does not have to be a problem, but it requires careful planning and clear requirements to be used efficiently.
of request, the request is classified. The planning and the number of security checks are based on that classification. The following types of classification exist (KPN, 2016):
• Normal change: A change that follows the Change Management procedure, starting with a request for change. It is authorised in conformance with the agreed governance, after which the change is planned, implemented and evaluated.
• Emergency change: Emergency changes are changes that must be introduced as soon as possible to resolve a major incident or to prevent one.
• Urgent change: Urgent changes need to be introduced before the standard lead time, but can wait for at least 24 hours.
• Service request: Generic varying types of demands that are placed upon the IT
department by users.
When a change has been classified as a standard change, it means that most of the time it can be planned outside maintenance windows and will not go through severe security checks and discussions. A list exists of the changes currently classified as standard. For normal changes, a run book is filled in and delivered with the request. An important aspect of the run book are the questions that relate to impact, experience, complexity and precautions. These criteria are measured through KPIs, of which the following are the most important:
• Change triggered Be Alerts (CTBA): When a situation arises that has a certain impact on service delivery for clients, an impact on KPN, or an impact on clients as well as KPN, it is called a Be Alert. There are five levels of severity of a Be Alert, represented by the colour of the Be Alert. The colours used are green, blue, yellow, orange and red, and their severities are minor, moderate, significant, major and critical respectively. The criteria that decide which colour Be Alert a problem is classified as can be found in the classification matrix in Appendix A. Through weekly and monthly reports of how many changes and what types of changes have led to a Be Alert, the change request process is managed. This is done by looking at weighted and unweighted numbers of change triggered Be Alerts and the weighted downtime caused by the Be Alert.
• First time right (FTR): A change is considered FTR when it has been implemented in the communicated window, impacting customers as predicted, and has not been rescheduled or cancelled after communication. For IT changes, the change should have no impact on IT services and needs to be completed without a rollback being necessary. When a change has been labelled nFTR (not First Time Right), this does not necessarily result in a Be Alert; consequences can vary from a change having to be rescheduled to a red Be Alert.
• Lead time: The time between starting and finishing a change is a crucial indicator of the risk and impact of a change. This criterion also plays a role in deciding whether a change needs to be done during a maintenance window, as a long lead time means a long time during which service delivery is interrupted. Changes that impact clients for a longer time are less likely to be performed during the daytime.
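As an illustration of how these KPI definitions could be operationalised, the sketch below encodes the Be Alert colour-to-severity mapping and the FTR criteria described above. The data structure and field names are hypothetical assumptions for illustration only, not KPN's actual schema.

```python
from dataclasses import dataclass

# Be Alert colours and their severities, as listed in the text.
BE_ALERT_SEVERITY = {
    "green": "minor",
    "blue": "moderate",
    "yellow": "significant",
    "orange": "major",
    "red": "critical",
}

@dataclass
class ChangeResult:
    in_communicated_window: bool    # implemented within the announced window
    impact_as_predicted: bool       # customer impact matched the prediction
    rescheduled_or_cancelled: bool  # rescheduled/cancelled after communication
    rollback_needed: bool           # relevant for IT changes

def is_ftr(result: ChangeResult) -> bool:
    """First Time Right, per the criteria described above."""
    return (result.in_communicated_window
            and result.impact_as_predicted
            and not result.rescheduled_or_cancelled
            and not result.rollback_needed)

print(BE_ALERT_SEVERITY["orange"])                      # major
print(is_ftr(ChangeResult(True, True, False, False)))   # True
```

A change that required a rollback would be labelled nFTR by this check, matching the FTR criterion for IT changes described above.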
The current way of determining the classification of a change is to look at the risks and impact. This is done from a macro perspective, as it is based on the yearly performance of a change. If a certain type of change has not resulted in client impact for over a year, it becomes eligible to be classified as a standard change. Other factors, such as the reasons a change type fails, the probability of it happening again or the actual impact, are not taken into account. While this does not create problems when there is enough room during maintenance windows, it does potentially put unnecessary pressure on the windows and increases the costs of performing the change.
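The current macro-level rule could be sketched as follows. The function name and input format are illustrative assumptions; KPN's actual implementation is not described here.

```python
from datetime import date, timedelta

def eligible_for_standard(impact_dates: list[date], today: date) -> bool:
    """Macro-level rule as described above: a change type becomes eligible
    for 'standard' classification if it has caused no client impact for
    over a year. `impact_dates` lists the days on which this change type
    impacted clients (hypothetical input format)."""
    one_year_ago = today - timedelta(days=365)
    return all(d < one_year_ago for d in impact_dates)

# Last client impact two years ago -> eligible for 'standard'
print(eligible_for_standard([date(2018, 3, 1)], date(2020, 6, 1)))  # True
# Client impact three months ago -> stays 'normal'
print(eligible_for_standard([date(2020, 3, 1)], date(2020, 6, 1)))  # False
```

Note that the rule uses only the binary fact of past impact; the failure causes, recurrence probabilities and impact magnitudes that the micro risk model adds are invisible to it.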
1.3.5 Conclusion
These core concepts are important for understanding the problem telecommunication providers are facing. The use of maintenance windows in the change request process and the location of the maintenance work all play a big role in managing the risks and impact of maintenance work. While the details have been taken from KPN, all telecommunication providers face the same kind of problem, in which these concepts play an essential role. Applying relevant methodology and theory in order to improve performance is the next step.
combined with terms related to the telecommunication sector, there are no results related to risk-based maintenance. While much can be found about the competitiveness of the market and the important role the quality of the network plays in it (Dahiya & Bhatia, 2015; Spiess et al., 2014; Van den Poel & Lariviere, 2004; Verbeke et al., 2012), optimising the upgrade and maintenance of the network is not much discussed.
This research has been commissioned by KPN because of the problems they experience in practice when organising maintenance. The maintenance workload of KPN is increasing, and the current method of determining risks and impact is leading to overfull maintenance windows. This is slowing down maintenance goals and is altogether a less efficient way of working: maintenance windows are more expensive and more limited in time than daytime windows. One of the main reasons the current way of determining risks and impact uses maintenance windows ineffectively is that risks and impact are determined at a macro level. As a result, many changes are not performed as efficiently as possible, and changes that might not need to be planned during maintenance windows are still planned during them. In order to be able to change the classification of those changes, the method used to determine risks and impact needs to legitimise a change in classification. Because the current method does not provide any means to legitimise new classifications based on more detailed information, a new method needs to be used.
Therefore the goal of this research is to use existing literature from other sectors to develop and extend risk-based maintenance planning on a micro scale for KPN. By testing this method at KPN, new insights can be generated for risk and maintenance management in the telecommunication sector. The results can be used in two ways. First, this research expands RBM methodology into a new sector and, depending on the results, gives an idea of whether it is a viable methodology for the sector. Second, in current RBM methodology the planning phase is focused on maintenance that impacts internal procedures, whereas in the telecommunication sector the risks and impact being managed are focused on external impact. This results in a different focus and therefore a different outcome of the method. Whereas in standard RBM the result is a maintenance schedule based on the results of the risk assessment, the result of this research will be a way for telecommunication providers to classify their changes based on risks, which in turn will affect how maintenance is planned in the future.
As KPN is the largest telecommunication provider in the Netherlands and is facing difficulties with its maintenance planning, this case is well suited for creating and testing methods similar to those used in comparable sectors.
The main goal of this research is to improve change planning in the telecommunication sector by implementing Risk Based Maintenance methodology, thereby extending RBM methodology in the telecommunication sector by providing an example of RBM being implemented in this sector. To achieve this, a risk model is developed that entails the risk assessment steps discussed in the RBM literature. This risk model (adjusted to the company) can then be used by telecommunication providers to help them classify their changes. For this thesis, the case of KPN is used, making the model in this research tailored to KPN's context. The risk model can then be used to answer the main research question of this thesis.
How can the use of a low-level/micro risk model legitimise the classification of
changes in the telecommunication sector, in order to make change planning more
efficient?
Making the change planning more efficient can mean different things. The Oxford dictionary defines efficient as "achieving maximum productivity with minimum wasted effort or expense". Being able to classify changes at a micro level of detail can lead to an increase in productivity and a decrease in effort and expense. The first way this can increase productivity is that working on a micro scale provides insight into what is causing a change to be performed nFTR; action can then be undertaken to improve on these causes. The second way it can increase productivity is by giving numerical substantiation for a change to be reclassified, meaning that changes currently being planned in maintenance windows, while their risks and impacts do not require it, can be planned during the day. This opens up space for other changes in the maintenance windows and increases the amount of work that can be done. The sub questions that help answer the main research question are as follows.
Sub questions:
(b) What hazards, events, threats and consequences are identified for the different
changes?
For each of the selected changes, the components of the risk model need to be
identified. As briefly discussed in section 1.3.4, those are hazards, events, threats
and consequences. This will provide the necessary insight for improving FTR
percentages.
(c) What are the chances of these threats and consequences happening and what is their
potential impact?
The next step in the risk model is determining the probability of each threat and
consequence happening and what their potential impact is. This makes it possible
to compare the risks and impacts of changes.
(d) What barriers are in place and what barriers could be put in place?
Barriers are another component of some risk models. They give extra insight into
what is already being done to prevent or mitigate threats and consequences. Deter-
mining what barriers are in place to prevent or mitigate threats and consequences
is important for determining what can still be done to improve.
The second question consists of five sub questions, which will lead to the construction of a
risk model. The risk model itself has two functions that will help improve efficiency of change
planning:
1. Provide insight in what can be done to improve current change activity in maintenance
windows.
2. Provide numerical substantiation for discussions about why certain changes have to be
planned during maintenance windows and others don’t.
The first point relates to the information that is gathered to create the model. Identifying
all the threats of a type of change, and calculating for each of them how often they lead to
a failed migration and what the possible consequences are, gives insight into how the current
approach is functioning. Knowing this gives the user of the model (most likely the project
manager in charge of the migration) the necessary information to act and improve. The user
will know what kind of threat to focus on to increase the success rate, in turn wasting less
effort and expense.
After expanding the model with statistics, calculations can be done to estimate probabilities.
These can be further used to give risk and impact ratings to different types of changes.
Comparing these ratings can then lead to a new classification of changes based on a company's
threshold value for risk. This threshold defines how much risk a company is willing to take
to be able to perform maintenance and migrations more efficiently, and it makes comparing
different types of changes easier.
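To make this threshold-based reclassification concrete, the sketch below rates change types and labels each as day work or maintenance-window work. All change types, probabilities, impact scores and the threshold are hypothetical illustrations, not KPN's actual figures, and the simple probability-times-impact rating is only one possible scheme:

```python
# Sketch: rate change types and reclassify them against a
# company-defined risk threshold. Figures are illustrative only.

def risk_rating(probability: float, impact: float) -> float:
    """A simple risk rating: probability of failure times impact score."""
    return probability * impact

def classify(change_types: dict, threshold: float) -> dict:
    """Label each change type: at or below the threshold it may run
    during the day, otherwise it stays in a maintenance window."""
    return {
        name: "day" if risk_rating(p, i) <= threshold else "maintenance window"
        for name, (p, i) in change_types.items()
    }

# Hypothetical change types: (probability of failure, impact score)
changes = {
    "router firmware upgrade": (0.05, 8.0),
    "cable patch":             (0.02, 2.0),
    "DSLAM replacement":       (0.10, 9.0),
}

print(classify(changes, threshold=0.2))
```

Raising the threshold moves more change types out of the maintenance windows, at the cost of accepting more risk of client impact.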
In "Positioning Design Science Research for Maximum Impact", Gregor and Hevner present a DSR
knowledge contribution framework (Gregor & Hevner, 2013). Based on solution maturity and
application domain maturity, a two-by-two matrix of research project contexts is created. The
quadrant where solution maturity is high and application domain maturity is low is called the
Exaptation quadrant. Research in which knowledge and solutions from other sectors are refined
and used in a new sector falls into this category. As sectors other than the telecommunication
sector have successfully used RBM, and one of the goals of this research is to improve a part
of the maintenance process, it was fitting to use this methodology to adapt RBM to a new
field. The artefact is the new process, based on a risk model.
There are several DSR approaches developed for information systems research. One methodology,
presented by Peffers et al. in 'A Design Science Research Methodology for Information Systems
Research', suggests six activities for carrying out research based on design science research
principles (Peffers, Tuunanen, Rothenberger, & Chatterjee, 2007). These activities are problem
identification and motivation, defining the objectives for a solution, design and development,
demonstration, evaluation and communication. A brief explanation will be given for each
activity, followed by an explanation of how it is implemented for this research.
Activity 2. Define the objectives for a solution. The second activity serves to define
what solutions are possible and feasible. The objectives can be quantitative or qualitative,
and they help define what data collection will be necessary. The goal of the created artefact
is to improve the change request processes in such a way that it helps relieve the maintenance
windows. One way to achieve this is by improving the request processes through the use of
data. By using known data, such as the risks and impact of a change request, the improved
processes will be able to distinguish which change requests need to be handled during
maintenance windows and, more importantly, which do not. For this to succeed, the current
process needs to be analysed, data about risks and impact needs to be gathered, and a risk
analysis framework needs to be put in place. This activity provides the theory and sets all
the conditions for answering sub question one.
Activity 3. Design and development. The core activity in any design science discipline
is designing and developing the artifact. Such an artifact can be any designed object that
contributes to solving the problem, as well as to the literature. To be able to do this, its
desired functionality and architecture need to be determined. The desired artifact for KPN is
a risk analysis model that can be used to compare the risks and impacts of different changes.
By collecting historic data about the changes that have already been performed, all hazards,
threats and consequences can be identified. Based on that information, changes can be
classified as needing to be performed inside or outside of maintenance windows, which could
help relieve the maintenance windows. Activity three provides the answers to sub question two.
changes and how the rates can be compared, sub question 3 is answered.
Activity 5. Evaluation. The evaluation activity is needed to evaluate if the created arte-
fact actually contributes to solving the problem. To evaluate the effectiveness of the new risk
model, a comparison is made between the old risk classification and the new risk rating. This
will show if the new risk model results match with the old model, while also showing how
they are different. The main objective of the new process would be to reduce the amount of
work planned during maintenance windows. That is the first evaluation criterion for
comparing the solution objectives to the functionality of the artifact. Based on the result, the
decision can be made to return to activity three to improve the artifact or to continue to the
next activity. Evaluating the model and its results provides the answer to sub question four.
Comparing the new process and the old process shows how changing the classification can
improve the efficiency of the change planning. By combining the theory, the demonstration
and the evaluation an answer can be formulated to sub question five.
To summarise, the first step will be a literature study into risk management and risk based
maintenance in different sectors. This will result in a specific research problem. The second
step will be to define the objectives and the methods to reach those objectives. Step three
will be creating the artifact chosen to reach the set objectives; in this case, a risk model
intended to rate different types of changes in order to classify them. Step four will be
performing the calculations to rate the four example change types. The results will be
ratings and impacts on different damage categories. Step five will consist of comparing these
results with the current method and classification. Step six will conclude and reflect on
this research's limitations and results.
1.6 Structure
The activities described in section 1.5.1 form the structure of this research. Chapters
one and two contain activity one and provide the necessary context for activity two. In
these chapters the problem identification and motivation are given, as well as the necessary
background to provide theory and objectives for a solution. Chapter three describes the
case-related context necessary to understand and define the objectives for solving the problem
KPN is facing; it therefore contains activity two. Chapter four contains activities three
and four, design and development and the demonstration respectively. The information from
chapter three is implemented in the model, which is explained in chapter two. This is done
for the selected changes. Chapter five contains the evaluation activity, which compares the
new model and methodology with the old way of working. This is followed by chapter six,
containing the answers to the research questions and the conclusions. The last chapter
contains the communication activity as well as a reflection, in the form of recommendations
and limitations of the research. This structure can be seen in Figure 1.3.
Chapter 2
Background
This chapter covers the literature needed to understand the problem and the methods used
during the research. It helps put the problem and its possible solutions in context. The theory
needed to answer sub questions one and two is explained here.
The energy sector has a network and service comparable to the telecommunication sector.
Clients pay a monthly fee to receive the service: electricity and gas in the energy sector,
internet and mobile connectivity in telecommunications. This is provided by equipment and an
infrastructure network in the hands of the providers. A difference is that in the energy
market, the infrastructure network is often the responsibility of another party, while in the
telecommunication sector the infrastructure is a big part of the equipment. A large amount of
money is spent on maintenance of production equipment. For US industries, it is estimated
that over $300 billion is spent each year on maintenance, even though it has been shown that
a reduction in operating costs of about 40-60% is obtainable through maintenance strategies
(Dhillon, 2002).
That is why risk-based maintenance (RBM) methodology has been designed and implemented. It is
a tool used to reduce the chance of equipment failure, and thereby the probability of
consequences due to failure. RBM achieves this by supporting maintenance planning and
decision making (Krishnasamy, Khan, & Haddara, 2005). The idea is to identify the scope of
the system, make a risk assessment and perform a risk evaluation. Based on the results, the
maintenance planning can be optimised: the probability of failure is affected by changing the
maintenance interval, which in turn influences the risk. RBM can also be used to look at how
maintenance is done, with the purpose of understanding what leads to failure during
maintenance, what the consequences are and how these can be mitigated and/or prevented.
A sector that also uses RBM is the oil and gas sector. Offshore production platforms operate
wells and separate the fluid from those wells into oil, gas and water. This is done using
different kinds of machinery. Each of these machines can break down, slowing down the process
and potentially causing harm to people, nature or other equipment. In order to minimise the
risk of that happening and to optimise maintenance, RBM is used (Bhandari, Arzaghi, Abbassi,
Garaniya, & Khan, 2016).
Comparable to offshore oil platforms are offshore wind turbines. The operation and
maintenance costs of offshore wind turbines are a major contributor to the energy cost
(Nielsen & Sorensen, 2011). Corrective maintenance is mostly used, as it is the simplest
strategy. But because the malfunctioning of minor parts can damage bigger parts and thereby
increase the repair or replacement cost, it comes with greater uncertainty than preventive
maintenance. Research has shown that risk based preventive maintenance decreases the amount
of corrective maintenance, decreasing total costs (Sorensen, 2009).
Another business sector where risk management plays an important role is the construction
industry. Everyone knows of construction projects that have exceeded the budget and time
frame set for them. This is a major issue in the business and has therefore generated a lot
of attention; many books, papers and studies can be found on how to best manage risk for
construction projects. Many agree that risk assessment plays a critical role in managing risk
(KarimiAzari, Mousavi, Mousavi, & Hosseini, 2011). An important difference here is that
maintenance does not play a role in the risks. The industry does, however, make use of risk
assessment techniques to decrease and control risks.
The important difference between these sectors and the telecommunication sector is that most
risk management and maintenance planning in these sectors is done to reduce expenditure,
whereas the main motivation in the telecommunication sector for risk and maintenance planning
is to reduce impact on the client. There is a financial motivation as well, because failing
to deliver service to clients can lead to fines, but the main fear is the impact on
reputation. The telecommunication sector is very competitive and losing clients to the
competition because of a bad reputation is a real risk, as discussed in section 1.1.
Maintaining hardware, software and the network is therefore crucial. But maintenance
activities come with the risk of also impacting the clients. This situation shows how
important risk and maintenance planning is for telecommunication providers. It is therefore
surprising that there is almost no literature on risk management of maintenance planning in
this sector, while literature can be found on other forms of risk management in the
telecommunication sector. Examples of this focus mainly on information security, such as
"Sector-Based Improvement of the Information Security Risk Management Process in the Context
of Telecommunications Regulation" by Mayer et al. (Mayer, Aubert, Cholez, & Grandry, 2013).
Other forms of risk management include compliance, technical, reputational, competition,
health, country, asset impairment, liquidity, exchange rate, counterparty, interest rate,
equity, corporate governance, personnel, credit, market, weather and fraud risk, as discussed
in "Risk Management and Sustainable Development of Telecommunications Companies" by Gandini
et al. (Gandini, Bosetti, & Almici, 2014).
2.2 Risk based maintenance methodology
Risk management has been defined by ISO (the International Organisation for Standardisation)
as "coordinated activities to direct and control an organisation with regard to risk", where
risk is defined as an "effect of uncertainty on objectives" (ISO 31000:2018, 2018). A crucial
element of successful risk management is risk assessment, defined as the "overall process of
risk analysis and risk evaluation" (Rausand, 2013), where risk analysis consists of risk
identification and risk estimation, as can be seen in Figure 2.1 (White, 1995). This section
discusses RBM, a risk management method, in more detail. Certain methods that are relevant to
this research are expanded on.
2.2.1 RBM
As briefly mentioned in section 2.1, risk based maintenance methodology has been designed
to manage the risks and costs of maintenance more efficiently. It was first proposed in 2003
by Faisal Khan and Mahmoud Haddara in "Risk-based maintenance (RBM): a quantitative
approach for maintenance/inspection scheduling and planning" (Khan & Haddara, 2003).
According to this paper, the goal was to be able to answer five questions related to the
integrity and fault-free operation of the system:
Answering these five questions can lead to cost-effective maintenance and to minimising the
consequences of a failure. The way to answer these questions is to follow the proposed
methodology. It consists of three main modules: the risk estimation module, the risk
evaluation module and the maintenance planning module (Khan & Haddara, 2004). Each of these
modules consists of several steps. The modules can be split into two main phases: risk
assessment and maintenance planning based on risk. The first two modules are part of the risk
assessment phase and the third module is the maintenance planning (Arunraj & Maiti, 2007). It
is the risk assessment phase that is most relevant for this research. RBM was initially
designed for plants and equipment, making the maintenance planning phase less relevant for
the telecommunication market: most maintenance in plants has an internal impact, while
maintenance in the telecommunication sector often impacts the clients, and the planning phase
in RBM does not account for client impact. Therefore, the steps in the risk assessment phase
will be discussed in more detail.
appropriate depends on the cost and the availability of the necessary data or information.
Arunraj and Maiti share a list of risk analysis methodologies in their paper, categorised
into deterministic approach, probabilistic approach and a combination of both approaches
(Arunraj & Maiti, 2007). The methods that have been selected to be used in this research
are discussed in the following sections.
There are several difficulties when performing an FTA. Analysts need to have a good
understanding of the system, the cause-effect process and the possible failures (de Oliveira,
Marins, Rocha, & Salomon, 2017). The approach also assumes that each branch consists of
mutually exclusive, independent events. As failures seldom have a single cause, FTA can fail
to identify common cause failures (Ray-Bennett, 2018). Fault trees, as well as event trees,
are simplified models of systems. They cut systems down into detailed, separate parts, with
the consequence that emergent properties of the whole system can go unrecognised (Jain et
al., 2018).
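The independence assumption mentioned above is what makes fault-tree arithmetic simple: basic-event probabilities combine through AND gates as a product and through OR gates as the complement of the product of complements. The sketch below illustrates this with hypothetical events and probabilities for a failed network migration; it is not drawn from any actual KPN fault tree:

```python
# Minimal fault-tree gate calculations under the independence
# assumption. All events and probabilities are hypothetical.
from math import prod

def and_gate(probs):
    # All inputs must fail simultaneously (assumed independent).
    return prod(probs)

def or_gate(probs):
    # At least one input fails: 1 - P(no input fails).
    return 1 - prod(1 - p for p in probs)

# Top event: migration fails if the change script errors, OR both
# the primary and backup links drop during the change.
p_script_error = 0.02
p_both_links = and_gate([0.05, 0.05])        # both links down at once
p_top = or_gate([p_script_error, p_both_links])
print(round(p_top, 5))
```

Note how the common-cause weakness shows up here: if the two links share a power feed, they are not independent and the AND gate understates the true failure probability.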
sequent splits as probabilities (Lees, 2012). Just like with an FTA, the main difficulties
lie in the need for a thorough understanding of the system, the cause-effect process and the
possible failures.
Bowtie diagrams
The bowtie diagram is a combination of the FTA and the ETA and adds to them in the form
of barrier thinking (CGE Risk Management Solutions, 2015). The left part of the bowtie
diagram is a simplified fault tree analysis. This leads to the top event of the event tree,
which is shown in the right part of the bowtie diagram in a simplified form. Barriers are
added to both parts. In the left part, barriers are preventive measures, focused on
preventing the threats from happening. On the right, barriers are recovery measures, focused
on mitigating the consequences and/or the resulting losses and damage. Escalation factors and
escalation barriers can then be added; escalation factors are factors that make barriers
fail. The main goal of a bowtie diagram is to communicate the risk qualitatively. It is
possible to add quantification to a bowtie diagram, just like with an FTA or an ETA, but the
limitations of this need to be taken into account. Combined with risk matrices and other
techniques such as ALARP (As Low As Reasonably Practicable), an overview can be given of what
measures can best be taken to decrease risk and increase productivity.
• Hazard: While they have a negative connotation in daily life, a hazard in a bowtie
diagram is part of normal business. It is a situation that is necessary for a business
to work, but because of this situation there exists the possibility for harm to occur.
Examples of hazards are operating machinery or working with chemicals. They will not
lead to harm as long as they are under control, but there are ways for them to do harm.
• Top event: The event at which a hazard is no longer under control is called the top event:
the moment a normal situation turns into an abnormal situation. Nothing has happened yet,
but the company is exposed to potential harm. The situation can still be brought back under
control.
Figure 2.2: Example bowtie diagram showing all elements (CGE Risk Management Solutions,
2015)
• Threat: Factors leading to the top event are called threats. Threats lead directly to
top events and do that independently of each other.
• Consequence: Any unwanted scenario after the top event happening, caused by the
loss of control, is called a consequence. They are unwanted because consequences lead
to losses and/or damages.
• Barrier: To make sure the top event is not reached, so control is not lost, preventive
barriers can be put in place. These can take the form of hardware systems, design
aspects, human behaviour, etc. When the top event is reached and control is lost,
recovery barriers can be put in place to prevent consequences or mitigate them.
• Escalation Factor: These are the last facet of a basic bowtie. When preventive or
recovery barriers have been identified, escalation factors can be determined. These are
factors or conditions that make it more likely that a barrier fails. Escalation factors can
also have barriers put in place for them.
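The elements listed above fit together in a simple structure: threats and their preventive barriers on the left, consequences and their recovery barriers on the right, joined by the hazard and top event. The sketch below captures this as plain data classes; all example content (the migration hazard, barrier names, escalation factor) is hypothetical:

```python
# A minimal data-structure sketch of the bowtie elements described
# above. Example content is hypothetical, not a real KPN bowtie.
from dataclasses import dataclass, field

@dataclass
class Barrier:
    name: str
    escalation_factors: list = field(default_factory=list)

@dataclass
class Threat:
    name: str
    preventive_barriers: list = field(default_factory=list)

@dataclass
class Consequence:
    name: str
    recovery_barriers: list = field(default_factory=list)

@dataclass
class Bowtie:
    hazard: str
    top_event: str
    threats: list = field(default_factory=list)
    consequences: list = field(default_factory=list)

bt = Bowtie(
    hazard="Performing a live network migration",
    top_event="Loss of service during the change",
    threats=[Threat("Configuration error",
                    [Barrier("Peer review of change script",
                             escalation_factors=["Time pressure"])])],
    consequences=[Consequence("Clients lose connectivity",
                              [Barrier("Automatic rollback")])],
)
print(bt.top_event)
```

Even this bare structure already supports the uses described in the text: walking the threats shows which ones lack preventive barriers, and walking the consequences shows where recovery barriers are missing.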
A basic bowtie diagram containing all these parts can already be put to use. The process of
gathering the necessary information and putting it into the diagram gives an easy-to-
understand overview of the situation. Having identified the possible threats and the barriers
currently in place (or not in place) to prevent them shows where potential improvements can
be made. The same goes for the consequence side: seeing what the potential consequences are
and what is in place to prevent or mitigate them gives a good impression of where
improvements are possible. In case more information is necessary to make decisions, such as
the impact of certain consequences, more details can be added to the diagram.
the original risk (Baybutt, 2014). As the main idea of the bowtie diagram is to identify and
determine what barriers are in place and what barriers could be put in place, ALARP can
be used to see if the cost (time, money, trouble) of adding a new barrier is acceptable. This
is done by looking at the inherent risk level, the risk reduction gained by introducing a new
barrier for that risk and the cost in time, money and trouble needed to implement that new
barrier.
Combining the concept of ALARP with a bowtie is a process of five steps. Steps one to three
determine the current risks and risk reduction; steps four and five conduct the ALARP
evaluation.
3. Determine the residual risk by adjusting the inherent risk with the risk reduction of the
barriers.
4. Investigate additional barriers to reduce risks further and estimate the cost to implement
them.
5. Weigh the residual risk against risk reduction and cost of additional barriers to determine
if the residual risk is ALARP or additional barriers need to be implemented.
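Steps three to five above can be sketched numerically. In the illustration below, each existing barrier removes a fraction of the remaining risk (step 3), and a candidate new barrier is judged ALARP-acceptable to omit when its cost is larger than the monetised risk it would remove (steps 4 and 5). All numbers, the multiplicative reduction model, and the cost-benefit rule are hypothetical simplifications:

```python
# Sketch of ALARP steps 3-5 with hypothetical figures and a
# deliberately simple risk-reduction model.

def residual_risk(inherent: float, reductions: list) -> float:
    """Step 3: each barrier removes a fraction of the remaining risk."""
    risk = inherent
    for r in reductions:
        risk *= (1 - r)
    return risk

def is_alarp(residual: float, extra_reduction: float, cost: float,
             value_per_unit_risk: float) -> bool:
    """Steps 4-5: the risk is ALARP when the cost of a further barrier
    outweighs the value of the risk it would remove."""
    benefit = residual * extra_reduction * value_per_unit_risk
    return cost > benefit

# Inherent risk 10 units; two barriers reduce it by 50% and 40%.
risk = residual_risk(inherent=10.0, reductions=[0.5, 0.4])
print(risk, is_alarp(risk, extra_reduction=0.3, cost=5000,
                     value_per_unit_risk=1000))
```

In practice the grey area mentioned in the text means this comparison is a guide for discussion, not a mechanical accept/reject rule.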
To determine the inherent risk for the threat side (left side) of the diagram, two aspects
need to be taken into account: the likelihood of the threat occurring and its causal power.
The causal power of a threat is an indication of how likely the top event is to occur if the
threat occurs. These two combined give an idea of how serious a specific threat is. The
consequence side of the diagram is similar: the top event has the same likelihood for each
consequence, but a different causal power, which leads to a different likelihood score for
each consequence. Then comes the damage caused by each consequence in different categories;
the classic four categories are people, assets, environment and reputation. These scores are
then put into a risk matrix whose two axes are likelihood and severity. This is visualised in
Figure 2.3 and Figure 2.4.
Figure 2.3: Causality bow tie diagram (CGE Risk Management Solutions, 2015)
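The likelihood-times-causal-power scoring and the matrix lookup described above can be sketched as follows. The matrix cells, likelihood class boundaries and severity scale are illustrative placeholders, not the calibration any particular company would use:

```python
# Sketch: combine threat likelihood and causal power into a
# consequence likelihood, then look it up in a 5x5 risk matrix.
# Matrix content and class boundaries are illustrative only.

MATRIX = [  # rows: likelihood class 1-5, columns: severity 1-5
    ["low", "low", "low", "med", "med"],
    ["low", "low", "med", "med", "high"],
    ["low", "med", "med", "high", "high"],
    ["med", "med", "high", "high", "high"],
    ["med", "high", "high", "high", "high"],
]

def consequence_likelihood(threat_likelihood: float,
                           causal_power: float) -> float:
    # Probability the threat occurs AND escalates to this consequence.
    return threat_likelihood * causal_power

def rate(likelihood: float, severity: int) -> str:
    # Map a probability to a 1-5 likelihood class, then look up the cell.
    boundaries = [0.001, 0.01, 0.1, 0.5]  # illustrative class limits
    band = sum(likelihood > b for b in boundaries) + 1
    return MATRIX[band - 1][severity - 1]

p = consequence_likelihood(0.2, 0.5)
print(rate(p, severity=4))
```

Because each damage category (people, assets, environment, reputation) gets its own severity score, the same consequence can land in different matrix cells per category.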
Having determined the inherent risk, the next step is to determine the residual risk by
looking at the potential risk reduction from implementing new barriers. Adding a barrier to
the left side of the diagram will lower the likelihood of the top event or threat happening,
indirectly lowering the likelihood of all consequences. Adding a barrier to the right side of
the diagram will directly influence the consequence the barrier is added to: it will either
prevent the consequence from happening, or mitigate the damage done. Influencing the
likelihood reduces the risk of something happening, while mitigating measures impact the
scale or likelihood of the damage. Bringing this together and determining the cost of
implementing new barriers is the final step in deciding whether the risks are ALARP. If a
risk is determined to be ALARP, no action is necessary. When a risk is not considered ALARP,
a selection of barriers can be implemented. As this is not an exact science, there is a grey
area where the three variables (residual risk, estimated risk reduction and the cost to
implement) interact.
Figure 2.4: Example of a risk matrix (CGE Risk Management Solutions, 2015)
The bowtie diagram is, in its main function, a visualisation tool. A well-made bowtie
diagram should be able to communicate the threats, events, barriers and consequences to
somebody who is not an expert, creating a complete overview of either the current situation
or an improved situation of a hazard in the business. There are therefore many options to add
more detail to the diagram, such as classifications of the severity of consequences or the
effectiveness of barriers. While filling in these details can be useful to communicate more,
modellers should not forget that showing more details can also unnecessarily complicate the
model. The most useful and most used options are explained below:
create insight in how the company ensures adequate operation and availability.
• Barrier categories: Barriers can be classified into five categories: behavioural,
socio-technical, active hardware, continuous hardware and passive hardware.
• Barrier responsible person: The name of the responsible person can be added to a
barrier.
• Risk assessment: The risk assessment part can be added to consequences. This means
that for each risk matrix made, a level of risk can be determined for that consequence.
This is done for each damage category.
After having added all the information to the diagram, the modeller can choose what level
of detail to show. The information can also be exported in different parts if the amount of
information is cluttering the total picture.
Chapter 3
The following chapter discusses the case of KPN: the exact problems it is facing and the
objectives that need to be reached to solve those problems. The necessary case data is
explained, as well as where it originates from. This provides the answer to sub question two.
"We are increasingly moving towards a 24x7 economy. This increases the need to
always be able to use our services. KPN therefore wants advice on how we can
use our maintenance windows even better and thus serve our customers better.
The desired result is that this does not limit our migration pace, but can even
be accelerated in collaboration with our customers. Also consider the impact on
customers, processes, resources, technology and organisation."
There are many ways to approach this problem. Process optimisation, client side research
and risk based planning are a couple of examples. To decide whether RBM methodology might be
effective, the problems KPN is facing need to be clearly defined. The next step is to define
objectives that will help solve the problems defined in the first step.
3.1.1 Problems
The problem statement created by KPN was the starting point for this research. To determine
how many different parts the problem consists of, internal talks were held with different
departments. More than 30 people, spread over five departments, gave their view on the
problem. The conclusion of these talks is that there are many aspects to the problem, which
means many different possible solutions. The main aspects are client management (corporate
and consumer), contract management and process optimisation. Determining which aspect to
focus on was the following step.
Based on the researcher's education and experience, the impact and time frame of the
research, and the department issuing this research, the scope was set to process
optimisation. As the background of the researcher is neither legal nor business and the time
frame for the research is limited, client and contract management are left out of this
research. The department that issued the problem statement, MDD (Migration, Decommissioning
and Disassembly), manages and performs the changes. It therefore has a larger interest in the
process, as it makes use of it. Finally, there is much room for improvement in the current
processes, with potentially high impact on the objectives. The final step was to determine
which processes would be most interesting to optimise. In consultation with the department in
charge of the change processes (the service quality centre), the decision was made to focus
on change classification based on risks.
In the problem statement set out by KPN, the main problem KPN is facing is how to use its
maintenance windows more efficiently, in order to better serve its clients, while improving
or at least not impacting its maintenance pace. This problem consists of three important
parts.
The first part is making better use of the maintenance windows. Comparing the amount of time
available in the maintenance windows with the work that needs to be performed now and in the
future shows that maintenance windows run the risk of becoming bottlenecks for achieving
certain goals, such as phasing out old equipment and networks. An example is that KPN is
aiming to have all its clients migrated from the outdated copper network to newer networks
and to stop providing ISDN by 2021 (KPN, 2017). But as the number of maintenance windows is
limited and the time in each window is limited as well, achieving this will become more and
more difficult. Making better use of maintenance windows in this context can mean two things:
either creating more room in the maintenance windows, or using the time that is available
more efficiently.
The second part is to better serve the clients. This reflects two things: clients need to be
able to use the service they buy from the provider 24/7, but the provider also wants to
improve its network to provide a more stable and faster service. These goals are difficult to
combine, as performing maintenance to improve the network impacts client service provision,
which makes it difficult to perform a lot of maintenance work outside certain hours.
As comes forward quite clearly in the first two parts, the last part of the problem is not
limiting the migration pace, but rather increasing it if possible. In the current situation,
these three aspects clash, creating the problem KPN is facing.
3.1.2 Objectives
Now that the problems KPN is facing have been defined, objectives can be determined. Looking
at the three parts of the problem, each objective can be focused on one of them.
to increase the success rate.
Having objectives helps define what the artifact needs to achieve, which makes it possible to
select a method for creating the artifact. Increasing the number of maintenance activities
and improving the effectiveness of current activities, while decreasing the risk of impact on
clients, points to risk management methodologies (as discussed in section 2.1). This case
therefore lends itself well to researching whether RBM methodology is an effective method for
increasing efficiency while decreasing risks. Risk assessment will provide the insight
necessary to increase FTR percentages. The results can then be used to provide an overview of
the different risks and impacts. Comparing these results with current classification methods
can help increase the maintenance pace by changing the classification of changes.
1. Increase the success rate of changes by providing insight into threats and
consequences
By collecting, structuring and visualising information about the threats and conse-
quences, project managers gain understanding of what problems they need to focus on.
With that knowledge, project managers can better react to threats.
Providing insight can be used to increase the success rate of changes performed at KPN,
whereas calculating probabilities, impacts and risk ratings can help to minimise client impact
while increasing KPN's maintenance pace. Creating an artefact that can reach these
objectives would help reach the goals set by KPN. But even without tailoring the artefact
to KPN, both objectives are valuable for any telecommunication provider: detailed insight
into risks and the calculation of risks are essential for RBM in any sector. The artefact
therefore has value for applying the RBM methodology in the telecommunication sector and
helps solve the problem of efficiently organising change planning in the telecommunication
sector.
• DSLAM: stands for Digital Subscriber Line Access Multiplexer. It is a network device
that can connect numerous customer digital subscriber line (DSL) interfaces to a high-
speed digital communications channel using multiplexing techniques. A DSLAM acts
like a network switch and enables telecommunication providers to offer clients the fastest
phone-line technology combined with the fastest backbone network technology.
• CPE: stands for Customer-Premises Equipment. It is a terminal and its associated
equipment, located on a subscriber's premises and connected to the provider's telecom-
munications circuit. CPE refers to devices such as routers, network switches and internet
access gateways that enable users to access their provider's communication services.
• Mobile sites: Locations where antennae and electronic communications equipment are
placed to create a cell in a cellular network.
It is in the access layer that all customers are connected to the KPN network. These devices
are located in street cabinets and connect around 150 to 300 clients to the network. CPEs
connect big corporate clients to the network. The street cabinets and CPEs are in turn
connected to central offices, which are in the Metro-Access/Metro-Bridge layer. Depending
on the number of connections aggregated at a central office, it is either connected directly
to the Metro-Core or connected to another central office. When a central office is connected
directly to the Metro-Core, it is referred to as the Metro-Access layer; otherwise it is called
the Metro-Bridge layer. The Metro-Core layer is another aggregation step to bring connections
together. There are about 180 Metro-Core locations and about 1500 Metro-Access/Metro-Bridge
locations (1100 Metro-Access and 400 Metro-Bridge locations).
The Back-Bone layer is slowly being phased out as the ETN network is replaced by the
FCN network. Figure 3.1 shows both the new and the old architecture. On the right is
the old architecture, where the Metro-Core locations are set up in a ring, each aggregating
the connections of the previous one. As a result, more capacity is needed to reach the
ETN-Core. In the new architecture, a one-to-one connection between Metro-Core locations
and the Peta-Core is created, decreasing the capacity needed to transport the same number
of connections. The Peta-Core is located in the ZARA layer, the top layer of the network.
ZARA stands for Zwolle, Amsterdam, Rotterdam and Arnhem: the locations where the main
servers of KPN are located and from which all services provided by KPN are distributed
around the Netherlands. Below the top four layers is the OTP network, which stands for
"Optical transportation network". This network is used for fast and reliable connections
over long distances. The different networks like ETN, FCN and PETA regulate the data
transported over OTP. A good comparison to help understand the network is a highway:
the OTP is the highway, the logistical network, whereas ETN, FCN and PETA can be
compared to the traffic controllers.
Knowing the basics of how the network of KPN works helps to understand the scaling each
layer brings. This is important as each layer represents an increase in the amount of clients
impacted by maintenance work. It also provides some visualisation of where the migrations
discussed in this research take place and their potential impact.
3.2.2 Scope
In order to make a logical demarcation of which changes to look at during this research, three
factors were considered. The first factor is the department this research is being done for.
Migration, Decommissioning and Disassembly (MDD) is in charge of managing different types
of change activities, and migrating DSLAMs is one of its main activities. As discussed in
section 3.2.1, these are mostly located in the access layer, but some types of DSLAMs are
located in the Metro-Core and ZARA layers. It was therefore decided to demarcate the scope
to these types of migrations. The second factor that was taken into account was the availability
of information. The information needed for the risk model is not readily available for many
changes, because KPN, in its current system, does not use micro information to classify its
changes. If a change or maintenance activity causes any form of disturbance in service delivery,
it is never classified as a standard change, so project managers are not instructed to collect
micro information. Fortunately, it is available for the type of DSLAM migration MDD is
responsible for. The last factor was the representativeness of the research. Creating a risk
model for a type of change that is not representative of the changes done by KPN would not
be of scientific value. Because these migrations take place in different layers (see section 3.2.3)
and are the third most performed type of migration, the results can be considered representative.
Appendix E shows the distribution of change types at KPN.
Another choice made in regard to the scope of this research was to look at TI (technische
infrastructuur, technical infrastructure) and not IT. At KPN, TI is defined as fixed hardware.
This means that the changes this research looks at are hardware changes, where mechanics
have to travel to the location of the hardware to perform their work.
Figure 3.1: KPN network
When talking about IT at KPN, it means the change is software related. The work is often
done remotely, sometimes even outsourced to foreign countries. Technicians don't hinder
each other, and the work they do is rarely the reason maintenance windows are becoming
more and more packed. The MDD department also focuses on TI changes, making the
demarcation a logical choice.
1. 1-10G migrations: The main reason these migrations happen is capacity manage-
ment. The DSLAMs located in the access layer, Metro-Bridge layer and Metro-Core
layer need to be upgraded from 1G to 10G, because clients that have fibre to the home
need more capacity. The switches located in the DSLAMs are 'upgraded' to have 10G
gates, and during the migration clients are taken from 1G gates and plugged into 10G
gates.
on the same locations and DSLAMs. The difference is that the reason behind these
migrations is life cycle management. Clients are migrated from old switches to new
switches. The older models have a higher risk of malfunctioning, and repairing them
can be difficult because some are from old suppliers. Therefore it is better to replace
them with new hardware.
3. NT migrations: NANT cards are located in DSLAMs and, depending on the type of
card, regulate the number of uplinks. A DSLAM can be upgraded from two 1G uplinks
to four 1G uplinks to increase capacity; this happens when a NANT A card is replaced
by a NANT D card. When a DSLAM gets upgraded to 10G gates, the NANT card also
needs to be replaced, with a NANT E card. The main reason for these migrations is
therefore capacity management. They also help make the network more redundant:
when, for example, two of the four uplinks fail, the traffic can be rerouted over the
remaining two links.
4. WAP migrations: NWAPs are located in the ZARA layer. They connect the wholesale
parties to the Peta core of KPN. This is done for life cycle management. Wholesale
clients are disconnected from old hardware and connected to new hardware.
Each of these migration changes is a hazard: it is work that needs to be done on a daily
basis, but it can inflict damage if it goes wrong. They all share the same top event, the
event that leads to loss of control, which is the migration not being performed FTR.
3.2.4 Threats
For each of these migration changes, the threats leading to the change being performed nFTR
(not First Time Right) need to be identified. This information is an example of detailed data
that is not recorded as standard for every type of change. Project managers might have an
idea, but no detailed records. In this case, since 2018, an engineer/project manager at KPN
has recorded everything that led to these changes being performed nFTR. Barry Klasens
was partly in charge of organising the changes used as examples in this research. He recorded
the reasons that led to nFTR performed changes and organised them into categories. This
not only gives an overview of all the threats, but also a frequency of each threat for the four
migration change types. Listed below are all the possible threats with a short explanation
of what they entail.
1. Cabling: Any kind of problem with the cabling. This can be caused by bringing the
wrong cabling, not having it in stock or not having applied the cables in the right way.
2. BOP: BOP is the name of a piece of software that is needed during the migration.
When it malfunctions it can disrupt the migration.
3. DSLAM isolated: Because of the migration, some DSLAMs lose their connection and
can't function anymore. This means that a rollback is necessary.
4. Material: The wrong material has been brought to the job, or the right material is
not available due to supplier problems.
5. Gates: The gates are occupied, hindering the migration.
6. Pre-check: A problem has occurred during the pre-check or no pre-check has been
done. This means that the migration has to be rescheduled.
7. Switch: The switch is the hardware where all the cables are plugged in. It needs to be
activated before the cables are plugged in; when the cables are plugged in before this
has happened, no uplink can be created.
9. Migration running late: Because of minor problems during the process, the migration
takes longer and not everything can be done before the end of the window.
10. Fiber: Either the fiber is defective or there is another issue with the fiber.
11. Agama: Software used during migration. Sometimes no session can be held.
12. DHCP: Dynamic Host Configuration Protocol needs a connection to the server to func-
tion. Sometimes a connection with the server cannot be made, or a mistake is made
during the DHCP request process in preparation of the change.
13. Post-check: During the post-check a problem is discovered and the change has to be
rolled back.
14. Flashcard: The flashcard that needs to be installed does not have the correct software
on it. The result is that a mechanic has come to the job site for nothing, as the flashcard
first needs to be updated.
15. Mechanic: There are many reasons why mechanics can cause a nFTR. Everything
that has to do with mechanics being late, starting late or planning problems involving
mechanics is grouped under this threat, because these issues never lead to big problems
and are often unique or rare (car problems due to running over a deer, for example).
16. Cancelled: The change is cancelled for a variety of reasons (e.g. a freeze).
17. Engineering: This is the term used for part of the preparation. When it is not done
properly, the mechanic arrives but cannot perform his job.
18. Script error: During the migration a script is used, which can generate errors for
different reasons. These can be related to other systems, but also specifically to the
current change.
19. Bop Down: This code is used when BOP is completely down. It does not generate
errors, but cannot be used.
20. No Permission: Permission for the change is not given. This can be because of
misinformation, wrong input or a change in situation.
21. Incorrectly scheduled: Mistakes made during the scheduling of the change leading
to the change being rescheduled.
23. BOP CA: This error code occurs during preparation. CA stands for 'create alterna-
tive end points'; this error means the migration needs to be rescheduled. It happens
before a mechanic is sent out.
24. BOP SWAP: When BOP gives an error because of the change. This has nothing to
do with BOP itself: when there is a problem during the change, BOP gives an error,
which means a rollback is necessary.
25. Unknown: The reason for the change being performed nFTR is not known or has not
been properly recorded.
These are all the threats that have led to nFTR performance for the four hazards being
modelled. Appendix B contains a table showing, for each type of migration, the percentage
of times each threat has led to a nFTR performance. The percentage represents each
threat's share of the total failed migrations.
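The tallying of recorded nFTR reasons into per-threat shares can be sketched in a few lines. The record list below is invented for illustration and is not KPN data; the real tallies are the ones summarised in Appendix B.

```python
from collections import Counter

# Invented nFTR records for one migration type; the real 2018 tallies
# are the ones summarised in Appendix B.
nftr_records = ["Cabling", "BOP CA", "Mechanic", "Cabling", "Fiber", "Mechanic"]

def threat_shares(records):
    """Return each threat's share of the total failed migrations, in percent."""
    counts = Counter(records)
    total = sum(counts.values())
    return {threat: round(100 * n / total, 1) for threat, n in counts.items()}

print(threat_shares(nftr_records))
```

Grouping the raw reasons first and dividing by the total afterwards is what turns the engineer's free-form log into the comparable percentages used in the bow-tie models.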
3.2.5 Consequences
To define all the possible consequences of a nFTR performance, two important sources of
information are used. The most important are the performance reports, of which the critical
one is the 'Change triggered Be Alert' report. As discussed in section 1.3.3, this report
shows which changes have led to a Be Alert situation. Appendix A shows what the possible
Be Alert situations are. The second source is what the project manager has defined as a
consequence. This has led to the following list of consequences.
1. Reschedule: When a change has not been performed FTR but there is no impact on
service delivery to the clients, for example because a rollback was performed in time,
the change will be rescheduled. This can incur costs when a mechanic has been sent
out, and it always costs time, as the window is used to no effect and a new window
will be necessary to finish the change.
2. Be Alert Green
3. Be Alert Blue
4. Be Alert Yellow
5. Be Alert Orange
6. Be Alert Red
The Be Alert classification is explained in Appendix A. In any case, there will be an impact
on service delivery; depending on the severity of the Be Alert, damage will be done to
different critical aspects of KPN. The same document used to tally the threat percentages
has been used to calculate how often a specific threat has led to each of the consequences.
The vast majority have led to a reschedule; only on two occasions in 2018 has a change of
these types led to a Be Alert.
of Be-Alerts, the damages are external as well as internal. Based on the Be Alert matrix
defined by KPN in Appendix A, eight categories have been defined. Some have been grouped
together, as the requirements for each step of severity are equal.
3. Services secundair: These are less important services, like being able to pause live
TV or ordering movies through KPN. This is also measured in impacted clients, but
the threshold value is much higher than that of the primary services.
5. Cost: Comparable to reputation, every Be Alert has a cost associated with it. Depend-
ing on what has been impacted and for how long, certain fines can be imposed on KPN.
6. Security: Telecommunication providers are interesting targets for hackers. They can
either disrupt the lives of many people or companies by taking out services, or they can
try and steal data.
7. Business impact: When critical services are disrupted, this has a major impact. The
example of the police is related to business impact. Other critical services are described
in the ’Business Critical list’(BCL).
3.2.7 Barriers
The value of adding barriers to a bow-tie diagram is to show what measures have been taken
to counter certain threats and to mitigate certain consequences. As looking at changes at a
micro level is not common at KPN, specific barriers have not been recorded. There might
be some barriers in place, but that kind of data is not available. Standard preventive
measures, such as basic training for mechanics, are in place, but these do not add value to
the diagram the way that, for example, specialised training for a recurring problem would.
Therefore no barriers have been added to the models. One of the added values of creating
these bow-tie diagrams is that, by building the models, project managers gain insight into
which barriers might be useful for improving nFTR percentages. Examples of this are given
in Chapter 4, where the created models are discussed.
Chapter 4
Demonstration
This chapter contains the bow-tie models of the selected changes, the risk matrices and an
explanation of how to use the results to compare the risks and impacts of changes. First,
the four bow-tie models are explained; this relates to the first objective. Then the risk
matrices are explained. These are the same for all the changes and are essential for comparing
them; this relates to objective two. Together this provides the answer to sub-question three.
For both the bow-tie models and the risk matrices, some limitations are explained in the
first section.
4.1 Limitations
To understand the level of detail of the models and the numbers used to calculate the risk
ratings, some explanation has to be given about their nature. The following two subsections
explain the limitations of the models and of the data used during this research. The
limitations are not part of the method and are only relevant to this particular case.
The same reason applies to the lack of barriers. As the recorded data was limited, different
people in the organisation were consulted to try to increase the knowledge about this
system. The only form of barrier that could be discovered was the possibility of a rollback.
The possibility of performing a rollback, in case something does not work as intended, is used
in the current method to determine the risk classification. When a rollback is performed, the
final status of the change is always a reschedule. This means that it does not prevent a
consequence from happening; at best it prevents a Be-Alert by redirecting the consequence.
This is difficult to show in a bow tie and, more importantly, it does not add to this research,
because a rollback does not influence the probabilities: its effects are already incorporated
in the recorded results. It would only be useful if everything were recorded in more detail,
because then it could be used to influence the calculation and decision making.
In short, to be able to make more detailed bow-tie diagrams, more information needs to
be recorded at a micro level. The limited diagrams created in this research already prove
the worth of doing this. If this way of working is implemented for the whole organisation,
more detailed diagrams can be made, which in turn can provide more insight into how to
increase performance.
In order to make risk matrices, impact and probability information is necessary. By multi-
plying the impact with the probability of the impact happening, results are calculated that
can be used to rank the risks relative to each other. A complication for this research is
that the data needed for these risk matrices is sensitive KPN data. The probabilities of
damage occurring, and the scale on which that damage is measured internally at KPN, are
not supposed to be public knowledge. Therefore the results of these calculations need to be
anonymised. Table 4.1 shows an example of a risk matrix. This fictive example will be used
to explain how a risk matrix is made, as well as to show how the data is anonymised.
Rows zero to four represent the impact and columns A to E represent the probabilities. By
multiplying them with each other, the rest of the table is filled. The next step is to anonymise
the data. One effective way to do this is to scale the results between zero and one, also called
normalisation. The method used here is min-max feature scaling, shown in equation 4.1.
To calculate the normalised value (x') of an original value (x), subtract the minimum value
of the results from the original value and divide by the difference between the maximum
and minimum values of the results.
x' = (x − min(x)) / (max(x) − min(x))    (4.1)
The results of using this method can be seen in Table 4.2. The lowest value, in the top left
corner (A0), equals zero and the highest value, in the bottom right corner (E4), equals one.
Anonymising the results this way gives the option to hide the sensitive data while still
working with relevant results.
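The construction and normalisation of such a matrix can be sketched as follows. The impact and probability values below are invented, since the real scales are confidential; the mechanics of equation 4.1 are what matters here.

```python
import numpy as np

# Invented impact and probability scales; the real values are
# confidential KPN data, which is exactly why normalisation is used.
impacts = np.array([0, 1, 2, 4, 8])                  # rows 0-4
probabilities = np.array([0.1, 0.2, 0.4, 0.8, 1.6])  # columns A-E

# Fill the risk matrix: every impact times every probability.
risk = np.outer(impacts, probabilities)

# Min-max feature scaling (equation 4.1): map all results onto [0, 1].
normalised = (risk - risk.min()) / (risk.max() - risk.min())

print(normalised[0, 0])    # cell A0: the minimum, scaled to 0.0
print(normalised[-1, -1])  # cell E4: the maximum, scaled to 1.0
```

After scaling, only the relative ordering of the cells remains visible, which is exactly the property the anonymisation relies on.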
4.2.2 1G model
For the 1G migrations in 2018, sixteen threats were identified. The two main threats
are 'Mechanics' (14%) and 'Bop CA' (27%); combined they are responsible for 41% of the
nFTR performed migrations. The second group of threats, responsible for 16%, consists
of 'Cabling' (8%) and 'Fiber' (8%). This can be seen in Figure 4.2. Creating barriers
focused on these threats will be the first step in decreasing the nFTR percentage. Appendix
C.2 contains more charts for a complete overview.
Figure 4.1: 1-10G bow tie model Figure 4.2: 1G bow tie model
Figure 4.3: NT migration bow tie model Figure 4.4: WAP bow tie model
4.2.3 NT model
The NT migration bow-tie model differs from the others, as only four threats were iden-
tified for 2018. The large majority (98%) of the nFTR percentage is caused by three of the
four threats: 'Migrations running late' (31%), 'Flashcard' (29%) and 'Mechanics' (38%).
The last 2% is caused by 'Script error', which is not a recurring problem for this migration.
Improving on any of these three threats in the form of barriers will have a major influence.
The bow-tie model can be seen in Figure 4.3 and a more detailed overview can be found in
Appendix C.3.
4.2.4 WAP model
As can be seen in Figure 4.4, thirteen threats were identified in 2018. The main four threats
are 'Cabling' (25%), 'Fibers' (15%), 'Engineering' (10%) and 'Bop CA' (20%), totalling
70% of the nFTR percentage. Creating barriers for these threats can potentially improve
the nFTR percentage. The other 30% is caused by occasional problems and mistakes, which
are difficult to solve with barriers. For a better overview, see Appendix C.4.
The method selected to standardise these results is the Z-score. The reason is that the
results contain some outliers that influence the results, which makes the Z-score a more
reliable method for standardising them. The results will therefore first be normalised with
the min-max feature scaling method and then standardised by calculating the Z-scores.
The Z-score is calculated by subtracting the population mean from the raw score and
dividing the difference by the population's standard deviation, as shown in equation 4.2.
z = (x − µ) / σ    (4.2)
The Z-score results, shown in Table 4.3, represent the distance between the original value
(the 'raw score') and the population mean, measured in standard deviations. Z-scores
provide a way to compare the results to a normal population and can be used to determine
how a risk rating compares to the mean risk rating (Abdi, 2007). A negative score means
that the original value lies z standard deviations below the mean; in this case, scoring
lower than the mean means that the risk rating is safer. A positive score represents a value
above the mean and thus a riskier rating.
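Equation 4.2 applied to a set of normalised risk ratings can be sketched as follows. The ratings are invented, with one outlier to mimic the situation described in the text; note that the population standard deviation is used, matching equation 4.2.

```python
import numpy as np

def z_scores(values):
    """Standardise risk ratings (equation 4.2): z = (x - mean) / std.

    A negative z means the rating is safer than the population mean,
    a positive z means it is riskier.
    """
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()  # population std (ddof=0)

# Invented normalised ratings with one outlier, to mimic the data described.
ratings = [0.0, 0.02, 0.05, 0.08, 1.0]
print(z_scores(ratings).round(2))
```

The outlier pulls the mean up, so the bulk of the ratings end up with negative z-scores, which is the effect discussed for the risk matrices below row three.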
There are some potential problems with using z-scores. Using z-scores can reduce the
meaningfulness of the original 'raw' scores: the original results are often easier to interpret,
whereas z-scores are less intuitive. Another potential problem is that z-scores can magnify
small differences between the raw scores, thereby giving unintended weight to certain
results. In this case, the results have been anonymised, which has already taken away any
meaningfulness of the raw scores. As for magnifying small differences, the ranges of the
different damage categories are comparable. Because the ranges do not differ too much, this
does not cause a problem. Only when comparing the damage categories measured in euros
with damage categories measured in clients impacted does it have a minor influence. When
the z-scores of these damage categories were compared only with damage categories measured
in the same units, the results were altered slightly; the difference was minimal and did not
influence the end result. Therefore the use of the z-score method was deemed appropriate
for this research.
Table 4.3: Example of a standardised risk matrix
the table, the results are fairly close to each other up to row three. From there on, the
differences between the results become larger, showing that the highest few results can be
considered outliers. This justifies the decision to standardise the results by calculating the
Z-score, in order to better compare results.
Mean: 0.125
Standard deviation: 0.250
The second table, Table 4.5, shows the standardised scores calculated using the Z-score
method. Based on those results and the comparison made between all categories, the colouring
scheme has been adjusted, as explained in Appendix D. The initial colouring represents a
standard colouring scheme; the colouring used in Table 4.5 is the result of comparing the
results of the eight categories of damages. The best way to interpret these results is as
follows. If a change type scores in a green area for a damage category, the risks and impacts
are acceptable for the change to become a standard change. Yellow means that the risks
and impacts are slightly higher than what would normally be accepted, but depending on
the benefit of changing the classification, the risk might be worth taking. Orange means
that the risks are too high: they might improve in the future, but the change is not yet
ready for reclassification. Red scores are unlikely to become acceptable now or in the future,
and such changes are therefore better performed during maintenance windows.
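The four-colour interpretation can be sketched as a simple classifier. The threshold values used here are invented for illustration; the real cut-offs follow from the comparison of the eight damage categories described in Appendix D.

```python
def colour_for_z(z, thresholds=(-0.25, 0.0, 0.5)):
    """Map a standardised risk rating to the four-colour scheme.

    The threshold values are illustrative only; the real cut-offs
    come from comparing the eight damage categories (Appendix D).
    """
    green_max, yellow_max, orange_max = thresholds
    if z <= green_max:
        return "green"   # acceptable: candidate for standard change
    if z <= yellow_max:
        return "yellow"  # slightly elevated: reclassify only if worth it
    if z <= orange_max:
        return "orange"  # too risky now, may improve in the future
    return "red"         # keep inside maintenance windows

print(colour_for_z(-0.49))  # a rating well below the mean classifies as green
```

Keeping the thresholds as a parameter mirrors the adjustment step in Appendix D, where the cut-offs are tuned per damage category.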
4.3.4 Services: telefonie, internet, iTV, Digitenne, Wholesale transport &
access services
The results for the next damage category can be seen in Table 4.6. They are very similar to
the results for IT applications and mobile services, but have more green ratings in the first
two rows. This category is also measured in clients impacted. While this could be used to
argue that the impact, or the chance of impact, is smaller than for IT applications and
mobile services, one should always keep in mind that two different damage categories are
being compared. It is quite possible that IT applications and mobile services is simply
viewed as a less important damage category in general. As the results closely match the
first damage category, the same conclusion can be drawn about the outlying values.
Table 4.6: Services: telefonie, internet, iTV, Digitenne, Wholesale transport & access services
risk matrix
Mean: 0.124
Standard deviation: 0.250
Using the mean and standard deviation of these results shown in Table 4.6, the Z-score is
calculated. The results and the new colouring can be seen in Table 4.7. The Z-score results
are comparable, just like the normalised results.
Table 4.7: Services: telefonie, internet, iTV, Digitenne, Wholesale transport & access services
risk matrix Z-score
4.3.5 Services secundair
The failing of the secondary services that KPN provides, like pay-per-view and pausing live
television, is also measured in the number of clients impacted. Looking at the results in
Table 4.8 and the mean of the population, one can see that the risk ratings are higher than
in the other two damage categories.
Mean: 0.164
Standard deviation: 0.260
The Z-score has been calculated using this mean and standard deviation; the results can be
seen in Table 4.9. There is a small difference between this damage category and the first
two, because both the mean and the standard deviation are larger. This creates results
that are mainly green or red, or in other words, bigger variances between results.
4.3.6 Reputation
The choice was made to measure reputation damage in €. In the Be Alert classification
matrix it is defined by the size of the region (local, regional or national) or by the number
of clients. It falls under the category of KPN external damage, which also contains a scale
measured in €. Using the scale in € to define the impact of the different Be-Alerts gives
workable numbers for the calculations and effectively portrays the impact of reputational
damage as stated in the Be-Alert classification matrix.
Mean: 0.123
Standard deviation: 0.249
Surprisingly enough, Table 4.11 shows that the Z-score results are riskier on average. While
the variances between results are not that big, except for the highest results, the risk
ratings are higher to begin with, resulting in more yellow and orange results.
4.3.7 Cost
The cost damage category is exactly the same as the reputation damage category, as the Be
Alert classification matrix also measures it in € and in the same amounts. This means that
Tables 4.12 and 4.13 are copies of the Tables that represent the reputation category.
Mean: 0.123
Standard deviation: 0.249
4.3.8 Security
Just like the reputation and cost categories, security is measured in € and in the same
amounts, resulting in the same risk matrix for the original values and the Z-scores.
Mean: 0.123
Standard deviation: 0.249
4.3.9 Business impact
Defining the severity scale for business impact through the Be Alert classification is com-
plicated. This damage category focuses on critical services, which have been defined
internally and can be disrupted for either large corporate clients or normal consumers. For
a large corporate client, one client being impacted is already severe enough to count as a
Be Alert, but for normal consumers many more need to be impacted before it counts as a
Be-Alert. Calculating with corporate clients leads to illogical results, because the numbers
cannot represent the importance of the client. Therefore the decision was taken to use the
same severity scale as in section 4.3.4. The importance of KPN's core business and the
importance of the services defined as critical are comparable, and the last three steps in
the severity scale already match.
Mean: 0.164
Standard deviation: 0.260
This results in the same Z-score matrix as for services. For both these damage categories
it is important to note that while the risk ratings are relatively lower, and are therefore
rated less risky, the importance of the damage categories is not represented in the rating.
The impact for these damage categories is measured earlier; in other words, fewer clients
need to be impacted for it to count as a Be-Alert. This results in relatively lower scores.
The focus in these risk matrices is on the amount of impact, not on the importance of the
impact.
4.3.10 Services, telephony and internet for large businesses
Corporate clients not receiving core services is measured in the number of corporate clients
impacted. As these clients are considered more important than consumers, the severity scale starts
lower and ends lower than for the other damage categories. The initial normalised results
therefore score as less risky than other damage categories. This can be seen in Table 4.18
and in the lower mean of the results.
Table 4.18: Services, telephony and internet for large businesses risk matrix
Mean: 0.108
Standard deviation: 0.252
Because the mean of the results is lower than for the other categories while the standard deviation is
comparable, the z-scores are higher on average. This can be seen in Table 4.19. In
other words, changes that have any chance of impacting core services for corporate clients will
have a difficult time having their classification changed.
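The z-scores used throughout these matrices follow the standard normalisation z = (x − μ)/σ. A minimal sketch, using the means and standard deviations reported in this chapter (the raw value 0.30 is hypothetical, chosen purely for illustration), shows why a lower category mean with a comparable standard deviation yields higher z-scores for the same raw value:

```python
def z_score(raw: float, mean: float, std: float) -> float:
    """Standard normalisation: how many standard deviations a raw
    normalised risk value lies from the category mean."""
    return (raw - mean) / std

# Means and standard deviations as reported in this chapter:
# reputation/cost/security (0.123, 0.249) versus the corporate core
# services category of Table 4.18 (0.108, 0.252).
raw = 0.30  # hypothetical normalised risk value, illustration only
z_reputation = z_score(raw, mean=0.123, std=0.249)
z_corporate = z_score(raw, mean=0.108, std=0.252)

# With a lower mean and comparable standard deviation, the same raw
# value sits further above the mean, i.e. gets a higher z-score.
print(round(z_reputation, 2), round(z_corporate, 2))
```

This is why, as noted above, changes touching core services for corporate clients end up with relatively higher z-scores on average.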
4.4 Risk rating
Using the probabilities and the risk matrices, a rating can be given to each damage category
for each consequence. These can be seen in Figures 4.1, 4.2, 4.3 and 4.4. The eight squares
under each consequence represent the eight damage categories discussed in the sections above.
Based on how a consequence scores in the z-score matrices above, each of the eight squares is given
a colour and a rating. For example, if a consequence scores A1 for Reputation, this equals
a risk score of -0.49, based on the historic data of 2018. For each of these migrations,
the reschedule consequence has a very low chance of impacting any of the damage categories,
and if it did impact them, the impact would be minimal. It therefore scores A0 (low
chance of happening, low impact) in each of the risk matrices. Only for the damage category
'Costs' does it score differently: rescheduling incurs costs, but these are not very high.
And because a migration performed nFTR usually ends up being rescheduled, it
scores E0 (high chance of happening, low impact). This can be done for each of the migration
types and each of the damage categories based on the historic data of 2018. By adding the
results of the categories together and dividing by eight (the number of categories), a
risk rating per consequence is calculated. This can be seen in Table 4.20.
To help understand the process and calculations, a detailed example of the consequence
'Be-Alert Blue' for 1-10G migrations is given in the following paragraph. For each damage
category, the z-scores have been calculated in Section 4.3. Starting with the first damage
category, 'IT applications and mobile services': in 2018, the level of impact and its probability
were such that the z-score needed to calculate the risk rating is found
in the matrix at D1, which equals -0.49. Applying the same logic to the following seven damage
categories yields the z-scores -0.50, -0.60, -0.47, -0.47, -0.47, -0.60 and -0.42.
Adding these together gives a total score of
-4.02. Dividing by 8, the number of damage categories, provides the risk rating for
a 'Be-Alert Blue' during 1-10G migrations: -0.50. This number can be placed on the same
colouring scheme as the z-scores. Doing this for the four change types fills in a
large part of Table 4.20.
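The averaging step in this example can be sketched as follows; the eight values are the z-scores listed above (small rounding differences against the reported total are possible, as the matrix values are themselves rounded):

```python
# z-scores of the eight damage categories for the consequence
# 'Be-Alert Blue' during 1-10G migrations (from the 2018 matrices).
z_scores = [-0.49, -0.50, -0.60, -0.47, -0.47, -0.47, -0.60, -0.42]

# Risk rating per consequence: the mean of the eight category scores.
risk_rating = sum(z_scores) / len(z_scores)
print(round(risk_rating, 2))  # -0.5
```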
The following step is to calculate the final risk rating. The easiest way of calculating this
would be to add the ratings of the consequences together and divide by six (the number
of consequences). But this gives Be-Alert Yellow, Be-Alert Orange and Be-Alert Red a
major influence on the total risk rating, while these have never been caused by these change
types, and it is not very likely that they will be. A solution
is to look at how often each consequence has occurred and use that to decide
how heavily the rating of each consequence weighs. In the case of 1G and NT migrations,
for the whole of 2018, nFTR migrations have only led to reschedules. Therefore the total risk rating
equals 100% times the risk rating for a reschedule. For 1-10G and WAP migrations, 99.7% and
99.5% of the nFTR migrations respectively have led to reschedules. The other
0.3% and 0.5% have led to a Be-Alert Blue. This means that to calculate the total risk rating
of 1-10G migrations, 99.7% times the risk rating of a reschedule needs to be added to 0.3% times the
risk rating of a Be-Alert Blue. For WAP migrations, that means 99.5% times the risk rating
of a reschedule added to 0.5% times the risk rating of a Be-Alert Blue. For the 1-10G migration,
the formula to calculate the total risk rating looks like Equation 4.3.
(−0.52 × 0.997) + (−0.50 × 0.003) (4.3)
The overall results of these calculations for all four change types can be seen in Table
4.20. Based on these scores, change types can be compared and ranked from
most to least risky. Depending on what needs to be decided, these scores can be used to
create more insight into the risks and impacts of different changes, basing decision making in
the maintenance of telecommunication networks on probabilistic calculations instead
of one-time occurrences.
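The weighting described above amounts to a probability-weighted sum over the observed consequences; a sketch, with the 2018 shares and ratings quoted in this section:

```python
# Total risk rating as a probability-weighted sum over the observed
# consequences of nFTR migrations in 2018.
def total_risk_rating(weighted: list[tuple[float, float]]) -> float:
    """weighted: (share of nFTR outcomes, risk rating) per consequence."""
    return sum(share * rating for share, rating in weighted)

# 1-10G migrations: 99.7% reschedules, 0.3% Be-Alert Blue (Equation 4.3).
rating_1_10g = total_risk_rating([(0.997, -0.52), (0.003, -0.50)])

# For 1G and NT migrations, 100% of nFTR outcomes were reschedules,
# so the total simply equals the reschedule rating itself.
print(round(rating_1_10g, 2))  # -0.52
```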
Each of these change types scores the same kind of risk rating: no to low risk (green). Two
differences can be seen, namely the risk ratings for the Be-Alert Blue of the
1-10G and WAP migrations. That is because these two migration types each had a Be-Alert Blue
in 2018, increasing the probability of that happening compared to the
other two migrations. These scores can be used to compare the risks of different migrations.
This can be done by their overall rating, but also by their ratings for the different consequences
or even damage categories. Another interesting result is that the risk ratings of all the migrations
are very alike. The main reason is that these four types of changes fall under the
same domain; their impact on the damage categories is the same, as they enable
the delivery of the same services. This would not be the case if they were compared to
other types of changes. Another reason the results differ so little is
that these changes are rather successful. Be-Alerts Blue and even Be-Alerts Yellow are not
very rare; they happen regularly at KPN. But for these changes, only two Be-Alerts have been
recorded in more than a year's time, which has a big influence on the final risk rating.
Table 4.20: Risk ratings
4.5 Demonstration conclusion
This chapter demonstrates what can be done by modelling risks at a micro level. Creating the
bow-tie models provides the insight necessary to improve the success rate of changes,
while adding probability and impact severity makes it possible to compare different changes
numerically. Comparing changes based on these results can be very useful for defining
classifications, but needs to be done carefully. For each company, type of change, category
of damage and way of calculating, different opinions and preferences can influence
how the results are interpreted. The model therefore functions better as a tool to substantiate
discussion than as a source of clear-cut answers. For these four migration changes it has become
clear that if a change in classification can increase productivity meaningfully, the risks are
manageable. In the case that the maintenance windows are not too full and it might be useful
to perform only one or two changes during the day instead of during maintenance windows, a
comparison can be made. In this case the best advice would be to change the classification of
the changes in the following order: NT migrations and 1G migrations first, followed by 1-10G
migrations and then WAP migrations.
Chapter 5
Evaluation
In the previous chapter, the risk model and its results were demonstrated. This
chapter evaluates those results by comparing them to the current way of working. The
first objective of the risk model is to increase the success rate by providing insight into which
threats and consequences have a large share in the failures. To evaluate this, a comparison
is made for the four changes treated in this research between the results of 2018 and 2019.
The second objective of the risk model is to provide numerical substantiation for discussions
about change classification. This is evaluated by comparing the current ratings of the
changes to the results of the new risk model. The last part of this chapter evaluates
whether the objectives for KPN can be reached by implementing this risk model. By doing this, sub
question four is answered.
It was difficult to find this detailed information, as no other project manager collects it.
Only after contacting many different people was Barry Klasens suggested
as someone who might have the necessary data. Getting access to his data was not easy. Apparently
other colleagues had shown a similar interest in his data, and he did not feel at ease sharing the
details of the projects he was working on; his attitude suggested he feared being criticised or judged
based on his collected data. After some mediation, Barry agreed
to share his Excel records and explain his process. While explaining his process, he also
discussed how collecting this data helped him gain insight into what to focus on to solve the main
issues he was facing during changes.
Appendix B details the success rates, the failure rates and the share of each threat in the failed
change activities in 2018 and 2019. These tables can be used to compare the
results of 2018 and 2019, to determine whether the insight created by recording this data in 2018
has positively impacted the results in 2019.
The first comparison to be made is between the total results for each change type in 2018 and
2019. Table B.2 in Appendix Section B.1 shows the results for 2018 and is compared to
Table B.4 in Appendix Section B.2, which shows the results for 2019. The
success rates in 2018 for 1-10G, 1G, NT and WAP migrations were 63%, 85%, 87% and 62%
respectively. For 2019, the success rates are 74%, 87%, 89% and 59% respectively. Compared to
2018, this means an increase of 11 percentage points for 1-10G migrations, an
increase of 2 percentage points for 1G migrations, an increase of 2 percentage points for NT
migrations and a decrease of 3 percentage points for WAP migrations. Summed over these
changes, this amounts to a net increase in success rates of 12 percentage points in 2019.
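The year-on-year comparison amounts to simple percentage-point differences; a sketch with the rates quoted above:

```python
# Success rates per change type (percent), from Tables B.2 and B.4.
rates_2018 = {"1-10G": 63, "1G": 85, "NT": 87, "WAP": 62}
rates_2019 = {"1-10G": 74, "1G": 87, "NT": 89, "WAP": 59}

# Percentage-point change per change type, 2019 versus 2018.
deltas = {t: rates_2019[t] - rates_2018[t] for t in rates_2018}
print(deltas)                # {'1-10G': 11, '1G': 2, 'NT': 2, 'WAP': -3}
print(sum(deltas.values()))  # net change summed over the four types: 12
```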
The second comparison, which is more interesting for this research, is between the
threat percentages that record how often each threat has led to a failed migration.
Table B.1 in Appendix Section B.1 shows, for each threat and change type, how often it led
to a failed migration in 2018. Table B.3 in Appendix Section B.2 shows the same
for 2019.
14.3% to 0%, and the 'BOP CA' threat has decreased by 26.3 percentage points to 0.6%. It is important
to note that the results from 2019 are not complete, which has made the 'Unknown' threat
the most important threat in 2019. This may at a later stage be divided into other
threat categories, influencing the results presented here.
As Barry is the only project manager at KPN known to record micro details
about his changes, such as threats, there is no other source that can confirm the effectiveness
of recording micro data. But based on his data, recording and acting on micro data helps
improve FTR percentages. Recording micro data for all the changes performed by KPN and
structuring it in bow-tie models could therefore help increase the average FTR percentage for
KPN.
the second section will explain how this differs from the new method.
For each change activity that is planned, a runbook is created. This used to be an Excel
sheet sent to the Service Quality Centre (SQC), which then reviewed the runbook
and the organisation-wide planning in order to schedule the change activity. Quite recently this
has been integrated into the new ticketing system. In both cases, a risk and impact section
needs to be filled in to determine the risk classification of that specific change activity. This
risk classification must not be confused with the change type classification. The risk
classification is measured for each change activity separately and results in a classification of 'low',
'medium' or 'high'. This is comparable to what the risk model in this research calls
the risk rating. The change type classification applies to a whole type of change,
such as the four types used as examples in this research. These can be classified
as 'Standard' or 'Normal' (there are other classifications, but they are not relevant to this
research).
The three categories determining the risk classification are 'Impact', 'Complexity' and
'Experience'. Impact is classified as 'low', 'medium' or 'high' based on the number of clients
impacted in the worst-case scenario of that change activity. Complexity is classified as 'low',
'medium' or 'high' based on the number of parties involved and the different technical elements
being impacted. Finally, experience receives the same kind of classification based on whether
the change has been performed before, whether it has been performed successfully before and
whether it has been tested. The three categories are then combined into a risk classification.
This classification can further be influenced by the possibility of a rollback, the duration of
the change activity and whether the downtime for customers is below 30 minutes, resulting in
the final risk classification.
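The way the three categories and the mitigating factors interact is not specified in detail here, so the following sketch illustrates only one plausible combination rule; the `max`-based aggregation and the downgrade condition are assumptions for illustration, not KPN's actual rule:

```python
# Illustrative only: the real SQC combination rule is not documented in
# this chapter, so the aggregation below is an assumed approximation.
LEVELS = ["low", "medium", "high"]

def risk_classification(impact: str, complexity: str, experience: str,
                        rollback_possible: bool = False,
                        downtime_under_30_min: bool = False) -> str:
    """Combine the three category levels into one risk classification.

    Assumption: the classification is the worst (highest) of the three
    levels, downgraded one step when a rollback is possible and the
    customer downtime stays below 30 minutes.
    """
    level = max(LEVELS.index(c) for c in (impact, complexity, experience))
    if rollback_possible and downtime_under_30_min and level > 0:
        level -= 1  # mitigating factors soften the classification
    return LEVELS[level]

print(risk_classification("low", "medium", "low"))      # medium
print(risk_classification("low", "medium", "low",
                          rollback_possible=True,
                          downtime_under_30_min=True))  # low
```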
The main difference between the two processes is that the new method uses historic data to
give an overall risk classification to the change type, not only to a specific change activity as
in the currently used process. In the current process, the risk classification
and the change type classification are defined separately: a change type can only be classified
as standard if that type of change has caused no client impact for at least a year.
In the new process, a risk rating is calculated based on historic data, which can be used
as a reason to reclassify a change, as well as to give a measure of risk and impact.
Looking at the risk rating and risk classification specifically, the logic behind the calculations
is comparable. Impact is handled in more detail in the new process: it does not only take
impacted clients into account but, depending on the damage categories being compared, also looks
at the costs. Having the different damage categories also gives users more insight into what
exactly is being impacted. Complexity and experience are represented in the new process
by the historic data: based on past results, which are influenced by the experience
and complexity of the change, the probabilities for impact are redefined. Another important
difference is that the new process is measured on a four-point scale (green, yellow, orange and
red), instead of a three-point scale (low, medium and high). The four-point scale, like the
different damage categories, goes into more detail than the current process. This is
because the new process needs more detail in order to be usable as substantiation for
reclassifying change types. Also, the four-point scale is based on a numerical result, which
means that changes sharing the same classification can still be compared and ranked from
most to least risky.
When comparing the new risk ratings to the current risk classifications, the results match. This
means that for the four example change types (1-10G, 1G, NT and WAP migrations), the risk
classification and the risk rating are both low. These results have been discussed and
tested further in cooperation with two stakeholders from the SQC department, which manages
everything related to change requests in the organisation.
By varying the historic data used to calculate the risk rating and making similar changes in
the runbooks, the impact on the risk classification and the risk rating can be compared further.
For the four change types used in this research, this impact has proven comparable.
Conclusion so far
After performing and discussing this evaluation with the relevant stakeholders, the following
can be concluded about the second objective of the model. The risk ratings calculated
with the micro risk model for the four example change types have been tested and
overlap with the current risk classification results. This means that the way risks are calculated in
the risk model matches the expectations of KPN. But the new process provides a more
detailed insight on which decisions can be based. The different damage categories
and the numerical ratings give the means to discuss in more depth how risky a change
really is, compared to the current process. It can therefore be concluded that objective two
has been achieved in part. It does not, however, provide an idea of how usable this process is
for reclassifying changes, as that has not been evaluated.
To test how usable and effective this process is at reclassifying changes, the risk rating of a
standard change needs to be calculated. There is a list in which all standard changes are
recorded, and most of those are performed outside of maintenance windows. Some exceptions,
which have no client impact but are potentially very risky, are still performed in maintenance
windows. Calculating the risk rating of those changes using this process would be the perfect
test of how effective the process can be. The information needed for this
evaluation is a list of these changes and their historic data; the difficulty lies mainly in obtaining
the historic data to build the risk model.
After this step, another evaluation step of value would be measuring
the effectiveness of moving changes out of the maintenance windows. This could be measured
in multiple ways. Calculating the cost difference between doing the work during the day instead
of during the night provides the potential cost reduction. The second way effectiveness can
be measured is by seeing how much space is created in maintenance windows to perform other
change activities, potentially taking goals that are in danger of not being met on time out of
the danger zone. This would result in a more comprehensive evaluation.
Chapter 6
Conclusions
Based on the problem stated by KPN, the demarcation made for this research and its results,
a conclusion is given in this chapter. It covers the answers to the research questions
presented in the introduction, combining the theory with the demonstration
and evaluation and providing an answer to sub question five.
To identify which changes are good options, it is important to know what changes exist and how
they differ. KPN performs all kinds of changes on its network. These can be service,
infrastructure, maintenance or migration related. Each of these changes is classified based on
its risks and impact. The main classification types are standard and normal changes. The
difference between them is that standard changes have not caused client impact for
at least a year, whereas normal changes have. This means that standard changes do not
need to be performed during maintenance windows and are easier to plan. For changes
to be representative, they need to represent a substantial part of the KPN workload.
After determining what kinds of changes can be representative for KPN, it is important to
determine whether the necessary information for this research is available. The information
needed to create risk models consists of knowing what hazards, events, threats and consequences
exist for the changes. This means that project managers responsible for changes need
to collect this information over a period of time. This is not common practice at KPN, as this
information is not used: risks and impact are looked at only at a macro level, not at a micro
level. For the selected changes, the project manager has been collecting the needed information
since 2018. A total of four hazards, four events, twenty-five threats and six consequences
have been identified.
The next step is to identify the probabilities and impact. The probabilities can be derived
from the data collected throughout 2018. The number of changes performed
is compared to the numbers of successful and failed activities. These percentages can then
be split up per threat and by how often each threat has led to a consequence. To
determine impact, the Be-Alert classification matrix is used. This document describes the
different severities of Be-Alerts and their impact, which varies from impacted clients to
damage expressed in money. All this data is put into the model and used in the calculations
of the risk ratings.
To determine what barriers are in place, a micro approach needs to have been taken. Barriers
are systems or actions put in place to prevent threats from happening
or to mitigate consequences. If it is not clear which threats are responsible for failures, or what
the consequences of a failure are, it is not possible to consciously put barriers in place. As
a micro approach to risks and impact is not used at KPN, there is no information available
on barriers that have been put in place. One of the goals of the risk model is to provide the
insight necessary to place barriers effectively. For each of the four changes used as
examples in this research, areas of focus have been identified for potentially effective
barrier implementation.
Determining acceptable risks for KPN is not clear-cut. The goal is to determine a maximum
acceptable risk rating in order to compare changes with each other and determine the
classification. But this can differ per type of change, and it also depends on who interprets
the results. It is therefore not possible to set a maximum acceptable risk rating or range.
6.3 Sub question 3: How does a micro risk model help classify
changes differently?
The third sub question needs to be answered because change classification plays a big
role. Moving a change type from a normal to a standard classification means
that maintenance windows can be used more efficiently and the costs of activities go down.
Using a micro risk model should make this possible more often and should therefore be of value
to telecommunication providers.
The current way of determining whether a change type is classified as standard or normal is
by looking at the impact it has had in the last year. If a change has impacted clients
in the last year, it is classified as a normal change. This means that planned work of that
change type must be accompanied by a runbook and must be approved; it is then assigned
a place during a maintenance window. Standard changes do not have a complicated
approval process and are planned during the day instead of during a maintenance
window. Working this way does not take all aspects into account, with the result that
certain changes are classified as normal while they do not need to be. This is where
a micro risk model can help.
By using probabilities and impact during a risk assessment, an organisation can get a more
complete view of the risks. Identifying the actual probabilities of something going wrong and
actually impacting clients provides the information needed to make logical and efficient
decisions. Looking at the results of the example changes, it can be seen that changes are
rated yellow, slightly more risky. This means that changing the classification can be done
without taking a big risk. If changing the classification would be impactful, the use
of the risk model has created the opportunity to discuss this option. It can then be used
to compare the changes and select which change can best be reclassified as a standard change
while taking the least risk. In this case that would be either NT or 1G migrations.
This result is to be expected, as there were no Be-Alerts in 2018 for either of
these changes.
Improving the success rate of changes is straightforward: the fewer activities that need to
be rescheduled, the more work can be planned and performed. For each of the four example
changes, a selection of threats to focus on has been given.
Reducing those threats by implementing the right barriers can greatly improve the
success rate.
The data collected in 2018 for the four changes discussed in this research has
also been collected for 2019. This provides the means to test whether insight into threats
and consequences improves the success rate of change activity. By comparing the failure
rates for each change activity, and the causes of those failures, between 2018 and
2019, a decrease in failure rates has been measured. This can be seen in the total failure
percentages, which have decreased, but it is more interesting to look at the percentage share
of each threat. For example, in 2018, 19.3% of the failed 1-10G migrations
were due to problems with the 'Cabling'. In 2019, that decreased to 1.6%. This shows
that insight into the relevant threats can help project managers increase the success
rate by targeting the biggest threats. New threats can still
arise, impacting the success and failure rates, but knowing what those threats are and how
large their share is in the total failure rate provides the tools to react and improve quickly.
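The per-threat share used in this comparison is simply the fraction of failed migrations attributed to one threat; a sketch, where the failure counts are hypothetical, chosen only to reproduce the reported 'Cabling' shares (the real counts are in Tables B.1 and B.3):

```python
# Share of failed migrations attributed to one threat, per year.
def threat_share(failures_due_to_threat: int, total_failures: int) -> float:
    """Percentage of all failed migrations caused by a single threat."""
    return 100 * failures_due_to_threat / total_failures

# Hypothetical counts reproducing the reported 'Cabling' shares for
# 1-10G migrations: 19.3% in 2018 versus 1.6% in 2019.
share_2018 = threat_share(116, 600)
share_2019 = threat_share(8, 500)
print(round(share_2018, 1), round(share_2019, 1))
```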
The new classification method legitimises classification changes through the use
of numeric data. The first aspect of the model, increasing insight into what leads to failed
activities, is easy to understand and straightforward: identifying how often a certain threat leads
to failure is a simple way of measuring what needs to be improved. The complexity
increases when probabilities are combined with impact and the results are normalised. This does
provide numerical values that can be compared, but the values lack a clear context.
When people discuss units such as costs in money or numbers of impacted clients, users understand
what is being discussed. Normalised values, which express how much a raw value differs from the
mean in units of the standard deviation, do not convey a context or an order of magnitude. This
makes the method difficult to understand and use for non-specialist users.
This method does, however, provide a more in-depth analysis of the risks and impact associated
with a change. In the hands of an experienced user, it therefore legitimises the classification
of a change better than the current way of working.
on different parts of the organisation, it can be said that the use of the model provides more
legitimisation than the current methods. Having this detailed information also provides the
tools to influence how changes are classified, which in turn leads to more efficient
change planning. What can also be concluded with certainty is that working with a micro risk
model improves insight into how current change activity can be improved, which
positively influences the efficiency of change planning. All in all, using risk-based maintenance
methodology does provide tools for the telecommunication sector to improve its
maintenance planning.
Chapter 7
Using models and their results is never straightforward. Every model is a representation
of part of a real system and is therefore partly based on assumptions. The way
a model is used, the information put into it and its results are all subject to
interpretation. The research also faces several limitations. This chapter starts by giving
recommendations to KPN.
it can be tested on other changes. This step-by-step evaluation provides the context in which
the model can be used most efficiently.
Another evaluation step that would help determine the exact value of using this model is
calculating the exact profit of changing change type classifications. By calculating the gains
in costs and, more importantly, the gains in "room" during maintenance windows, a comprehensive
comparison can be made. This helps determine whether changing a certain classification is
worth the potential increase in risk. For example, changing the classification of a change type
that is performed only a couple of times a year might not be worth the increased risk run by KPN.
But if the change is performed more than 1,000 times a year, the increase
in risk might be worth the decrease in cost and the increase in maintenance window room.
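This trade-off can be made concrete with a back-of-the-envelope calculation; every number below is hypothetical and serves only to illustrate the reasoning, not KPN's actual costs:

```python
# Hypothetical figures, for illustration only: yearly saving from moving
# a change type out of the maintenance window versus the extra expected
# risk cost that the reclassification introduces.
def annual_benefit(executions_per_year: int, saving_per_change: float,
                   extra_risk_cost_per_year: float) -> float:
    """Net yearly benefit of reclassifying a change type."""
    return executions_per_year * saving_per_change - extra_risk_cost_per_year

# A change performed a couple of times a year barely moves the needle...
rare = annual_benefit(4, saving_per_change=150.0,
                      extra_risk_cost_per_year=2000.0)
# ...while a high-volume change can easily outweigh the added risk.
frequent = annual_benefit(1000, saving_per_change=150.0,
                          extra_risk_cost_per_year=2000.0)

print(rare < 0, frequent > 0)  # True True
```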
been proven efficient in other sectors. This is the first example of RBM, with some slight
adjustments, being implemented in this sector.
The main difference between implementing RBM in the telecommunication sector and in
other sectors is that RBM elsewhere focuses mainly on internal impact. For example,
if a machine breaks down in a plant, production is stopped or slowed down, which
results in fewer products made; this does not directly impact the clients of the product. In the
telecommunication sector, if maintenance or any adjustment to the network fails, clients
will be impacted. This is not taken into account in standard RBM. The second difference is
that RBM contains a planning module, which is used in all the other sectors. Based on the
risk assessment, the final result of implementing RBM is a maintenance plan that tells
the user when to perform maintenance on which machine, in order to decrease the risk of
production being disturbed and to increase the total output. As this planning considers only
internal impact, it is not useful for the telecommunication sector.
To incorporate impact on clients, the damage categories have been altered. In
general, four damage categories are most commonly used: People, Environment,
Assets and Reputation. Changing them to represent the different services delivered to
clients made the risk assessment more workable for the telecommunication sector. For the
final result, the planning module is not used; instead, the process ends with a risk rating per
change type. These ratings can be used to change conditions (such as change classification) that
affect the final change planning.
Using RBM to influence change planning in the telecommunication sector is interesting,
but there are some reasons why making it applicable there is challenging.
The fact that the impact is not internal, as in other sectors, means that
there are several restrictions an RBM implementer needs to deal with. There are, for example,
agreements and contracts with clients that complicate the matter. Another complication
is that when calculating the risk of a machine failing, the impact on the production process
is measurable and objective. This is not always the case in the telecommunication sector,
where neither the impact nor its severity is always clear. These challenges are not
a deal breaker for RBM in the telecommunication sector, but they are the reason that more
intensive evaluation is necessary before RBM can be embraced there.
Even if the results are not definitive, some promising results have been shown. A lot more
research has to be done in order to make a convincing case for RBM in the telecommunication
sector, but this start could be a trigger for more tests. The model itself, as well as the
calculations, is not groundbreaking. It is the process in which the model and calculations
are used that is interesting. The idea of focusing the process on a micro level, instead of a
macro level, is not common in the telecommunication sector. But the potential benefit of
working on a micro level warrants the time spent on researching and implementing micro
methodologies in this sector. This way of thinking about the maintenance issue in the
telecommunication sector is new.
7.3 Limitations of generalizability
The limitation of this research in the context of generalizability is that it is focused on one
provider. Because of the lack of information and data on this topic, most concepts originate
from KPN, as do the maintenance-related processes. This makes it difficult to use the results
for the whole sector. In discussing this matter with professionals working at KPN, the
conclusion was quickly reached that even though most of the information originates from
KPN, most other providers work in a comparable fashion. There might be slight differences,
but in general the logic behind maintenance planning is very comparable. While this is a
comforting thought, it is not the same as having confirmation from other providers that this
is really the case. It therefore remains difficult to generalise the results of this research; for
that to be legitimate, more research in this sector is necessary.
The main assumption made during this research in terms of generalizability is that working
on a micro level is not common practice for all telecommunication providers. If that is not
the case, the results and the model are not less applicable, but it would decrease the
innovative value of the research. The potential gains in the FTR percentages due to recording
and acting on micro data would still be relevant, as would giving risk ratings based on a
more detailed risk assessment. Another assumption is that all telecommunication providers
work with maintenance windows or a similar concept. If that is not the case, providing a
process that can help change risk classifications in order to unburden maintenance windows
loses part of its application value. A more detailed risk assessment can still retain some
value, but the goal for which it has been devised is lost in this case. The last major
assumption is that all telecommunication providers use risk and impact to classify their
changes. If they do not, but use another system, then the results are invalidated and the
model loses all its application. The model could still be used to start measuring risks and
impact, but for it to work properly, basic information about risks and impact needs to be
available.
change. This would put in context what the total impact of a classification change can be. It
would also make the trade-off between an increase in productivity and taking a bit more risk
more tangible. The reason that this would be of added value is that it provides a measure of
effectiveness, which can then be used to determine in which cases a change in classification
would be worth it. The reason this has not been added to this research is a lack of available
information and a lack of time to create this data.
Another complication in this regard is that one could choose to look at the total risk rating,
but one could also look at the risk ratings of the individual consequences. Where an overall
risk rating could communicate that a risk is not eligible for a classification change, looking at
the consequence risk ratings could say something different. In the examples used in this
research, the orange and red Be-alerts have had a big impact on the total risk rating. But
considering that these changes have never triggered any orange or red Be-alerts, and that the
risk of that happening is very close to 0, one could decide to change the overall risk rating to
almost no risk (green) instead of slightly more risky (yellow). The reason that the red and
orange Be-alerts have such an influence is that the impact used to calculate the rating
increases sharply in those categories of Be-alerts. While this is something that KPN likes to
have represented in its risk assessments, it does not necessarily result in the best advice.
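The effect described above can be sketched numerically. All probabilities and impact scores below are invented for illustration: a consequence that almost never occurs but carries a very high impact can dominate the overall rating, while the per-consequence view tells a different story.

```python
# Invented numbers: per-consequence risk contributions vs. the overall rating.
consequences = {          # (probability, impact score)
    "green Be-alert":  (0.10, 1),
    "yellow Be-alert": (0.02, 5),
    "orange Be-alert": (0.001, 50),
    "red Be-alert":    (0.0005, 1000),
}

per_consequence = {name: p * i for name, (p, i) in consequences.items()}
overall = sum(per_consequence.values())

# The red Be-alert contributes 0.5 of the 0.75 total, even though its
# probability is near zero; per consequence, the change looks far less risky.
print(per_consequence)
print(round(overall, 2))  # 0.75
```

This is exactly the interpretation question raised above: whether such a rare, high-impact consequence should be allowed to push the overall rating from green to yellow.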
In this research, no use has been made of weights to amplify certain aspects of changes. For
example, KPN wants to limit the down time its clients experience, yet in the performed risk
assessment the down time of the different changes has not been taken into account. Two of
the migrations have a maximum down time of five minutes, while the other two have a
maximum of one hour. This is a major difference, which could affect the possibility of
changing classifications.
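Such a weighting could be sketched as follows. The weight value and scores are assumptions for illustration, not something derived in this research:

```python
# Sketch: amplify the down-time aspect of a change with a weight, so that a
# one-hour maximum down time weighs heavier than a five-minute one.
def weighted_score(base_risk, max_downtime_min, downtime_weight=0.01):
    # Each minute of maximum down time adds downtime_weight to the score.
    return base_risk + downtime_weight * max_downtime_min

print(round(weighted_score(0.5, 5), 2))   # 0.55 for a 5-minute migration
print(round(weighted_score(0.5, 60), 2))  # 1.1 for a 1-hour migration
```

The choice of weight is itself a policy decision: it encodes how much KPN values limiting client down time relative to the other risk factors.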
In short, the way the results are created, interpreted and used depends on the users. This
means that organisations wanting to use a method like this will need in-house knowledge of
how to best interpret and use the results, which increases the difficulty of using this method
in comparison to the overly risk-preventive method currently used.
a scope to solve this problem. Other areas where research is necessary to solve this problem
are client management, contract management and process optimisation. These will be
discussed in more detail in the 'Further research' section.
References
of telecommunications companies.
Gericke, K., Klimentew, L., & Blessing, L. (2009). Measure and
failure cost analysis: selecting risk treatment strategies. In (p. 61-72).
Retrieved from https://www.researchgate.net/publication/237049565_Measure
_and_failure_cost_analysis_selecting_risk_treatment_strategies
Gregor, S., & Hevner, A. R. (2013). Positioning and presenting design science research for
maximum impact. MIS quarterly, 337–355.
Hanna, A., & Rance, S. (2011). Itil® glossary and abbreviations. ITIL officialsite. Retrieved
from www.itil-officialsite.com/Publications/PublicationAcknowledgements
.asp
Hevner, A., & Chatterjee, S. (2010). Design research in information systems: theory and
practice (Vol. 22). Springer Science & Business Media.
ISO 31000: 2018. (2018). Risk management–guidelines. International Organization for Stan-
dardization Geneva.
ISO/IEC. (2006). International standard ISO/IEC 14764, IEEE Std 14764-2006: Software
engineering - software life cycle processes - maintenance.
Jain, P., Pasman, H. J., Waldram, S., Pistikopoulos, E., & Mannan, M. S. (2018). Process
resilience analysis framework (praf): A systems approach for improved risk and safety
management. Journal of Loss Prevention in the Process Industries, 53 , 61–73.
Kamoun, F. (2005). Toward best maintenance practices in communications network manage-
ment. International Journal of Network Management, 15 (5), 321–334.
KarimiAzari, A., Mousavi, N., Mousavi, S. F., & Hosseini, S. (2011). Risk assessment
model selection in construction industry. Expert Systems with Applications, 38 (8),
9105 - 9111. Retrieved from http://www.sciencedirect.com/science/article/pii/
S0957417410014739 doi: https://doi.org/10.1016/j.eswa.2010.12.110
Khan, F. I., & Haddara, M. (2004). Risk-based maintenance (rbm): A new
approach for process plant inspection and maintenance. Process Safety Progress, 23 (4),
252–265. Retrieved from https://doi.org/10.1002/prs.10010 doi: 10.1002/prs
.10010
Khan, F. I., & Haddara, M. M. (2003). Risk-based maintenance (rbm): a quantita-
tive approach for maintenance/inspection scheduling and planning. Journal of Loss
Prevention in the Process Industries, 16 (6), 561 - 573. Retrieved from http://
www.sciencedirect.com/science/article/pii/S0950423003000949 doi: https://
doi.org/10.1016/j.jlp.2003.08.011
KPN. (2016, June 17). Change management process (2016). Retrieved from
http://teamkpn.kpnnet.org/group/documents/groep-cics/6848e311-bab6-4b03
-8318-366c4f026e6b
KPN. (2017). Van isdn naar kpn een. Retrieved from https://www.kpn.com/zakelijk/
blog/van-isdn-naar-kpn-een.htm
KPN. (2018, August 1). Be Alert classification matrix. Retrieved
from https://teamkpn.kpnnet.org/embeds-ajax/download-document/
lRjjM8DOyzmwnLfW26KI6iH0AHVfnmFzXh5k6czXPEWSeZeQdqpFAVeSkgW0-ZUC/
lBPMxSDc8kOwnLfW26KI6iH0AHVfnmFzXh5k6czXPEWSeZeQdqpFATXLaDpK0Sv_
Krishnasamy, L., Khan, F., & Haddara, M. (2005). Development of a risk-based maintenance
(rbm) strategy for a power-generating plant. Journal of Loss Prevention in the Process
Industries, 18 (2), 69 - 81. Retrieved from http://www.sciencedirect.com/science/
article/pii/S095042300500015X doi: https://doi.org/10.1016/j.jlp.2005.01.002
Kushnir, V. (1985). Risk: A probabilistic concept. Reliability Engineering, 10 (3),
183 - 188. Retrieved from http://www.sciencedirect.com/science/article/pii/
0143817485900204 doi: https://doi.org/10.1016/0143-8174(85)90020-4
Lees, F. (2012). Lees’ loss prevention in the process industries: Hazard identification, assess-
ment and control. Butterworth-Heinemann.
Lubritto, C., Petraglia, A., Vetromile, C., Curcuruto, S., Logorelli, M., Marsico, G., &
D’Onofrio, A. (2011). Energy and environmental aspects of mobile communication
systems. Energy, 36 (2), 1109–1114.
Mayer, N., Aubert, J., Cholez, H., & Grandry, E. (2013). Sector-based improvement of
the information security risk management process in the context of telecommunications
regulation. In European conference on software process improvement (pp. 13–24).
Musolesi, M. (2014). Big mobile data mining: Good or evil? IEEE Internet Computing,
18 (1), 78–81.
Nielsen, J. J., & Sorensen, J. D. (2011). On risk-based operation and maintenance of
offshore wind turbine components. Reliability Engineering & System Safety, 96 (1),
218 - 229. Retrieved from http://www.sciencedirect.com/science/article/pii/
S0951832010001705
Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2007). A design sci-
ence research methodology for information systems research. Journal of management
information systems, 24 (3), 45–77.
Rausand, M. (2013). Risk assessment: theory, methods, and applications (Vol. 115). John
Wiley & Sons.
Ray-Bennett, N. S. (2018). Systems failure revisited. In Avoidable deaths (pp. 79–107).
Springer.
Ruijters, E., & Stoelinga, M. (2015). Fault tree analysis: A survey of the state-of-the-art in
modeling, analysis and tools. Computer science review, 15 , 29–62.
Sarwar, M., & Soomro, T. R. (2013). Impact of smartphone’s on society. European journal
of scientific research, 98 (2), 216–226.
Sorensen, J. D. (2009). Framework for risk-based planning of operation and maintenance
for offshore wind turbines. Wind Energy, 12 (5), 493-506. Retrieved from https://
onlinelibrary.wiley.com/doi/abs/10.1002/we.344
Spiess, J., T’Joens, Y., Dragnea, R., Spencer, P., & Philippart, L. (2014). Using big data
to improve customer experience and business performance. Bell labs technical journal,
18 (4), 3–17.
Van den Poel, D., & Lariviere, B. (2004). Customer attrition analysis for financial services
using proportional hazard models. European journal of operational research, 157 (1),
196–217.
Verbeke, W., Dejaeger, K., Martens, D., Hur, J., & Baesens, B. (2012). New insights into
churn prediction in the telecommunication sector: A profit driven data mining approach.
European Journal of Operational Research, 218 (1), 211–229.
Vereecken, W., Van Heddeghem, W., Deruyck, M., Puype, B., Lannoo, B., Joseph, W., . . .
Demeester, P. (2011). Power consumption in telecommunication networks: overview
and reduction strategies. IEEE Communications Magazine, 49 (6), 62–69.
White, D. (1995). Application of systems thinking to risk management: a review of the
literature. Management Decision, 33 (10), 35–45.
Appendix A
The Be Alert classification matrix is an official document used by KPN (KPN, 2018). It
defines what classification of Be Alert is given to any kind of loss of service delivery. As
this official KPN document details information that is not supposed to be shared with the
public, the data has been removed; that is the reason that two of the three figures in this
appendix contain black boxes. While this prevents getting a complete idea of how Be-Alerts
are classified, it does give an idea of which services impact is measured on.
The table has been cut into three parts. The first part, Figure A.1, focuses on client impact.
Based on the number of impacted clients and the importance of the malfunctioning service, a
problem in service delivery is classified as one of the five Be Alert categories. The
classifications are based on the colours green, blue, yellow, orange and red, representing the
severity levels minor, moderate, significant, major and critical respectively. The second part,
Figure A.2, focuses on the impact on KPN. The classification is based on security impact,
damages in € and service disruption. The last part, Figure A.3, explains how to upscale a Be
Alert and when a Be Alert manager needs to consider this.
The different types of impact on the client can be seen in the left column of Figure A.1.
These are business impact, primary services, secondary services, government-related services
and IT applications. The number of impacted clients required for a given classification
depends on the importance of the service, and can go up to more than a million impacted
clients. These values are used in determining the risk ratings of different change types. For
government-related services, other measurements are used. But as government-related
services such as 1-1-2 are critical services, changes related to them will never be classified as
standard changes; no risks are to be taken with these services. Therefore they are less
relevant for this research. IT application impact is measured in impacted clients, as well as
accessibility of locations, accessibility of applications and down time.
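Since the actual thresholds in Figure A.1 are redacted, the classification logic can only be sketched with invented numbers. The shape of the rule is: the more important the service and the more clients impacted, the more severe the Be Alert colour. The tiers and client counts below are hypothetical.

```python
# Hypothetical thresholds: the real client counts are blacked out in the
# matrix, so these tiers and numbers are illustrative only.
THRESHOLDS = {  # service tier -> (minimum impacted clients, colour), descending
    "primary":   [(1_000_000, "red"), (100_000, "orange"),
                  (10_000, "yellow"), (1_000, "blue")],
    "secondary": [(1_000_000, "orange"), (100_000, "yellow"), (10_000, "blue")],
}

def classify_be_alert(service_tier, impacted_clients):
    for minimum, colour in THRESHOLDS[service_tier]:
        if impacted_clients >= minimum:
            return colour
    return "green"  # below all thresholds: a minor Be Alert

print(classify_be_alert("primary", 250_000))    # orange
print(classify_be_alert("secondary", 250_000))  # yellow
```

Note how the same outage (250,000 impacted clients) lands one severity level higher for a primary service than for a secondary one, which is the core mechanic of the matrix.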
Figure A.1: Be Alert classification matrix - Impact on the client
Figure A.2: Be Alert classification matrix - Impact on KPN
Figure A.2 shows the different conditions regarding Be Alert classifications for impact on
KPN. The types of impact on KPN are internal impact, external impact, security and legal
obligations. Loss in revenue and damages in claims are measured in €. Societal impact and
damage to the image of KPN due to failures of services are measured in the number of
clients affected and the area of effect. Security is split up into different types of attacks.
Governmental obligations are measured in quality of service and outages. For this research,
governmental obligations are not of importance, as changes that have an impact on these
services will never be classified as standard.
The last part of the table can be seen in Figure A.3. It shows when a Be Alert manager
needs to consider upscaling the classification.
Appendix B
Threat percentages
Table B.2 shows the total number of performed migrations, split up into failed and successful
migrations. This gives an idea of how many migrations of each type are performed during
maintenance windows, as well as showing how much improvement can be made in the
success rate.
Table B.2: Success and failure percentages 2018
B.2 Percentages 2019
Table B.3 shows an overview of all the threats for each migration in 2019. The percentages
for each threat represent their share in the total number of failed changes.
Table B.4 shows the total results for five types of migration changes. Four of those change
types also had their results measured in 2018. The results are shown in absolute numbers as
well as in percentages.
Appendix C
In this appendix, an overview is given for each type of migration of the importance of each
threat and consequence on different scales. First, a percentage is given that shows how often
a threat has led to a failed migration. This is then put in perspective by showing it next to
the number of successful migrations. The last chart shows how often a failed migration has
led to a certain consequence and how many migrations were successful.
C.1 1-10G
C.1.1 Threats percentages failed migrations
The first thing that can be seen is that five main threats make up almost 80% of the causes
of a failed 1-10G migration: cabling, BOP CA, BOP, material and fiber. The other threats
play a lesser role, as they only occur sporadically.
C.1.2 Threat percentages all migrations
The success rate of the 1-10G migrations in 2018 was 63.4%, making the failure rate 36.6%.
In this chart, the 36.6% is split up into all its causes and is shown next to the success rate.
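The conversion behind this chart is simple: a threat's share of the failed migrations is scaled by the overall failure rate to obtain its share of all migrations. A sketch with invented threat shares (only the 36.6% failure rate comes from the text):

```python
failure_rate = 0.366                     # 1-10G failure rate in 2018
threat_share_of_failures = {"cabling": 0.25, "BOP CA": 0.20, "other": 0.55}

# Scale each threat's share of the failures to a share of all migrations.
threat_share_of_all = {threat: share * failure_rate
                       for threat, share in threat_share_of_failures.items()}

# The scaled shares still sum to the overall failure rate of 36.6%.
print({t: round(s, 4) for t, s in threat_share_of_all.items()})
```

This is what makes the chart comparable to the success rate: both are expressed as fractions of all migrations performed.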
C.2 1G
C.2.1 Threats percentages failed migrations
In the case of 1G migrations, two threats play a big role: together accounting for 41.2% of
failures, problems with BOP CA and mechanics were the leading cause of a failed 1G
migration in 2018. The other 58.8% is made up of 18 different threats in varying amounts,
with cabling and fiber problems sticking out, just like with the 1-10G migrations.
C.2.2 Threat percentages all migrations
The second chart shows that, all in all, 1G migrations failed only 14.81% of the time, which
means that improving on the two main threats would bring the success rate up past 90%.
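This claim can be checked with a quick calculation using the figures from the text (a 14.81% failure rate, with 41.2% of failures caused by the two main threats):

```python
failure_rate = 0.1481      # share of 1G migrations that failed in 2018
main_threat_share = 0.412  # share of failures caused by BOP CA and mechanics

# If the two main threats were eliminated, only the remaining failures stay.
new_failure_rate = failure_rate * (1 - main_threat_share)
new_success_rate = 1 - new_failure_rate

print(round(new_success_rate * 100, 1))  # 91.3, past the 90% mark
```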
C.3 NT
C.3.1 Threats percentages failed migrations
NT migrations that failed in 2018 did so for one of four reasons, which can be seen in Figure
C.7. This makes it easier to improve performance, as solving or improving on one of the
three major causes of nFTR has a major impact on performance. The fourth cause, script
error, is the reason for nFTR only 2.4% of the time and can be regarded as a non-recurring
problem.
Figure C.8: Percentages per change
C.4 WAP
C.4.1 Threats percentages failed migrations
Looking at Figure C.10, there are three main causes of failure, four minor causes and a
couple of sporadic causes. The three main causes account for approximately 60% of the
failed migrations, and the minor causes for almost 32%. This means a broad focus is
necessary to bring the nFTR percentage down.
Figure C.11: Percentages per change
Appendix D
Different approaches were possible to determine the most logical colour scheme, starting with
the number of colours used to communicate different results. The choice to use four colours
instead of three was made because the goal of using this model is to increase insight into
risk ratings. By only using three colours to represent low, medium and high risk, some of the
subtlety of the risk and impact ratings is lost. The differences between low-risk and
medium-risk changes are sometimes not that pronounced, and using an extra colour to mark
that difference can be useful when discussing change type classifications. Not having the
tools and information to have such a discussion is one of the reasons for developing a more
micro risk model.
Another approach was to use all the results to define the colour scheme, instead of using
unique values only. The results using all the values instead of only the unique values did
not, however, match KPN's views on risk ratings. This is because recurring values, like the
ones for costs, reputation and security, simply make those recurring values weigh heavier.
This is not a good representation of the real situation and resulted in less logical results.
Therefore the decision was made to use unique values, as described in the beginning of this
appendix.
• Green - No to low risk: Risk ratings coloured green are almost risk free. Change
types rated as such should be the first options considered when wanting to change a
change type classification in order to create more room during maintenance windows.
• Yellow - Low to medium risk: Ratings coloured yellow are considered riskier.
These change types need improvement before their classification can be changed to
standard. If there is not enough time to improve them first, a calculation needs to be
made to see whether it is worth taking them out of the maintenance windows. A
decision can then be made based on the reduction in costs, the room created during
maintenance windows for other work, and the increase in risk.
• Orange - Medium to high risk: When a change type has a rating coloured orange,
many improvements need to be made. These change types are risky and need to be
performed during maintenance windows. In no case is it acceptable to try to change
the classification of these kinds of change types. There might be a chance that in the
long run improvements are made that positively influence the risk rating, but that
will take time.
• Red - High to unacceptable risk: Change types rated red are very risky and have
a big impact. There is no chance that these change types will improve enough in a
short time to become less risky. Even in the long run, it is probably better not to take
the risk and to keep performing them during maintenance windows.
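One way to derive four colour bands from the unique rating values, in the spirit of this appendix, is to split the sorted unique ratings at their quartiles. This is a sketch, not the thesis' exact procedure, and the rating values below are invented.

```python
import statistics

def make_colour_scale(ratings):
    # Duplicates would skew the bands, hence the unique values, as discussed.
    unique = sorted(set(ratings))
    q1, q2, q3 = statistics.quantiles(unique, n=4)

    def colour(rating):
        if rating <= q1:
            return "green"
        if rating <= q2:
            return "yellow"
        if rating <= q3:
            return "orange"
        return "red"
    return colour

colour = make_colour_scale([0.1, 0.1, 0.3, 0.3, 0.6, 0.9, 1.5, 2.0])
print(colour(0.2), colour(1.8))  # green red
```

Because the band boundaries are recomputed from the data, adding new change types automatically recalibrates what counts as green, yellow, orange or red.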
Appendix E
Representativity
The table shown in Figure E.1 has been created from an export of the ticketing system used
at KPN (ASTRID). This export contains all the tickets created in 2018 and the tickets
planned in 2019 up to November. After filtering out all the tickets of 2019, as the data used
in this research all originates from 2018, the table shown in Figure E.1 was created. The
four migration types used in this research fall in the domain 'Ethernet', which is the
highlighted value in the table. Comparing this value to the other values leads to the
conclusion that these kinds of changes are the third most common.
Figure E.1: Representativity