
Analytics in Incident Management: A Clustering Approach

Rahul Pant
Ericsson Group Analytics
Ericsson
Bangalore, India
[email protected]

Kalyana Chakravarthy Bedhu
Ericsson Group Analytics
Ericsson
Seattle, U.S.A.
kalyana.chakravarthy.bedhu@ericsson.com

Abstract— In a multinational setup there is a diversity of tools, technologies, hardware etc., and a vertical that takes care of issues or faults occurring in these systems or products and provides first level support. The focus is on resolving issues as quickly as possible; historically, less focus has been placed on categorizing issues properly, which fails to give an overall picture of the major issue categories. In this paper we look at the problem of identifying major issue groups for management focus by analyzing both unstructured incident text and structured metadata about incidents. Using natural language processing techniques, clustering approaches and the domain knowledge of IT Operations within our organization, we are able to identify the major clusters of issues and also the sub-clusters within them. This business case has hence provided actionable insight to the management.

Keywords—text mining, clustering, incident management

I. INTRODUCTION

The IT Incident Operations team within our organization supports a diverse landscape of tools, technologies, hardware etc. and takes care of issues or faults occurring in these systems or products. Within the operations unit, a front level support team raises issues on behalf of end-users and updates the ticket details in an Incident Reporting System. There is currently a management focus on reducing the number of high priority issues and on getting insights such as the major areas of concern and the trend of tickets over time.

The current method of identifying major areas of concern is based on human judgement, which is prone to error as well as bias. The incidents are analyzed manually by a team every month for reporting purposes and insights are derived. Certain features of each incident, like Location, Type of Incident, Reporter, Date/Time etc., are usually added for reporting purposes, but they do not utilize the unstructured text which contains the issue details.

Our approach to identify major groups of issues which need management focus is a combination of text cleansing based on domain knowledge, natural language processing techniques and unstructured text clustering. The techniques bear resemblance to the methods discussed in [1], [2], but add the business and domain context of an IT landscape.

The rest of the paper is organized as follows. In Section II, we briefly discuss the business problem and the data mining questions in hand. Section III gives a brief overview of the dataset and provides some descriptive statistics about the data. Section IV explains the methodology used and the results obtained. In Section V we discuss the business benefit of this application and the future scope of work, as well as the current limitations.

II. BUSINESS PROBLEM

Before proceeding to the analysis methods, we present a short description of the business problem at hand. The data for this problem is all the IT incidents raised by colleagues within the organization. The focus of the analysis was limited to the high priority and very high priority incidents, where management wanted more actionable insights. There are 1000+ global systems and services which are used by internal employees. In case there is a degradation of any sort, a support desk or support tool is available to raise an incident, and this queue of incoming issues is tracked every minute.

Looking at the text manually, the first level support assigns the ticket to a specific queue for action. In most scenarios it is the skill and experience of the individual/team which ensures a ticket is acted upon within time or assigned to the right queue. The focus till date had been towards improving the skill of individuals and teams rather than understanding what clusters of issues are being seen, so that appropriate focus can be put in certain areas. In summary, a reactive approach was in place, where an issue is solved as soon as it comes in, instead of a proactive one, i.e. understanding which area/system/service should be improved to reduce the tickets. With a data driven focus within the organization, and also due to cost optimization constraints, the management is asking the right questions about the improvement scope both in terms of processes and people. Their primary asks are:

1. Can we identify major groups of issues which need management focus?


2. Can we also identify sub-groups within those major groups of issues, to get a better understanding of the issues?

3. Can we understand how the clusters of issues have changed over time?

4. Can we put an automated system in place for queuing all incoming tickets?

For reasons of brevity, continuity of ideas and confidentiality, only the first two questions are discussed in this paper. Also, all references to the incidents, including any codes, numbers and internal names, have been obfuscated to dummy values for confidentiality purposes.
III. DESCRIPTION OF DATASET

The dataset is from the incident management system managed by the IT Operations team. This data is currently extracted on a daily basis to a data warehousing system; however, its usage is restricted to standard reporting to the management in a traditional Management Information System (MIS).

For the purpose of the study, the authors were given a snapshot of all existing tickets in the data warehouse for the last 2 years. The objective of the study was to identify a modelling technique which can then be applied to the live data in the data warehouse and thus solve the business data mining questions in hand. The snapshot was provided to give the researchers a stable set of records on which to identify the best models. Another source of data requested while working on this mining problem was about "Company Sites". This dataset had quality issues and was used partially. For the purpose of modelling, it was decided to use only Python, with RShiny and Tableau software for visualization and application deployment.

The incident data contained the following information:
• Incident Number
• Problem Text
• Location Information (Reporter Location)
• Incident Reporter Details
• Ticket Priority
• Product Reported

The Site data had the following fields:
• Site Name
• Site Location
• Other Site Features

In addition to the incident and site information above, the authors were given additional "master data" sets related to the tools and systems available within the organization. This dataset was not exhaustive and had to be modified and appended during the project lifecycle based on business input at different stages of the analysis.

The authors were advised by Subject Matter Experts (SMEs) from the business to ignore the 'Product Reported' field, since it is based on the guess of the reporter and was generally found not reliable by the resolution team.

A. Descriptive Statistics

The dataset consists of 10000+ incidents reported around the globe from 112 different countries. The site related issues spanned 2000 different sites, and the number of tools, systems and modules against which issues were reported was around 100.

The management requirement was to focus on issues of Critical and High priority in the first phase, and hence the authors restricted the dataset to the 3000+ such issues reported in the last 24 months.
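For illustration, the following is a minimal sketch of this scoping step, assuming the daily extract is loaded into a pandas DataFrame; the column names (priority, reported_at) and the sample rows are hypothetical stand-ins, as the real warehouse schema is internal.

    import pandas as pd

    # Hypothetical columns and rows; the real warehouse extract and schema are internal.
    now = pd.Timestamp.today()
    incidents = pd.DataFrame({
        "priority": ["Critical", "Medium", "High"],
        "reported_at": [now - pd.DateOffset(months=m) for m in (1, 6, 30)],
    })

    # Restrict to Critical/High priority tickets reported in the last 24 months.
    cutoff = now - pd.DateOffset(months=24)
    in_scope = incidents[
        incidents["priority"].isin(["Critical", "High"])
        & (incidents["reported_at"] >= cutoff)
    ]
    # Keeps only the Critical ticket from last month; the High one is 30 months old.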
IV. METHODOLOGY AND RESULTS

The focus of the analysis was to identify major clusters of issues which should be acted upon. However, significant metadata information was found missing across different tickets, so we focused primarily on the Incident Text to understand the following:

1. Are there any extractable patterns being used in tickets which can help identify groups of issues?
2. Compare the Problem Text with the resolution, to understand the ticket quality in general.
3. Are the systems, tools and location related details being mentioned in tickets?

The answers to the above questions mainly lay in subjective analysis; we randomly sampled 200 incidents and came to the following conclusions after discussing with the SMEs.

1. There are patterns defined for raising tickets; however, compliance to the format is inconsistent. Multiple teams follow their own formats, which vary by region and keep changing. Hence a global structure which can be mined is not defined. Even in 100 tickets we rarely saw a similarity in structure. A few examples of similar tickets with different patterns:
• [System Name]: Issue in connectivity
• No connection possible in [System Name]
• [Failure]: No SAP connect [System Name]

2. The Problem Text matched the actual issue closely, compared to the Product Issue defined by the first level support team. Hence analysis of the text looked like the best way forward.

3. The exact system or tool having the issue was usually part of the text.

4. The abbreviations used were mainly related to system names, and the rare occurrence of terms like RBS did not necessitate another metadata creation for our analysis.

Our conclusion from the qualitative analysis was that the incident text looked rich enough to deep dive into and to use existing clustering methods to find similarities between tickets.
As a first step of understanding the data, we looked at the word frequency cloud, as shown below.

Figure I: Text as Wordcloud
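For illustration, a minimal sketch of how such a word frequency cloud can be produced, assuming the Python wordcloud and matplotlib packages and a hypothetical list of ticket texts (ticket_texts):

    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    # Hypothetical ticket texts; in practice this is the full set of Problem Texts.
    ticket_texts = ["No connection possible in CRM", "Backup available, SAP job failed"]

    # Build a frequency-based word cloud over all tickets and display it.
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate(" ".join(ticket_texts))
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()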

"Backup" looked like a clear issue, however after SME At this point we had the textual data available in a clean
discussion it was understood that all global systems where format to initiate clustering. We refer to the literature around
ticket is raised by support, always mention whether we have Text Clustering[1] and papers related to text clustering in a
a “Backup Available” or not. Thus we excluded this, and different domain [6],[3]. We also referred to multip le
other similarr inputs provided by SME from our data. applications of hierarchical clustering in text [2][5]. After
careful reading of papers and keeping simplicity of
Other understandings from looking at word clouds and
implementation in mind, we restricted to Partitional(K -
similar visualization were:
Means) and Hierarchical Clustering options. The limitation
• Months are also being mentioned within ticket, of K-Means, although its much quicker, is its requirement to
maybe by automated alerting tickets know the number of clusters beforehand. However, we did
• Telecom co mpany names, our customers, are also not have an idea about the expected clusters.
mentioned in the tickets, and it was decide to extract such
information too, like Airtel, MTN etc. B. Modeling Technique
• Lemma and Stemming might help as we see terms Hierarchical Clustering was preferred over partitioning
like(available, availability), (perform, performance), (alert, methods as document hierarchical clustering builds a tree of
alerting, alerts) clusters.It can further be classified into agglomerative and
• City names are also frequent and identifying the city divisive approaches, which work in a bottomup and top -
and country information may provide further insights down fashion, respectively. In the author's case, they went
with the agglomerative clustering approach, which iteratively
We decided to extract the following informat ion from each
merges two most similar clusters.
ticket text wherever possible
As a first step a term frequency inverse document
• City Names , Country Names frequency(TFIDF) matrix for the prepared incident text was
• SAP System, non SAP System and Tool Names created. This step prepares the text as numerical input for
relative co mparison. Multip le parameter tuning steps and
• Company Names iterations were carried(in python sklearn) and we finally
This information should help in better cluster creation and went with the configurable items as shown in Table I:
exp loring the created clusters later from different
dimensions. Some information required matching with TABLE I.
available metadata(like Company Names), creating or Parameters
improving metadata( like City, Country) and using patterns
T oken Frequency Upper Limit 90%
to extract data (like System Names)
T oken Frequency Lower Limit 0.25%
A. Data Preparation for Clustering Inverse Document Frequency Used
Following steps were performed for preparing data for User Defined
clustering. T okenization
Function
• All IP's, common ly known extensions (.in,.co m, N-Gram 1
.companyname.se) were removed beforehand as Parameters for TFIDF Matrix Creation in Python
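The sketch below strings these steps together under simplifying assumptions: the entity map, synonym map, month list and stopword list are hypothetical stand-ins for the internal, SME-curated lookups, and the regular expressions only approximate the actual IP/extension handling.

    import re

    # Hypothetical SME-curated lookups; the real lists are internal.
    ENTITY_MAP = {"crmtool": "internalsystem", "bangalore": "location", "airtel": "ourcustomer"}
    SYNONYMS = {"connectivity": "connect", "connection": "connect", "collaboration": "collab"}
    MONTHS = {"jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"}
    STOPWORDS = {"the", "a", "an", "in", "of", "for", "is"}  # 'one' is deliberately not a stopword

    def clean_ticket(text: str) -> str:
        text = text.lower()                                          # lowercase
        text = re.sub(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", " ", text)     # drop IP addresses
        text = re.sub(r"\.(?:in|com|companyname\.se)\b", " ", text)  # drop known extensions
        text = re.sub(r"[^a-z\s]", " ", text)                        # special characters, numbers
        tokens = []
        for tok in text.split():
            tok = ENTITY_MAP.get(tok, tok)    # systems/cities/customers -> placeholder keywords
            tok = SYNONYMS.get(tok, tok)      # SME-provided similar words combined
            if tok in STOPWORDS or tok in MONTHS:
                continue
            tokens.append(tok)
        return " ".join(tokens)

    cleaned = clean_ticket("No Connectivity possible in CRMTool since 10.20.30.40, Jan")
    first_word = cleaned.split()[0] if cleaned else ""   # first word kept as an extra feature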
At this point we had the textual data available in a clean format to initiate clustering. We referred to the literature around text clustering [1] and papers related to text clustering in a different domain [6], [3]. We also referred to multiple applications of hierarchical clustering of text [2], [5]. After careful reading of the papers, and keeping simplicity of implementation in mind, we restricted ourselves to partitional (K-Means) and hierarchical clustering options. The limitation of K-Means, although it is much quicker, is its requirement to know the number of clusters beforehand; however, we did not have an idea of the expected number of clusters.

B. Modeling Technique

Hierarchical clustering was preferred over partitioning methods, as document hierarchical clustering builds a tree of clusters. It can further be classified into agglomerative and divisive approaches, which work in a bottom-up and top-down fashion, respectively. In the authors' case, they went with the agglomerative clustering approach, which iteratively merges the two most similar clusters.

As a first step, a term frequency-inverse document frequency (TF-IDF) matrix for the prepared incident text was created. This step prepares the text as numerical input for relative comparison. Multiple parameter tuning steps and iterations were carried out (in Python's sklearn), and we finally went with the configuration shown in Table I; a sketch of one such configuration follows the table.

TABLE I. Parameters for TF-IDF Matrix Creation in Python
Token Frequency Upper Limit: 90%
Token Frequency Lower Limit: 0.25%
Inverse Document Frequency: Used
Tokenization: User Defined Function
N-Gram: 1
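A minimal sketch of one such configuration with scikit-learn's TfidfVectorizer is shown below; the custom tokenizer is a hypothetical stand-in for the user defined function of Table I, and the two example tickets are placeholders for the cleaned incident texts.

    from sklearn.feature_extraction.text import TfidfVectorizer

    def custom_tokenizer(text):
        # Hypothetical stand-in for the user defined tokenization function.
        return text.split()

    vectorizer = TfidfVectorizer(
        max_df=0.90,             # Token Frequency Upper Limit: 90%
        min_df=0.0025,           # Token Frequency Lower Limit: 0.25%
        use_idf=True,            # Inverse Document Frequency: Used
        tokenizer=custom_tokenizer,
        ngram_range=(1, 1),      # N-Gram: 1
    )
    cleaned_tickets = ["no connect possible internalsystem", "internalsystem job failed location"]
    tfidf_matrix = vectorizer.fit_transform(cleaned_tickets)   # sparse (tickets x terms)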
We then computed and stored the cosine similarity of the documents in the above matrix. There are multiple linkage options for hierarchical clustering, such as single linkage, complete linkage, Ward linkage etc. Ward's method, or the minimal increase of sum-of-squares method, defines the proximity between two clusters as the magnitude by which the summed square in their joint cluster will be greater than the combined summed square in these two clusters. The clusters were found to be relatively homogeneous when using the Ward distance criterion. The last step in the clustering process was creating a dendrogram and finding the best number of clusters and sub-clusters.
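A minimal sketch of this step with SciPy, reusing tfidf_matrix from the previous sketch (in practice the matrix over all tickets), is shown below. Note that Ward linkage is formally defined for Euclidean distances, so applying it to cosine distances is a pragmatic approximation rather than a textbook-exact use of the criterion.

    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import squareform
    from sklearn.metrics.pairwise import cosine_distances
    import matplotlib.pyplot as plt

    # Pairwise cosine distances between ticket vectors (1 - cosine similarity).
    dist = cosine_distances(tfidf_matrix)

    # Condensed distance form expected by SciPy, then agglomerative Ward linkage.
    Z = linkage(squareform(dist, checks=False), method="ward")

    dendrogram(Z, no_labels=True)
    plt.title("Incident ticket dendrogram")
    plt.show()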
While doing the clustering, we also tested the results from K-Means to understand the types of clusters getting created. The results gave credence to our theory that there are sub-clusters within major clusters and that considering them as separate entities/groups would not be correct. An example of such groups is [Cluster 1: Access Slow, Cluster 2: Connect Timeout, Cluster 3: Connect Failed]. The issues in these three clusters belong to the same major category of Connection Issues. Hence the hierarchical clustering approach was deemed fit for the problem at hand.
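For completeness, a minimal sketch of the K-Means comparison run on the same TF-IDF matrix; the cluster count here is an assumed value, which is precisely the limitation of K-Means noted earlier.

    from sklearn.cluster import KMeans

    # The cluster count must be fixed up front; 20 is only an assumed value.
    k = min(20, tfidf_matrix.shape[0])
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans_labels = kmeans.fit_predict(tfidf_matrix)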
Visual inspection of the dendrogram and quality analysis of the clusters and sub-clusters for homogeneity was done to identify the best two cut points for creating the clusters and sub-clusters, as sketched below.
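A minimal sketch of cutting the Ward linkage from the earlier sketch at two heights, a coarse cut for major clusters and a finer cut for sub-clusters; the two thresholds are hypothetical values standing in for the cut points chosen by visual inspection.

    from scipy.cluster.hierarchy import fcluster

    # Hypothetical cut heights chosen from the dendrogram inspection.
    MAJOR_CUT, SUB_CUT = 2.5, 1.0

    major_labels = fcluster(Z, t=MAJOR_CUT, criterion="distance")   # major clusters
    sub_labels = fcluster(Z, t=SUB_CUT, criterion="distance")       # sub-clusters within them

    # Each ticket gets a (major, sub) pair of labels.
    ticket_groups = list(zip(major_labels, sub_labels))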

C. Cluster Naming and Quality Analysis


Looking at the incident text data for cluster homogeneity was a tough manual task, and we attempted a few simple semi-automated validation techniques (a sketch follows below). Many tickets had the most useful information in the first 1-2 words of the ticket, hence we extracted these as separate features. We also created a list of keywords or action words from the term frequencies (with manual pruning) and added another feature with the keywords identified in each ticket. The cluster results were then validated by looking at these keywords and the first words. This approach, although not a concrete evaluation metric like the variance explained by the clusters, was considered a reliable evaluation method by the subject matter experts.
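A minimal sketch of these two validation features is shown below; the action-word list is a hypothetical stand-in for the manually pruned keyword list, and the helper names are illustrative.

    from collections import Counter

    # Hypothetical pruned keyword/action-word list derived from term frequencies.
    ACTION_KEYWORDS = {"connect", "slow", "timeout", "failed", "access", "backup"}

    def naming_features(cleaned_ticket: str):
        tokens = cleaned_ticket.split()
        first_words = " ".join(tokens[:2])                    # first 1-2 words of the ticket
        keywords = sorted(set(tokens) & ACTION_KEYWORDS)      # action words present
        return first_words, keywords

    # Dominant first words per cluster, used to suggest and sanity-check a cluster name.
    def summarize_cluster(tickets_in_cluster):
        counts = Counter(naming_features(t)[0] for t in tickets_in_cluster)
        return counts.most_common(3)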
D. System Architecture

The purpose of the design is to model the system before its creation; here we present that design. The system is mainly divided into four components: one for domain knowledge, the next two for cleaning and pre-processing, and the last for clustering incidents and finding subgroups (Figure II). A sketch of how the components chain together is given below the figure.

Figure II: System Architecture
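A minimal sketch of how the four components chain together, reusing the helpers from the earlier sketches (clean_ticket, vectorizer) and hypothetical cut heights; it illustrates the flow in Figure II rather than the deployed implementation.

    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform
    from sklearn.metrics.pairwise import cosine_distances

    def run_pipeline(raw_tickets):
        # Components 1-2: domain-knowledge lookups applied during text cleaning.
        cleaned = [clean_ticket(t) for t in raw_tickets]
        # Component 3: TF-IDF pre-processing.
        matrix = vectorizer.fit_transform(cleaned)
        # Component 4: hierarchical clustering and the two dendrogram cuts.
        dist = squareform(cosine_distances(matrix), checks=False)
        Z = linkage(dist, method="ward")
        major = fcluster(Z, t=2.5, criterion="distance")   # hypothetical cut heights
        sub = fcluster(Z, t=1.0, criterion="distance")
        return list(zip(major, sub))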

E. Results

A total of 20 clusters were identified, with a total of 40+ sub-clusters, some of which are shown in Table II. Ten parent clusters had only a single child (sub-cluster), and the "WAN Issues" cluster had the maximum number of sub-clusters. Since "WAN Issues" are usually a bigger chunk of the issues that the operations team has to handle, the results looked encouraging and received positive feedback from the business. A few clusters and sub-clusters are shown here; others are not shown due to business constraints. The results were published as a Tableau Dashboard for easier consumption and analysis; a snapshot is shown in Figure III.

Table II: Clusters and Sub-Clusters

Figure III: Major Clusters


V. SUMMARY AND CONCLUSION

The acceptance of the solution within the organization was a significant achievement from the authors' perspective. It helps business units within the organization realize the effectiveness of the techniques and methods available in the field of machine learning. It also helps our analytics unit push forward the agenda of data-driven thinking.
Our current work, though, is not without limitations. From a research perspective, many options are available which could improve the current results. We have looked at hard clustering techniques and have not looked into the realm of soft clustering, due to the limited time at hand. Additionally, we note here that two groups having a significant number of incidents could not be clearly labelled; this is in scope for future work.

The current technique also needs to be evaluated over time, to understand whether changes in the clusters over time are being captured. A larger dataset of historical data has also been requested for further evaluation.

VI. ACKNOWLEDGEMENTS

The authors would like to extend thanks to all SMEs within the IT Operations team for their data and business knowledge, in particular Mr. Kunjan Sharma and Mr. Akhilesh Kr. Sinha. Without their support and inputs this use-case would never have started off, let alone been completed.

VII. REFERENCES

[1] Aggarwal, C. C. and Zhai, C. (2012). A Survey of Text Clustering Algorithms. In: Aggarwal, C. and Zhai, C. (eds), Mining Text Data. Springer, Boston, MA.

[2] El Hamdouchi, A. and Willett, P. (1989). Comparison of Hierarchic Agglomerative Clustering Methods for Document Retrieval. The Computer Journal, vol. 32, no. 3, pp. 220-227. 10.1093/comjnl/32.3.220.

[3] Zainol, Z., Marzukhi, S., Nohuddin, P., Noormanshah, W. and Zakaria, O. (2017). Document Clustering in Military Explicit Knowledge: A Study on Peacekeeping Documents. pp. 175-184. 10.1007/978-3-319-70010-6_17.

[4] Reddy, S., Kinnicutt, P. and Lee, R. (2016). Text Document Clustering: The Application of Cluster Analysis to Textual Document. 10.1109/CSCI.2016.0222.

[5] Fung, B. C. M., Wang, K. and Ester, M. (2003). Hierarchical Document Clustering Using Frequent Itemsets. In 2003 SIAM International Conference on Data Mining.

[6] Intel White Paper: Reducing Client Incidents through Big Data Predictive Analytics.

[7] Beil, F., Ester, M. and Xu, X. (2002). Frequent Term-based Text Clustering. Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 10.1145/775047.775110.

[8] Hotho, A., Nürnberger, A. and Paaß, G. (2005). A Brief Survey of Text Mining. LDV Forum - GLDV Journal for Computational Linguistics and Language Technology.

[9] Murali Krishna, S. and Bhavani, D. S. (2010). An Efficient Approach for Text Clustering based on Frequent Itemsets. European Journal of Scientific Research, vol. 42, no. 3, pp. 385-396.

[10] Li, T.-H., Liu, R., Sukaviriya, N., Li, Y., Yang, J., Sandin, M. and Lee, J. (2014). Incident Ticket Analytics for IT Application Management Services. 2014 IEEE International Conference on Services Computing (SCC 2014). 10.1109/SCC.2014.80.
