11. Prasad et al., 2018, Intelligent Autonomous Systems for Software Engineering - An Example

Abstract: In this ever-growing competitive world, large corporate IT systems traditionally need substantial effort and cost to maintain and manage; this systems engineering discipline is called IT Service Management. With the advancement of Artificial Intelligence (AI), IT service management can draw benefits and drive efficiency in the way an incident or ticket is managed. AI can not only help in the incident resolution steps (incident creation, recording, response, resolution and closure), but can also ensure that the incident does not recur. This opportunity is the focus of this paper. The paper discusses AI concepts starting from knowledge management and corpus creation through to defining machine learning algorithms that automate all service management actions. Readers will learn how to apply AI in IT Service Management and identify approaches to define organization-specific machine learning algorithms. The paper also provides a bird's-eye view of a structured approach for applying AI in various IT fields, with service management as an example field.

I. INTRODUCTION

To optimize the IT spend on maintenance and divert it to digitization and business transformation initiatives, there is currently a significant focus on automating IT service management. IT Service Management cost lies mostly in monitoring the applications and resolving user issues manually. With the advances of AI for Software Engineering, there is a significant focus on reducing manual intervention in resolving tickets and in monitoring, so that the support team can focus on adding more value for clients by analyzing the business relevance of applications.

The tickets for application maintenance can be either events/alerts from the system or user queries. The way optimization and improvement can be achieved differs widely between these two categories. Artificial Intelligence can equally be applied in automating both preventive and reactive maintenance.

Operational alerts/events from the system that are brought into a big data platform are normalized event data, so the classification algorithms need not consider natural language processing. By applying AI, there is a strong possibility of predicting which events/alerts can become incidents, which makes auto recording and preempting such incidents possible. This is hence an area where “Zero manual monitoring” and “Zero effort maintenance” can be targeted. Manual monitoring of various environments and applications can be automated.

However, for user queries and incidents generated by IT staff, natural language processing and classification are critical to identify tickets that can be auto healed. Organizations that have knowledge management mechanisms can make them part of this solution, where a speech bot/chat bot can assist the support engineer by guiding the analysis and identifying the right resolution. Three clear areas of AI focus are relevant here: (1) interaction or discussion services that convert Speech to Text or unstructured communication modes (Email, Text) to structured data, (2) Natural Language classification to derive intelligence out of unstructured data and (3) self-learning algorithms to improve resolution confidence through Machine Learning.

Figure 1 Auto (Record, Triage & Resolve)

The above diagram (Figure 1) is a high-level view of where further optimizations can be brought in; these are detailed in the following sections.

II. AUTO RECORD

A. Enabling Service Desk for Complete and Detailed Incident Recording

In service management, one common problem is an incomplete or vague description of the incident/ticket. This causes the service desk, or what is called the level 2 ticket resolution team, to go back to the user multiple times to get to the exact issue and root cause. The current analysis shows that close to 50% of the turnaround time of tickets is wasted in waiting for this additional information.

Currently, the information required for impact analysis can be asked for only by the application subject matter experts. The queries asked for each incident and their responses are tagged as additional information for each ticket.

It is important to enable the service desk to ask the user the relevant questions while the ticket is raised. The following steps are implemented to bridge this gap.
Authorized licensed use limited to: b-on: UNIVERSIDADE NOVA DE LISBOA. Downloaded on November 23,2024 at 16:30:34 UTC from IEEE Xplore. Restrictions apply.

B. Questions based on the known ticket description corpus

Based on the ticket descriptions available, categories are defined. Since every user describes incidents differently, the same group/category of incident can be expressed differently, with no words matching. The relevance of a particular word in categorization differs from word to word. To explain, take the example of “Access Issue”. The word “access” is primarily relevant in categorizing this issue as an authorization ticket. The relevance of the word “issue” is less. To ensure that the right set of features is identified and fed into the classifier, the relevance of each word needs to be identified. This can be a simple function: the ratio of the occurrence of the word in a specific bucket to the occurrence of the word across different tickets. This factor should be a key input before any classification. If this is not done, the probable categories identified by the classifier will include all those buckets where the word “issue” appears (according to our previous example).

In sophisticated Latent Semantic Analysis systems, the counts are usually modified so that rare words are weighted more heavily than common words. For example, a word that occurs in only 5% of the tickets should probably be weighted more heavily than a word that occurs in 90% of the tickets. The most popular weighting is TFIDF [5] (Term Frequency - Inverse Document Frequency). Under this method, the count in each cell is replaced by the following formula.

$\mathrm{TFIDF}_{i,j} = (N_{i,j} / N_{*,j}) \cdot \log(D / D_i)$, where
$N_{i,j}$ = the number of times word $i$ appears in a specific category $j$ (the original cell count),
$N_{*,j}$ = the total number of words in the ticket dump,
$D$ = the number of different categories,
$D_i$ = the number of categories in which word $i$ appears.

Keeping a threshold for the TFIDF ensures that relevant features (words) are picked up for classification. This is done both on the incident description and on the additional notes/clarifications of the past ticket dump.

Latent Semantic Analysis (LSA) filters out some of the noise in user-entered text and also attempts to find the smallest set of concepts that spans all categories of tickets. A bag of words [5] approach is followed, where the order of words is not important, only their frequency. Concepts are identified not just by one word, but by an array of words coming together. Even though LSA is not a direct semantic analysis, its technique of identifying the correlation of the various words in a ticket to its category supports semantic analysis. For example, “access failed” and “login issue” will both correlate heavily to the same category, and hence the net effect of LSA is a semantic analysis in NLP.

The incidents in the “Known ticket description Corpus” thus have a category defined.

The same techniques are applied to the additional notes/clarifications received, to understand the different queries/areas of information, for example the error code/message shown on screen, the user security role set up, etc. These additional questions are identified for each category of tickets.

Thus, our known ticket description/additional clarification repository has a detailed set of questions identified for each ticket category, which will be the basis for support to the service desk as detailed in the next section.

C. Chat bot based support for service desk

As soon as the high-level ticket description from the user is noted down by the service desk, the AI based model categorizes the incident. According to the category assigned, chat bot assisted queries and auto recording of user responses are offered to the service desk. The queries asked by the bot will be from the queries logged as per section 2.1.1. There will also be a few subsequent queries, based on the answers received from the user.

The ticket description entered based on these queries, together with the recorded responses from the user, will be detailed enough for the L2 support team to quickly start the analysis instead of waiting for additional information. This acts as a key step in eliminating wasted waiting effort from the value stream mapped for incident resolution and helps in providing quicker closure for the users.

III. AUTO TRIAGE

A critical challenge in incident resolution is the triaging issue, which results in wasted effort and resolution delays through wrong assignment of tickets to teams. Auto triage is the classification of tickets based on the incident description, and it is becoming a highly demanded area to automate.

To improve productivity, classification of tickets based on the incident description and auto triaging has become a standard offering in any application maintenance delivery. Based on the application/product supported and a high-level category of the ticket, the assignment group is auto allocated in most incident management tools. However, with the focus on cost optimization, there are many more resolver groups. L1.5 and L2 support teams for the same application handle different categories of tickets. The structured fields on which the business rules for auto triaging are based cannot differentiate the kind of tickets supported by the non-application SMEs of L1.5 and assign them accordingly.

There are several categories of tickets which get reassigned multiple times, based on the analysis done by each team. For example, an application non-availability will first be assigned to an application team, may then get reassigned to the network team, and may finally be resolved by the database team. Based on the most frequently resolving groups and the log file entries, such tickets can be rightly triaged by an AI based model. The detailed categorization and sub-categorization done will help in achieving this. The detailed description captured during Auto Record, explained in section 2.1, is also a key input for this categorization.

For quicker response time to end customers, as well as to avoid the effort leakage of ticket reallocation, it is important to have machine learning based ticket classification and triaging.
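
Because triage quality depends on which words of the incident description reach the classifier, the category-level TFIDF weighting defined in Section II.B can be illustrated with a small sketch. This is only an illustration under stated assumptions: the ticket list, the category names and the 0.1 threshold are hypothetical, tokenization is a plain whitespace split, and N*,j is assumed to count the words of category j (as the subscript suggests) rather than the whole dump.

```python
import math
from collections import Counter

# Hypothetical mini ticket dump: (category, description) pairs.
TICKETS = [
    ("authorization", "access issue on payroll application"),
    ("authorization", "login access denied"),
    ("network", "connectivity issue between servers"),
    ("database", "query timeout issue on reports"),
]


def category_tfidf(tickets):
    """Return {category: {word: weight}} with weight = (N_ij / N_*j) * log(D / D_i)."""
    counts = {}  # N_ij: occurrences of word i in category j
    for category, text in tickets:
        counts.setdefault(category, Counter()).update(text.lower().split())

    num_categories = len(counts)  # D
    # D_i: number of categories in which word i appears
    categories_with_word = Counter(w for c in counts.values() for w in c)

    weights = {}
    for category, word_counts in counts.items():
        total_words = sum(word_counts.values())  # N_*j (assumed per category)
        weights[category] = {
            word: (n / total_words)
            * math.log(num_categories / categories_with_word[word])
            for word, n in word_counts.items()
        }
    return weights


def select_features(weights, threshold):
    """Keep words whose weight exceeds the threshold in at least one category."""
    return {w for cat in weights.values() for w, v in cat.items() if v > threshold}


weights = category_tfidf(TICKETS)
features = select_features(weights, threshold=0.1)  # threshold is an assumption
```

In this toy dump, "issue" occurs in every category, so its logarithm term is zero and it is dropped, while "access" occurs only in the authorization category and is kept. This mirrors the “Access Issue” example in the text.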

Figure 2 Optimization in user created incidents

A Support Vector Machine (SVM) will help classify the tickets based on their nature as detailed in the incident description. The target is, for example, to assign a ticket with words such as “error” or “connectivity” to the network team, while a ticket with words such as “access” and “application” should go to the L1.5 team. During real-time ticket triaging, the incident will be considered as a data point, and the category/assignment group identified will be assigned.

If the error percentage in the model is high, this model can co-exist with keywords and phrases identified for each category. The weightage of keywords/phrases is based on the number of groups in which the word occurs and the total number of groups. All the words in the ticket description will be considered for identifying the keywords; only the ones with higher weightage will be retained and have rules created for them. Weightage needs to be considered based on the uniqueness of words with respect to assignment groups.

In all auto triaging models, the model can be calibrated based on newer entries in the incident list and their assignment groups. Leveraging multiple techniques helps reduce the error in ticket classification.

In the case of user created tickets, Latent Semantic Analysis filters (NLP bag of words) must be applied before ticket classification algorithms like SVM. The incidents thus auto categorized can then be assigned to the right resources, based on the group and availability.

IV. AUTO RESOLVE

Auto resolution is about fixing incidents without manual intervention. There are two types of auto resolution: (1) reactive and (2) proactive. Reactive resolution fixes an incident after it occurs; proactive resolution resolves a predicted issue.

The auto resolution of tickets has a direct positive impact on operational risk, due to reduced human errors in resolution and intervention. In the case of environment monitoring, it can result in 24x7 automated monitoring support of environments.

A. Reactive Auto Resolution

The feature-enhanced incident description and resolution, considered in a neural network model, will help in identifying the best possible resolution.

This classification and resolution identification can help in one of the following ways.

(1) If the incident resolution process uses scripts or is otherwise automated, the script can be triggered once the ticket is categorized and a resolution is identified that meets the confidence factor.

(2) If a repeating incident is not resolved by scripts, but a resolution is identified using the model, then the productivity of the L2 support team can be improved. This can be achieved by linking to the knowledge base and providing chat bot support to the engineer. In the case of a knowledge management tool based on business process flows, the chat bot can interact with the L2 support engineer with the right questions. Based on the answers and the available log information, there is high scope for productivity improvement. With the model directly integrated with the knowledge management tool, the steps followed in each incident can be tracked in more detail. In future, this will help in expanding the model to repeat the tasks recorded specifically for each such similar incident.

While resolving the ticket and providing information to the support team, details of the ticket are captured and updated in the knowledge management tool. This can be utilized to the full extent, leaving the exploration of automation endless in this case.

If the ticket is not repeating, or no conclusion can be made by the model, manual analysis and resolution are required. However, findings from the first-level machine analysis could be a good input, depending on the maturity of the machine learning. This could provide additional productivity levers even in manual resolution mode.

The training of the model needs to be revisited in scenarios where there are application behavioral changes due to updates, new business/technology scenarios (new product launch, market expansion), and technology upgrades and integration.

B. Proactive Auto Resolution

Proactive auto recording of incidents is possible if the events in the event log (big data platform) can be correlated and the succeeding event and required action can be predicted. Analytical tools will be listening to various sources: monitoring tools, user interfaces, the ticketing tool, App/DB/file servers, exchange servers and other infrastructure monitoring. The events captured across these sources contain a good amount of irrelevant warnings and information. These can be considered noise, as they do not result in any incident that needs intervention. Currently, a lot of effort is wasted in filtering this noise. We can inspect events in real time to assess their relevance, which helps in identifying real, actionable, higher-level incidents. This filtering is a key process; without it, the auto recording of incidents will result in a large volume of incidents.
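
To make the filtering step concrete, it can be pictured as a classifier that scores each incoming event as a likely incident or as noise. The sketch below is a minimal, hand-rolled categorical Naive Bayes (the algorithm this section goes on to name) with Laplace smoothing. It is an illustration, not the authors' implementation: the `NaiveBayesFilter` class, the labelled history rows, the label names and the 0.5 threshold are all hypothetical, and the features are a subset of the attributes the section mentions (component, criticality, environment).

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled history: event attributes and whether an incident followed.
HISTORY = [
    ({"component": "db", "criticality": "high", "environment": "prod"}, "incident"),
    ({"component": "db", "criticality": "high", "environment": "prod"}, "incident"),
    ({"component": "app", "criticality": "high", "environment": "prod"}, "incident"),
    ({"component": "app", "criticality": "low", "environment": "test"}, "noise"),
    ({"component": "db", "criticality": "low", "environment": "test"}, "noise"),
    ({"component": "app", "criticality": "low", "environment": "prod"}, "noise"),
    ({"component": "app", "criticality": "low", "environment": "test"}, "noise"),
]


class NaiveBayesFilter:
    """Categorical Naive Bayes: attributes are treated as independent given the label."""

    def __init__(self, history):
        self.class_counts = Counter(label for _, label in history)
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)            # feature -> distinct values seen
        for features, label in history:
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.values[name].add(value)

    def posterior(self, event):
        """Return {label: P(label | event)} computed in log space."""
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / total)  # prior
            for name, value in event.items():
                count = self.value_counts[(label, name)][value]
                # Laplace smoothing so unseen values do not zero out the product
                score += math.log((count + 1) / (n_label + len(self.values[name])))
            log_scores[label] = score
        norm = sum(math.exp(s) for s in log_scores.values())
        return {label: math.exp(s) / norm for label, s in log_scores.items()}

    def is_actionable(self, event, threshold=0.5):
        """Treat the event as a candidate incident if its posterior meets the threshold."""
        return self.posterior(event).get("incident", 0.0) >= threshold


event_filter = NaiveBayesFilter(HISTORY)
prod_db_alert = {"component": "db", "criticality": "high", "environment": "prod"}
test_app_warning = {"component": "app", "criticality": "low", "environment": "test"}
```

With this toy history, `event_filter.is_actionable(prod_db_alert)` is true and `event_filter.is_actionable(test_app_warning)` is false, so only the first event would go forward to auto recording. In practice the threshold would be tuned against the cost of missed incidents versus the volume of recorded ones.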

This noise filtering can be achieved using the Naïve Bayes algorithm. The Naïve Bayes classifier approach is applied to the large data stores routinely collected in day-to-day IT operations, and enables IT Operations Analytics tools to check whether an event is expected or not. Unlike going through the chain rule with AND conditions, all the attributes are considered independent for the analysis. The features of the above example can be the affected component, criticality, number of alerts in the past period, environment, etc.

For all the different features identified, the posterior probability is an average value, with $X = (X_1, \ldots, X_n)$.

This probability, or degree of certainty, can be used as a filter for the noise.

Reducing operational noise improves accuracy and confidence and helps the support team focus on the right issues more quickly. Most of the events are correlated. For example, a connectivity issue between a DB server and an App server will result not just in an event from the network, but also in events raised by the DB and the App, and multiple run-time errors from an n-tier application. In addition to the system generated alerts, there will be user events as well. These events should not be considered individually, but clustered and correlated as a meaningful transaction/situation and resolved. So, if one of the events in a chain of events has started, the model can predict the following ones.

This can be achieved by understanding the relationship between these events using the Apriori algorithm, which is an unsupervised machine learning technique for association rule creation. Self-joining and pruning are done k times, where k is the number of items in the frequent item sets obtained in the last iteration.

Based on the confidence threshold set, association rules can be generated. These can be leveraged in two ways:
(1) Associate the events coming up in the tool to show them as a single situation.
(2) Predict the possible succeeding events/incidents, so that proactive measures can be added as a resolution step in the ongoing event/incident.

Figure 3 Proactive Auto Resolution

For IT operations analytics, the (Source/Host + Time span + Event type + Event code) combination can be searched for other events frequently occurring together. Correlation of events should not be done by considering time as the only parameter. Factors like network proximity and similarity in events, based on past similar analyses and closures, should all be factored in. The system should also learn if a support engineer has correlated a few events during the analysis of any incident in the past.

Once the relationship is established, in cases where one or two events of a chain of associated events appear, the possible following events can be predicted. The ones that require action (based on history) can be raised as an incident. So, even before users find an application down, a possible server unavailability incident will be logged and taken care of. This can substantially reduce the user raised high priority incidents around availability, performance of the application, Reports/Load failures due to job/batch issues, Interface/file transfers, etc.

Thus, identifying relevant events and predicting future incidents helps improve application availability (and thus customer satisfaction), with the added benefit of reducing the cost of manual interventions.

The incidents thus auto recorded can be auto resolved as well, if the resolutions of past incidents are recorded in the system. These may be simple bots, such as a server boot, disk space clearing, job re-runs, etc.

V. CONCLUSION

The authors feel that there is huge potential to enhance the algorithms discussed in all the steps of application maintenance. Automation and AI infused application maintenance is the future of software service delivery. Organizations can take the topics discussed in this paper, tailor the approach and drive actions in application maintenance that can help deliver efficiency and cost optimization.

VI. DISCLAIMER:

This Whitepaper has been published for information and illustrative purposes only and is not intended to serve as advice of any nature whatsoever. The information contained and the references made in this Whitepaper is in good faith and neither Accenture nor any of its directors, agents or employees give any warranty of accuracy (whether expressed or implied), nor accepts any liability as a result of reliance upon the content including (but not limited to) information, advice, statement or opinion contained in this Whitepaper. This Whitepaper also contains certain information available in public domain, created and maintained by private and public organizations. Accenture does not control or guarantee the accuracy, relevance, timeliness or completeness of such information. Accenture does not warrant or solicit any kind of act or omission based on this Whitepaper. The Whitepaper is the property of Accenture and its affiliates and Accenture shall be the holder of the copyright or any intellectual property over the Whitepaper. No part of this document may be reproduced in any manner without the written permission of Accenture. Opinions expressed herein are subject to change without notice.

REFERENCES

[1] Kevin Knight, Elaine Rich, B. Nair, Artificial Intelligence, Third Edition.
[2] Adnan Darwiche, Modeling and Reasoning with Bayesian Networks, 1st Edition.
[3] Kevin P. Murphy, Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series), 1st Edition.
[4] Sunil Ray, 6 easy steps to learn Naïve Bayes Algorithm, https://www.analyticsvidhya.com/blog/2015/09/naive-bayes-explained/
[5] Bag of Words & TF-IDF, https://deeplearning4j.org/bagofwords-tf-idf