Exception Based Modeling and Forecasting

Igor A. Trubin
1. Introduction
(I have previously experienced similar excitement during my teaching career when I would see some sparks of understanding in the eyes of my students. I hope all the readers of this paper have those sparks as well.)

Why? Because in a good environment most of the production boxes are not trending much, and you could simply enjoy this type of healthy future prediction. As a result, we had to scan all those thousands of charts manually (eyeballing) to select only those that had real and possibly dangerous trends.

Some of these experiences are discussed in previous CMG papers [1] and [2], where “… authors used the SAS language to automate the production of resource projections from business driver inputs (see Figure 3). If projections are stored in a database, automatic “actual vs. projected” comparisons can also be displayed…”

D. The starting historical time point should take into account significant events such as hardware upgrades; virtual machine, database and application migrations; LPAR reconfigurations; and so on.

E. “Bad” data points should be excluded from historical samples as “outliers”.

RULE A: “Right Summarization”. Data collection must provide very granular data (10 sec - 15 min samples at least). However, for analysis the data should be summarized by hour, day, week or month. It is not a good idea to produce trend forecasts against raw, very granular data, even using good “time series” algorithms. On the other hand, if you have only hourly or daily snapshot-type data, the forecast could be very misleading even after summarization.
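To make RULE A concrete, here is a minimal sketch (not the paper's SAS implementation) that summarizes finely-grained samples into hourly and daily statistics with Python/pandas. The 5-minute collection interval, the metric name, and the random data are assumptions made only for illustration.

```python
import pandas as pd
import numpy as np

# Hypothetical raw collection: one CPU-utilization sample every 5 minutes for 4 weeks.
idx = pd.date_range("2007-04-01", periods=12 * 24 * 28, freq="5min")
raw = pd.DataFrame({"cpu_util": np.random.uniform(20, 80, len(idx))}, index=idx)

# Summarize before any trending: hourly means, then daily average and daily peak hour.
hourly = raw["cpu_util"].resample("1h").mean()
daily_avg = hourly.resample("1D").mean()    # daily average of hourly means
daily_peak = hourly.resample("1D").max()    # average of the busiest hour per day

print(daily_avg.head(), daily_peak.head(), sep="\n")
```

The daily average and the daily peak-hour average produced here correspond to the two statistics compared later in Figure 10.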
RULE C: “Statistical model choice”. Let's leave the detailed discussion of this subject to real statisticians/mathematicians and here just formulate the basic rule: start with a linear trend. Play with other statistical algorithms if they are available, and use them only if absolutely necessary. (A minimal sketch of a plain linear-trend fit is given after RULE D below.)

(The stepar and trend=2 are default values, and they mean “stepwise autoregressive” with the “linear trend model”.)

The chart in Figure 5 shows the same data with different future trends because three different algorithms were used. That figure shows that the 1st method tries to reflect weekly oscillation (seasonality) while the others try to keep up with the most recent trends; the last one is the most aggressive.

RULE D: “Significant Events”. The standard forecast procedure (based on a time series algorithm) might work well where the history consistently reflects some natural growth. However, often due to upgrades, workload shifts or consolidations, the historical data consists of phases with different patterns. The forecasting method should be adjusted to take into consideration only the latest phase with a consistent pattern.
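As referenced under RULE C above, here is a minimal stand-in sketch of a plain linear-trend forecast in Python. It is not the SAS PROC FORECAST stepwise-autoregressive procedure the paper refers to; the synthetic history, the 90-day horizon, and the function name are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def linear_forecast(daily, horizon_days=90):
    """Fit y = a*t + b to a daily series and extrapolate `horizon_days` ahead."""
    y = daily.to_numpy(dtype=float)
    t = np.arange(len(y))
    a, b = np.polyfit(t, y, deg=1)                      # slope (per day) and intercept
    future_t = np.arange(len(y), len(y) + horizon_days)
    future_idx = pd.date_range(daily.index[-1], periods=horizon_days + 1, freq="D")[1:]
    return a, b, pd.Series(a * future_t + b, index=future_idx)

# Illustrative use with a synthetic, slowly growing daily-average series:
hist = pd.Series(40 + 0.1 * np.arange(120) + np.random.normal(0, 2, 120),
                 index=pd.date_range("2007-01-01", periods=120, freq="D"))
slope, intercept, projection = linear_forecast(hist)
print(f"trend: {slope:.2f} %/day, projected value in 90 days: {projection.iloc[-1]:.1f} %")
```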
A well-known approach is to mark a resource as potentially running out of capacity when its future trend intersects some obvious threshold. However, this approach does not work when the threshold is unknown. Below we discuss another method that does not have this weakness.
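For completeness, here is a tiny sketch of the “trend meets threshold” check described above, under the same linear-trend assumption as the previous sketch; the 80% threshold and the example numbers are arbitrary illustrative values.

```python
def days_until_threshold(slope, intercept, last_t, threshold=80.0):
    """Days until the fitted line slope*t + intercept reaches `threshold`,
    or None if the trend is flat/decreasing (no projected crossing)."""
    if slope <= 0:
        return None
    t_cross = (threshold - intercept) / slope
    return max(0.0, t_cross - last_t)

# Illustrative numbers only: a 0.1 %/day trend, ~40% baseline, 120 days of history.
print(days_until_threshold(slope=0.1, intercept=40.0, last_t=119))  # about 280 days
```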
That, for instance, allows you to see where system performance and business driver metrics correlate, simply by analyzing control charts.

- RULE B: “Do Not Mix Shifts” is easily demonstrated by the weekly/hourly control chart because it visualizes the separation of work or peak time and off time.

- RULE C: “Statistical Model Choice” means playing with different statistical limits (e.g. 1 st. dev. vs. 3 or more st. dev.) to tune the system and reduce the rate of false positives (see the code sketch below).

… behavior, and some examples of this analysis were published in another CMG paper [5].

But the most efficient use of this metric is to filter the top most exceptional resources in terms of unusual usage.

Publishing this top list in some way (e.g. bar charts shown on Figure 9 or 16), along with links to control charts, significantly reduces the number of servers that require the focus of Capacity Planning or Performance Management analysts.
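As mentioned in the RULE C bullet above, here is a minimal sketch of how weekly/hourly (hour-of-week) control limits and exception flagging might look in Python/pandas. The k-sigma multiplier, the function names, and the data layout are assumptions; this illustrates the MASF-style idea rather than the paper's implementation.

```python
import pandas as pd

def hour_of_week_limits(hourly, k=3.0):
    """Baseline mean and k-sigma control limits per hour-of-week (0..167)."""
    how = hourly.index.dayofweek * 24 + hourly.index.hour
    grouped = hourly.groupby(how)
    mean, std = grouped.mean(), grouped.std()
    return mean, mean + k * std, mean - k * std      # mean, UCL, LCL

def flag_exceptions(current_week, mean, ucl, lcl):
    """Return the hours where the current week's actuals fall outside the band."""
    how = current_week.index.dayofweek * 24 + current_week.index.hour
    over = current_week > ucl.reindex(how).to_numpy()
    under = current_week < lcl.reindex(how).to_numpy()
    return current_week[over | under]

# mean, ucl, lcl = hour_of_week_limits(reference_hourly, k=3.0)
# exceptions = flag_exceptions(this_week_hourly, mean, ucl, lcl)
```

Raising k from 1 toward 3 or more standard deviations widens the band and reduces false positives, which is exactly the tuning RULE C describes.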
The history of the “ExtraValue” or EV metric records is useful here, as the most recent negative value of this metric indicates the time when the data actually started trending up.

To illustrate how this works, let's look at a couple of case studies with actual data. One day, server number 9 hit the exception list as shown on Figure 9. Clicking on the control chart link, which the web report should have on the same page, brings up the control chart (shown on Figure 8). That indeed shows some signs of exceptional server behavior:

- Some hourly exceptions occurred on Monday.
- During the entire previous week the actual CPU utilization was slightly higher than average (green mean curve).
- On Friday the upper limit (red curve) reached the 100% threshold for a few hours, which indicates that in the past the actual data might have been at the 100% level on other Fridays; the Friday average curve is also higher than on the other days.

Based on this information, it is a good idea to look at the historical trend. But which metric statistic is better suited for that: daily average or average of the peak hour? Let's look at Figure 10, where both statistics are presented.

Figure 10 – Trend Forecast Chart of Daily Average vs. Daily Peak Hour Average CPU Utilization

Which presentation is better? It depends. The common recommendation is to use the daily peak for OLTP or web application servers and the daily average for back-up and other batch oriented application servers.

But even looking at the Daily Peak trend forecast, the future looks good. Why? Because RULE D is not applied and the entire history was used for forecasting. But the history is obviously more complicated, and it is a good idea to analyze only the last part of it. That can be seen clearly just by eyeballing the historical trend chart. Could that decision be made automatically? Certainly, if one looks at the history of the “ExtraValue” or EV metric on Figure 12.

Note that the most recent negative value of the ExtraCPUtime metric (which is the EV meta-metric derived from the CPU utilization metric) points exactly to the point in time when CPU utilization started to grow. Basically, to find a good starting point for analyzing the history one needs to find the roots of the following equation:

EV(t) = 0

where EV for this example is “ExtraCPUtime” (unusual CPU time used), a function of time t (in days).

If EV metrics are recorded daily in some database, this equation could be easily solved by developing a simple program using one of the standard algorithms (a minimal sketch is given below). The solution for the real data example shown above is t ≈ 04/22.

Figure 11 – Corrected Trend Forecast
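A minimal sketch of the “simple program” mentioned above for locating the most recent root of EV(t) = 0, assuming the daily EV values are already available as a pandas Series indexed by date (the function and variable names are illustrative only):

```python
import pandas as pd

def forecast_start_date(ev_daily: pd.Series):
    """Most recent root of EV(t) = 0: the first day after the latest negative EV value."""
    negative_days = ev_daily[ev_daily < 0]
    if negative_days.empty:
        return ev_daily.index[0]             # no negative EV: the whole history is usable
    last_negative = negative_days.index[-1]
    later = ev_daily.index[ev_daily.index > last_negative]
    return later[0] if len(later) else last_negative

# Example (per the case study above, this lands around 04/22):
# start = forecast_start_date(ev_daily)
# recent_history = daily_average_cpu[daily_average_cpu.index >= start]
```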
Figure 12 – History of “ExtraValue” Metric (ExtraCPUtime) vs. Daily Average of CPU Utilization
Figure 14 – ExtraTime data analysis

Some oscillations are seen around the most recent negative EV value, but those might be tuned out, as these cases use a 1 st. dev. threshold, which is too sensitive. And of course, by the term “recent” one should assume at least a few weeks or more, to have enough data for meaningful trend analysis.

One of the unique parts of this method is the following. If a metric does not have an obvious threshold (e.g. I/O, paging or Web hit rates), the approach works anyway, and the trend forecast will be built only for resources (e.g. disk, memory or Web application) that recently started dangerously trending up. Additional modeling may be needed to estimate what drives the increase and how to avoid potential problems.
Figure 16 – Web Report about Top Servers that Released CPU Resource
This simple correlation model shows us that the maximum number of hits per second this server can handle is about 18 (a minimal sketch of such a model is given below). This is a meaningful result, and if the application's support team anticipates a higher hit rate in the near future based on specifications, stress test results, customer behavior, forecasts and/or business projections, this server will need more processing capacity to meet the requirements.

This model also allows for a more complex analysis. To apply the method explained in the previous paragraph of this paper, calculate the historical starting point based on the most recent negative EV and look at the most recent trend, which is apparently the worst case scenario, as shown on Figure 20.
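A minimal sketch of such a simple correlation model, as referenced above: fit CPU utilization as a linear function of the hit rate and estimate the hit rate at which the CPU reaches 100%. The data points below are invented purely so that the example reproduces a result of roughly 18 hits per second; they are not the paper's measurements.

```python
import numpy as np

# Hypothetical paired observations: (hits per second, CPU utilization %).
hits = np.array([2, 4, 6, 8, 10, 12, 14])
cpu = np.array([12, 23, 34, 45, 55, 67, 78])

slope, intercept = np.polyfit(hits, cpu, deg=1)   # CPU% ~= slope * hits + intercept
max_hits = (100.0 - intercept) / slope            # hit rate that saturates the CPU
print(f"estimated capacity: about {max_hits:.0f} hits/sec")
```

The same fit can, of course, be done with standard spreadsheet tools, as the Summary notes.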
6. Summary

Capacity management in a large IT environment should perform forecasting and modeling only when it is really needed. This saves a lot of man-hours and computer resources.

Exception Detection techniques, along with an Exception Database, can be used to automate the decision-making process with regard to what needs to be modeled/forecasted and when.

MASF Control Charts have the ability to uncover trends showing actual data deviations from a historical baseline. The most recent negative EV (the “ExtraValue” of exceptions meta-metric first introduced in CMG'01) indicates the moment in time from which it is best to start the trending analysis of a historical sample.

A common way of raising a future capacity concern, by calculating the future trend intersection with some constant threshold, does not work for metrics without obvious thresholds. The Statistical Exception Detection approach helps to produce the trending analysis necessary for those cases.

Workload pathologies (e.g. run-aways or memory leaks) should be excluded from a historical sample in order to improve the forecasting. The Exception Detector provides data (dates and hours) for that.

Correlation analysis of application data (e.g. web hits) vs. server performance data (e.g. CPU utilization) gives a priceless opportunity to add meaning to forecasting/modeling studies, and that analysis can be done using standard spreadsheet tools.

7. References

[1] Merritt, Linwood, "Seeing the Forest AND the Trees: Capacity Planning for a Large Number of Servers", Proceedings of the United Kingdom Computer Measurement Group, 2003.

[2] Merritt, Linwood and Trubin, Igor, "Disk Subsystem Capacity Management, Based on Business Drivers, I/O Performance Metrics and MASF", CMG2003 Proceedings.

[3] Buzen, Jeffrey and Shum, Annie, "MASF - Multivariate Adaptive Statistical Filtering", CMG1995 Proceedings, pp. 1-10.

[4] Trubin, Igor, "Capturing Workload Pathology by Statistical Exception Detection System", CMG2005 Proceedings.

[5] Trubin, Igor, "Global and Application Levels Exception Detection System, Based on MASF Technique", CMG2002 Proceedings.

[6] Trubin, Igor, "System Management by Exception, Part 6", CMG2006 Proceedings.

[7] McLaughlin, Kevin and Trubin, Igor, "Exception Detection System, Based on the Statistical Process Control Concept", CMG2001 Proceedings.

[8] Trubin, Igor and White, Ray, "System Management by Exception, Part Final", CMG2007 Proceedings.
8. APPENDIX
Figure 22 – 2D Model

For this model the formula for the EV calculation is:

EV(t) = S+,  if U(t) - UCL(t) > 0
EV(t) = S-,  if U(t) - LCL(t) < 0
EV(t) = 0,   if LCL(t) <= U(t) <= UCL(t)

where S+ = U(t) - UCL(t) and S- = U(t) - LCL(t).

In the general case, S+ and S- as shown on Figure 24 have the following geometrical meaning: each is the area between the actual data curve (U) and the corresponding statistical limit curve (UCL or LCL):

S+ = ∫ (U(h,t) - UCL(h,t)) dh  on intervals where U - UCL > 0, and S+ = 0 where U - UCL <= 0;
S- = ∫ (U(h,t) - LCL(h,t)) dh  on intervals where U - LCL < 0, and S- = 0 where U - LCL >= 0.

They should be calculated only on intervals where the actual metric is outside of the UCL - LCL band. If the metric is within the band, then both S+ and S- as well as EV are equal to zero.
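A minimal numerical sketch of the appendix formulas, assuming hourly arrays of the actual metric U and the limits UCL and LCL for one day. The unit hourly spacing (so that a sum approximates the integral) and the way S+ and S- are combined into a single daily value are assumptions made for illustration, not the paper's code.

```python
import numpy as np

def extra_value(u, ucl, lcl):
    """EV for one day: the signed area of U outside the UCL-LCL band.
    With hourly points at unit spacing, a plain sum approximates the integral."""
    u, ucl, lcl = (np.asarray(x, dtype=float) for x in (u, ucl, lcl))
    s_plus = np.where(u > ucl, u - ucl, 0.0).sum()    # area above the upper limit
    s_minus = np.where(u < lcl, u - lcl, 0.0).sum()   # area below the lower limit (negative)
    # The paper treats EV as S+ above the band and S- below it; summing the two
    # is one simple way to obtain a single daily value (zero inside the band).
    return s_plus + s_minus

u   = [50, 55, 90, 95, 60]     # actual hourly utilization (illustrative values)
ucl = [70, 70, 80, 80, 70]     # upper control limit per hour
lcl = [30, 30, 40, 40, 30]     # lower control limit per hour
print(extra_value(u, ucl, lcl))  # 10 + 15 = 25 "extra" units above the limits
```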