Capacity Planning and Allocation For Web-Based Applications
Volume 45 Number 3
June 2014
Ferdous M. Alam
Supply Chain Business Intelligence, Nestle USA Inc, 800 North Brand Boulevard (5A24),
Glendale, CA 91203, e-mail: [email protected]
ABSTRACT
Motivated by the technology division of a financial services firm, we study the problem
of capacity planning and allocation for Web-based applications. The steady growth in
Web traffic has affected the quality of service (QoS) as measured by response time
(RT), for numerous e-businesses. In addition, the lack of understanding of system
interactions and availability of proper planning tools has impeded effective capacity
management. Managers typically make decisions to add server capacity on an ad hoc
basis when systems reach critical response levels. Very often this turns out to be too
late, resulting in extremely long response times and system crashes. We present
an analytical model to understand system interactions with the goal of making better
server capacity decisions based on the results. The model studies the relationships and
important interactions between the various components of a Web-based application
using a continuous time Markov chain embedded in a queuing network as the basic
framework. We use several structured aggregation schemes to appropriately represent
a complex system, and demonstrate how the model can be used to quickly predict
system performance, which facilitates effective capacity allocation decision making.
Using simulation as a benchmark, we show that our model produces results within 5%
accuracy at a fraction of the time of simulation, even at high traffic intensities. This
knowledge helps managers quickly analyze the performance of the system and better
plan server capacity to maintain desirable levels of QoS. We also demonstrate how to
utilize a combination of dedicated and shared resources to achieve QoS using fewer
servers. [Submitted: October 21, 2011. Revised: February 26, 2013. Accepted: March
13, 2013.]
† Corresponding author.
INTRODUCTION
In this article, we address the capacity planning and allocation decision that a
manager of a Web-based application system has to face on a daily basis. Our
study was motivated by the capacity and performance management problem in the
technology division of a Fortune 100 financial services firm. A typical Web-based
application system consists of Web servers, application servers, various external
servers (e.g., databases and mainframe computers), and Web-application logic
that controls the flow and sequence of customer requests through all the elements
of the system. The performance of the system varies significantly based on the
architecture, application logic, and capacity. The application server is central to
any Web-based application as it serves as the traffic controller and controls the flow
of information between the end-user and the several tiers in the network.
An important decision that a system manager faces is determining and main-
taining the balance between incoming demand for services and application server
configuration. Managers have to maintain sufficient server capacity to guarantee
a desired quality of service (QoS). QoS is measured by response time (RT) or
the time it takes to respond to user requests. As demand increases, QoS deterio-
rates if additional capacity is not added. Due to the high performance expectations
associated with online services, it is important to have the ability to anticipate per-
formance degradation and quickly react to changing conditions in order to ensure
QoS. In the real-world application that we studied, managers added capacity on an
ad hoc basis without clearly understanding the resulting effect on RT. Also, due
to the lack of a clear understanding of the relationship between RT and capacity
levels, it was difficult to anticipate the rate of deterioration in QoS as demand
increased. As a result, the systems would reach near crash situations very rapidly
and capacity additions at that point yielded the desired results very slowly.
In most computer systems, the relationship between server capacity and RT
is nonlinear due to underlying system interactions, and the effects are magnified
under high traffic conditions. Capacity planning and allocation requires a thor-
ough understanding of system interactions and performance in order to meet and
maintain satisfactory customer service, and to minimize the operating and capital
costs. It is typically feasible to measure the performance of an existing system for
a given configuration and traffic load. However, it is quite challenging to predict
the performance of new systems in planning stages or even existing systems under
highly varying loads. Due to a lack of good analytical approaches to capturing the
capacity footprint on the application servers, managers relied on knowledge from
past crash situations and tended to overestimate server capacity, thus increasing op-
erating costs. Hence, managers are constantly faced with the conflicting objectives
of meeting customer expectations in terms of QoS and keeping IT costs reasonable
and under control (Almeida & Menascé, 2002b).
This problem is not unique to the financial services firm that we studied, but
is prevalent in most Web-based applications. The constantly increasing use of the
internet for personal and business activities will only aggravate this problem for
managers of Web-based applications. For example, in the United States, online
shopping between November 1 and December 31 in 2012 increased by 14% com-
pared to the same period in 2011 and resulted in 42.3 billion dollars in sales (Lipsman,
2013). This directly translates to capacity management issues for managers han-
dling the e-commerce applications for these businesses. Another application that
is bound to increase the need for effective server capacity planning and allocation
is electronic health records (EHRs). The American Recovery and Reinvestment
Act of 2009 (ARRA) and the health care bill passed in 2010 in the United States
have provided incentives and mandates for the “meaningful use” of health care
information technology (HIT) in improving the quality and cost effectiveness of
care. This includes the implementation of EHRs and computerized physician order
entry systems as critical building blocks for hospitals to avail themselves of incen-
tives and avoid penalties. This is accelerating the growth of electronic traffic in the
Web-based HITs, which will impact storage and server capacities in a significant
manner. Hence, it is critical for managers to understand system behavior and plan
and allocate server capacity based on this understanding.
Application servers are the central element of any Web-based system. Each
application server contains a limited number of channels that can be used by
incoming requests. While customer requests are processed in multiple stages and
routed through the application server, the application server channel serving an
arriving request is typically “locked” and is not available for other arriving requests.
When all channels are being used, the application server is completely busy and
cannot respond to other incoming requests. For example, when a customer accesses
online banking services to check an account balance, the request is first routed
to an application server through a Web server. The application server collects
the username and password information and sends the request to an external
security server to authenticate the user. During authentication, the application
server channel that handles the request is “locked” and cannot be used for other
arriving requests. When authentication approval is received, the application server
sends the account balance request to an external database server. When account
balance information is received, it is sent back to the user and the application
server channel is released. In this example, the first stage processing is completed
at the application server when it collects information from the user. The second
stage consists of two services completed by the security server and the database
server. While the second part of the service is being completed at one or more
external servers, the application server channel that handles that request remains
locked and cannot be used to serve another customer request. We refer to this
as “resource locking.” Resource locking is a very important phenomenon that
significantly complicates the interactions among servers and has to be considered
explicitly during performance analysis and capacity planning.
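To make resource locking concrete, the following discrete-event sketch (written with the SimPy library; the channel counts, rates, and run horizon are illustrative assumptions, not parameters from our study) shows how a request holds its application-server channel for the entire flow, including the wait at the external server:

```python
import random
import simpy

APP_CHANNELS = 4   # m_a: channels per application server (assumed value)
EXT_CHANNELS = 2   # m_e: channels at the external server (assumed value)

def request_flow(env, name, app_server, ext_server):
    arrival = env.now
    # Acquire an application-server channel; it stays locked for the
    # whole flow below, including the wait at the external server.
    with app_server.request() as app_channel:
        yield app_channel
        yield env.timeout(random.expovariate(100.0))      # first part of service
        # Resource locking: the application channel is NOT released here.
        with ext_server.request() as ext_channel:
            yield ext_channel
            yield env.timeout(random.expovariate(100.0))  # external service
        # The channel is released only now, when the request is complete.
    print(f"{name} response time: {env.now - arrival:.4f}s")

def source(env, app_server, ext_server, rate):
    i = 0
    while True:
        yield env.timeout(random.expovariate(rate))       # Poisson arrivals
        i += 1
        env.process(request_flow(env, f"req-{i}", app_server, ext_server))

env = simpy.Environment()
app = simpy.Resource(env, capacity=APP_CHANNELS)
ext = simpy.Resource(env, capacity=EXT_CHANNELS)
env.process(source(env, app, ext, rate=150.0))
env.run(until=1.0)
```

Running the sketch shows RTs inflating as the external-server queue grows, even though the application server performs no work of its own while its channel is locked.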
An increase in electronic traffic leads to higher traffic intensity at the appli-
cation server as well as the external servers. Higher traffic intensity at the external
servers results in application server channels remaining locked longer and even-
tually leads to servers crashing. As a result, service levels start deteriorating and
potential revenue could be impacted until the problem is fixed. To address this,
Bucholtz and Wright (2001) discuss “hot servers” that could provision capacity on
the fly. These servers contain a selection of critical applications that are loaded and
can be brought online within a very short time. However, Web administrators and
managers still need a model or an approach that can predict system performance
under various loads so as to help find the “right timing” to bring hot servers online.
This research develops an analytical model of a Web-based application system to
measure RTs and to develop insights for performance and capacity management.
The objective is to develop a model that can predict performance quickly enough
that it can be used as a process monitoring tool. The results from the model at
frequent intervals can help identify increasing trends in RTs and signal the need to
bring additional capacity online. The concept is similar to using a quality control
chart to monitor processes and take action when the process tends toward being
out of control. Specifically, we answer questions about predicting system performance under varying loads and about when, and how much, server capacity to add to maintain QoS.
LITERATURE REVIEW
There are two classes of studies on performance modeling and capacity planning
for Web services. The first models the Web server performance at a high level and
is useful for identifying performance trade-offs and making higher level server siz-
ing, additions, and allocations. The second class of models captures the low-level
details of the Hypertext Transfer Protocol (HTTP) and Transmission Control Pro-
tocol/Internet Protocol (TCP/IP) protocols and software components. The second
class of studies generally tends to include system interactions to a greater extent
than the first class of models. In this section, we summarize some of the relevant
research in both classes of models and identify the gaps in literature addressed by
our research.
The first class of models (Menascé, 2002; Cao, Andersson, Nyberg, &
Kihl, 2003; Liu, Heo, Sha, & Zhu, 2006) describes Web-based applications as a
single-tier architecture. The Web/application server is modeled as an M/G/1 queue
and all of the downstream processing is combined into the service time of this
queue. Almeida and Menascé (2002a) present a general methodology for capacity
planning for Web services. The authors highlight the fact that capacity planning
techniques rely very heavily on accurate performance prediction. Cao et al. (2003)
use an M/G/1 queuing model with processor sharing service discipline to model a
Web server. They derive closed-form expressions for average RT, throughput,
and blocking probability, and test their model against an experimental setup. Liu et
al. (2006) also use an M/G/1 processor sharing queue to predict performance, and
use an online adaptive feedback loop that enforces admission control to ensure QoS.
These single-tier architecture models do not entirely capture the effects of
downstream congestion on Web/application server capacity because resource lock-
ing is not addressed in this stream of research. However, the use of a general
distribution to approximate the cumulative service time and processor sharing ser-
vice discipline are important contributions. The second class of models, which we
summarize next, attempts to address the concept of resource locking directly and
to treat performance modeling and capacity planning at a lower level.
Resource locking is similar to blocking, but is more restrictive. Blocking
of an upstream resource occurs when the downstream queue is full. However, a
resource can be locked even if the downstream queue is empty. Resource locking
occurs when the upstream resource is waiting for a response from a downstream
resource. In the case of multiple upstream servers, they all become blocked simulta-
neously. Onvural (1990), Perros (1994), and Balsamo, Persone, and Onvural (2001)
present works that study queuing networks with blocking and have proposed exact
solution techniques for very simple and special cases, and several approximate so-
lution techniques for cyclic queuing networks. Approximate analysis for queuing
networks with simultaneous resource possession (entities can hold two resources
simultaneously) has been studied by Jacobson and Lazowska (1982) and Freund
and Bexfield (1983). In both studies, the second or downstream resource is used
only by one class of customers. The models cannot be directly extended to include
secondary subsystems that also receive external customers from other sources.
Layered queuing networks (LQN) and stochastic rendezvous networks
(SRVN) have been used to model software architecture systems with multiple
layers of servers (Woodside, 1989; Woodside, Neilson, Petriu, & Majumdar,
1995; Neilson, Woodside, Petriu, & Majumdar, 1995; Rolia & Sevcik, 1995).
Rolia and Sevcik (1995) have used LQN and have developed the method of layers
(MOL) to estimate the performance of distributed applications. Omari, Franks,
Woodside, and Pan (2005) have developed a solution procedure for LQN with
replicated subsystems to model large client–server systems with several identical
subsystems. Omari, Franks, Woodside, and Pan (2006) extended this methodology
to consider parallel subsystems in the network. However, none of these consider
resource locking.
Reeser and Hariharan (2000, 2002) present an analytical model for Web
server performance evaluation. They do consider resource locking, but the down-
stream resources or external servers receive entities only from the upstream re-
source (application server). Our model allows the downstream resources to process
arrivals from the application servers as well as other sources in the network. Ur-
gaonkar, Pacifici, Shenoy, Spreitzer, and Tantawi (2005) have presented a model
for multilayer internet services. However, the authors acknowledge not modeling
two critical issues that affect performance, namely, multiresource capture at a layer
and resources held simultaneously at multiple layers. The latter is the focus of our
study that we define as “resource locking.” Ramesh and Perros (2000a, 2000b)
have presented a model for distributed software systems that considers client–
server communication. They consider a mix of synchronous (client gets locked)
and asynchronous (client does not get locked) messages when a client commu-
nicates with a downstream server. The authors present a method to estimate RTs
when there are no asynchronous messages. Then they extend their method to ac-
commodate for the asynchronous messages using a service reduction technique.
Comparison with results from equivalent simulation models shows that the method
produces good results (average deviation up to 10%) for traffic intensity up to 70%.
The authors indicate that the accuracy of their method would decrease considerably
at higher traffic intensities. Also, the model presented can only calculate aggregate
RTs for synchronous and asynchronous messages. Moreover, the authors consider
servers of unit capacity and no external entities, and communication between the
servers can occur only if they are located on adjacent layers. Reeser and Hariha-
ran (2000, 2002), Urgaonkar et al. (2005), and Ramesh and Perros (2000a,
2000b) consider systems with Poisson arrivals, exponential service times, and
first-come, first-serve (FCFS) service discipline.
Mohan, Printezis, and Alam (2009) have proposed a Markovian model for a
very simple Web-based application. They consider a system with one application
server, one external server, and a single service step for arriving entities. Their
model allows locking entities from the application server (referred to as type-1
entities) and nonlocking entities from other external sources (referred to as type-2
entities) to be processed at the external server. They explain why they used a direct
modeling approach as opposed to standard queuing models. If the application server
was approximated as a G/G/c queuing system, the total service time of entities
would be the sum of the service time in the application server, the waiting time in
the external server queues, and the service time in the external servers. However,
estimating the mean and variance of the waiting time in the external queues is
complicated due to resource locking. They also show that because of resource
locking the waiting time of type-1 and type-2 entities in the external queues are
not identical. Real-world Web-based application systems are significantly more
complex, with several application and external servers, and service consisting of
several steps; therefore, the Mohan et al. (2009) model cannot be used as is to
evaluate the performance of such systems.
In this study, we use the basic model proposed by Mohan et al. (2009) as a
building block for modeling a more complex and realistic Web-based application
system that can estimate RT accurately and quickly for a wide range of traffic in-
tensities and system configurations. Our model is similar to the first class of models
in that we address a higher level performance prediction and capacity planning and
allocation problem. However, we address the important system interactions such as
resource locking, similar to the second class of models. Thus, our research bridges
the gap between the two streams of research in Web-server capacity planning and
management.
MODELING APPROACH
Web-based applications collect user-specific information from multiple sources
and display it back to the end-user. When the end-user requests a particular set
of information, the Web server routes the request to one of the many application
servers in the system. The application server then translates the user request into
transactions and determines the number and sequence of service steps necessary to
process the request. The routing of each request is dependent on the type of request
and also on the processing result of the previous service step. Each service step is
performed by a corresponding external server. When an external server completes
its processing the request is routed back to the application server. The application
server then sends it to the next external server for further processing or back to the
user if service is complete. Figure 1 describes the flow of entities through a typical
Web-based application system.
The system contains Na application servers and Ne external servers. Each
external server performs a unique service that consists of two parts. The first part is
completed at the application server, and the second part at one or more appropriate
external servers. The two parts of the service are represented together as a service
block in Figure 1. During the second part of the service, the application server
channel remains locked. The application server channel that handles an entity
remains locked throughout the entity’s stay in the system. When all service steps
are completed, the locked application server channel is released and the entity
leaves the application server. The external servers receive requests from different
sources. We define arrivals from the application server to an external server as
type-1 entities and arrivals from other sources as type-2 entities.

[Figure 1: Web-based application system and entity flow through the system. The figure shows the end user, the Web server, application servers AS-1 through AS-Na with initial processing rate μa[0], and external servers ES-1 through ES-Ne; each service block s pairs the application-server rate μa[s] with an external server having me[s] channels of rate μe[s], type-1 arrivals λ1[s], type-2 arrivals λ2[s], and failed/successful exits.]

The dashed line
in Figure 1 shows the path of a type-1 entity. Entities from every application server
follow similar paths. If service for a type-1 entity fails at an external server, the
entity returns to the application server and further processing may be halted. Next,
we describe the notation used to describe our model.
Notation
Subscripts and indices:
a = application server (subscript)
e = external server (subscript)
s = service (index)
Parameters:
μa [0] = service rate of the application server for the initial preprocessing of
type-1 entities
μa [s] = service rate of the application server for the first part of service s
ma = number of parallel channels in each application server
λ2 [s] = arrival rate of type-2 entities to external server s
μe [s] = service rate of external server s
me [s] = number of parallel channels in external server s
ρe [s] = traffic intensity of external server s
ρa = traffic intensity of each application server
Performance measures:
Model Assumptions
The objective of this article is to develop a model that captures the important char-
acteristics of the underlying real-world application, while providing an efficient
methodology to quickly analyze system performance and make capacity decisions.
In order to achieve this, we make the following assumptions.
Assumption 1: All Na application servers are identical and we analyze one application server at a time; type-1 traffic from the remaining Na − 1 application servers is treated as additional nonlocking
entities arriving at the external server. So, the modified arrival rate of type-2 entities
to the external server is $\lambda'_2[s] = \lambda_2[s] + (N_a - 1) \times \lambda_1[s]$.
Assumption 2: Service times at the application and external servers are expo-
nential and service discipline is FCFS.
The second class of research we described earlier assumes exponential ser-
vice times and FCFS service discipline to keep the analytical models tractable
when addressing complicated system interactions such as resource locking. The
results from these models show a good representation of real-world performance
metrics. A recent research study (Deslauriers, L'Ecuyer, Pichitlamken, Ingolfsson,
& Avramidis, 2007) on telephone call centers with call blending uses an M/M/n
queue with FCFS service discipline to predict the performance of the call center.
Extensive testing and comparison with real-world data from a call center showed
that the difference between the models and exact simulation was less than 1% for
important performance measures.
Following Mohan et al. (2009), let (i, j) denote the state of the system, where i is the number of type-1 entities and j is the number of type-2 entities. The steady-state probabilities satisfy

$$p_{i,j} = \frac{1}{\lambda_2 + \lambda_1 + \bar{\mu}_e}\left[\lambda_2\, p_{i,j-1} + \lambda_1\, p_{i-1,j} + \frac{i+1}{i+j+1}\,\bar{\mu}_e\, p_{i+1,j} + \frac{j+1}{i+j+1}\,\bar{\mu}_e\, p_{i,j+1}\right], \quad (1)$$

where $\bar{\mu}_e = \min(m_e, i_e + j)\,\mu_e$ is the total service rate of the external server and
$i_e = \min(m_a, i)$ is the number of type-1 entities in the external server at state
(i, j). The steady-state probabilities of the truncated state space are then estimated
numerically by limiting the number of type-1 and type-2 entities in the system. We
use modified versions of the two-dimensional CTMC and Equation (1) to represent
a service block in Figure 1.
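For readers who want to reproduce the truncation numerically, the sketch below (Python with NumPy; the rates, channel counts, and truncation levels are assumed placeholder values, not data from our study) builds the generator matrix of the truncated two-dimensional CTMC behind Equation (1) and solves the global balance equations:

```python
import numpy as np

# Assumed illustrative parameters for one external server.
lam1, lam2 = 40.0, 200.0   # type-1 and type-2 arrival rates
mu_e, m_e, m_a = 100.0, 5, 5
N1, N2 = 40, 40            # truncation levels for type-1 and type-2 counts

def idx(i, j):
    return i * (N2 + 1) + j

n = (N1 + 1) * (N2 + 1)
Q = np.zeros((n, n))

for i in range(N1 + 1):
    for j in range(N2 + 1):
        s = idx(i, j)
        i_e = min(m_a, i)                     # type-1 entities at the external server
        if i < N1:                            # type-1 arrival (lost beyond truncation)
            Q[s, idx(i + 1, j)] += lam1
        if j < N2:                            # type-2 arrival (lost beyond truncation)
            Q[s, idx(i, j + 1)] += lam2
        if i_e + j > 0:                       # completions, split proportionally
            total = min(m_e, i_e + j) * mu_e  # total external service rate
            if i > 0:
                Q[s, idx(i - 1, j)] += total * i_e / (i_e + j)
            if j > 0:
                Q[s, idx(i, j - 1)] += total * j / (i_e + j)

Q[np.arange(n), np.arange(n)] = -Q.sum(axis=1)

# Solve pi Q = 0 together with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print("P(system empty) =", pi[idx(0, 0)])
```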
[Figure 2: Transition diagram for a system with one application server, one external server, and no service in the application server. Transitions between neighboring states (i, j) occur at the arrival rates λ1 and λ2 and at the class-split service rates me μe (ie/(j + ie)) for type-1 completions and me μe (j/(j + ie)) for type-2 completions.]
[Figure 3: A service block, consisting of the application-server part of service s (rate μa[s]) and the external server (me[s] channels of rate μe[s]) with type-2 arrivals λ2[s] and departures, is replaced by a load-dependent rate server LDRS[s].]

For each ns = 1, 2, . . . , ma , the throughput of this network for type-1 entities, with
ns type-1 entities in the network, is calculated. We assign this throughput as the conditional service rate
for the LDRS, given that there are ns type-1 entities.
Even though our model includes service within the application server, it is
sufficient to use the two-dimensional CTMC because we condition on the number
of type-1 entities in the network. We define the state of the system as (ie , j ), where
ie is the number of type-1 entities in the external server and j is the number of type-2
entities in the system. The number of type-1 entities in the application server is then
ns − ie . In state (ie , j ) the total service rates of the application server and external
server are, respectively, $\bar{\mu}_a = (n_s - i_e)\,\mu_a[s]$ and $\bar{\mu}_e = \min(i_e + j, m_e[s])\,\mu_e[s]$.
The transition diagram of the system is similar to that shown in Figure 2,
with $\lambda_1 = \bar{\mu}_a$ and $\lambda_2 = \lambda_2[s]$. The expression for the steady-state probabilities
is

$$p_{i_e,j} = \frac{1}{\lambda_2[s] + \bar{\mu}_e + \bar{\mu}_a}\left[\lambda_2[s]\, p_{i_e,j-1} + \bar{\mu}_a\, p_{i_e-1,j} + \frac{i_e+1}{i_e+j+1}\,\bar{\mu}_e\, p_{i_e+1,j} + \frac{j+1}{i_e+j+1}\,\bar{\mu}_e\, p_{i_e,j+1}\right]. \quad (2)$$
We modify the service rates for all LDRS to accommodate service repetitions.
We divide the service rates of LDRS[s] by VR[s] to estimate the aggregated service
rate for VR[s] visits. The service rates for LDRS[0] are set to μact [0, n0 ] = n0 μa [0],
where 1/μa [0] is the initial processing time of entities in the application server. We
then use these LDRS to replace the service blocks and simplify the represen-
tation of the system as shown in Figure 4. Figure 5 provides a detailed description
of the LDRS algorithm.
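The LDRS construction can be sketched as follows (a minimal Python implementation under an assumed truncation level; the function name and parameter layout are ours, and the solver mirrors the one shown after Equation (1)):

```python
import numpy as np

def ldrs_rates(mu_a_s, mu_e_s, m_e_s, lam2_s, m_a, vr_s=1, n2=80):
    """Estimate mu_act[s, ns] for ns = 1..m_a by solving the conditioned
    two-dimensional CTMC of Equation (2): ns type-1 entities circulate
    between the application-server part and the external server, while
    type-2 entities arrive from outside at rate lam2_s."""
    rates = {}
    for ns in range(1, m_a + 1):
        n = (ns + 1) * (n2 + 1)
        idx = lambda ie, j: ie * (n2 + 1) + j
        Q = np.zeros((n, n))
        for ie in range(ns + 1):
            for j in range(n2 + 1):
                s = idx(ie, j)
                mu_a_bar = (ns - ie) * mu_a_s           # first-part completions
                mu_e_bar = min(m_e_s, ie + j) * mu_e_s  # external-server rate
                if ie < ns:
                    Q[s, idx(ie + 1, j)] += mu_a_bar
                if j < n2:
                    Q[s, idx(ie, j + 1)] += lam2_s
                if ie + j > 0:
                    if ie > 0:
                        Q[s, idx(ie - 1, j)] += mu_e_bar * ie / (ie + j)
                    if j > 0:
                        Q[s, idx(ie, j - 1)] += mu_e_bar * j / (ie + j)
        Q[np.arange(n), np.arange(n)] = -Q.sum(axis=1)
        A = np.vstack([Q.T, np.ones(n)])
        b = np.zeros(n + 1)
        b[-1] = 1.0
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)
        # Type-1 throughput = expected type-1 completion rate at the external server.
        tput = sum(pi[idx(ie, j)] * min(m_e_s, ie + j) * mu_e_s * ie / (ie + j)
                   for ie in range(ns + 1) for j in range(n2 + 1) if ie + j > 0)
        rates[ns] = tput / vr_s   # aggregate the rate over VR[s] visits
    return rates
```

For example, a call such as ldrs_rates(100.0, 100.0, 5, 200.0, 5) would return μact[s, 1] through μact[s, 5] for one assumed service block.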
[Figure 4: Simplified representation of the system, with each service block replaced by a load-dependent rate server: LDRS[0] (rate μact[0, n0]) through LDRS[Ne] (rate μact[Ne, nNe]), arrival rate λ1 from the end user, and failed exits.]
In each step of the SAM, we aggregate one LDRS node to the previously
aggregated node. We start with ALDRS[0] with service rate μagg [0, n] = nμa [0]
for n = 1 to ma . We then aggregate LDRS[1] with ALDRS[0] to derive a new
aggregated node ALDRS[1] with estimated service rates, μagg [1, n]. The process
continues until we aggregate all LDRS into a final aggregated node ALDRS[Ne ]
with service rate μagg [Ne , n]. Aggregation in step s is performed using the one-
dimensional CTMC illustrated in Figure 6.
[Figure 6: One-dimensional CTMC for aggregation step s, with states ie = 0, 1, . . . , n. Transitions from ie to ie + 1 occur at rate μagg[s − 1, n − ie] and transitions from ie to ie − 1 occur at rate μact[s, ie].]
$$p_{i_e} = \frac{1}{\mu_{act}[s, i_e]}\left[\left(ASR[s-1] \times \mu_{agg}[s-1,\, n-i_e+1] + \mu_{act}[s,\, i_e-1]\right)p_{i_e-1} - ASR[s-1] \times \mu_{agg}[s-1,\, n-i_e+2]\, p_{i_e-2}\right]$$
$$\text{for } i_e = 2, 3, \ldots, n. \quad (4)$$

The value of $p_0$ is calculated from the normalizing equation $\sum_{i_e=0}^{n} p_{i_e} = 1$.
We determine the steady-state probabilities numerically, and calculate the through-
put of the network as $\mu_{agg}[s, n] = \sum_{i_e=0}^{n} p_{i_e} \times \mu_{act}[s, i_e]$. This is the service
rate of the new aggregated node ALDRS[s]. We calculate the service rates
for n = 1, 2, . . . , ma and the success rate of entities in the aggregated node as
ASR[s] = ASR[s − 1] × SR[s].
By repeating the above procedure Ne times we combine all services into a
single ALDRS[Ne ] with service rate μagg [Ne , n]. We then use the CTMC to model
node ALDRS[Ne ] with arrival rate λ1 . The state of the system is defined as i, the
number of entities in the system and the probability of the system being in state i
is defined as pi . We then solve for the steady-state probabilities and calculate the
average performance measures.
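One aggregation step can be written compactly because the chain in Figure 6 is a birth-death process, so its stationary distribution follows from detailed balance (equivalently, from recursion (4)). The sketch below assumes rates are stored in dictionaries keyed by the number of entities; the function and variable names are ours:

```python
def aggregate_step(mu_agg_prev, mu_act_s, asr_prev, sr_s, m_a):
    """One step of the sequential aggregation method (SAM): combine
    ALDRS[s-1] (rates mu_agg_prev[n]) with LDRS[s] (rates mu_act_s[n])
    into ALDRS[s].  Returns the new rates and the updated ASR[s]."""
    mu_agg = {}
    for n in range(1, m_a + 1):
        # Birth-death chain on i_e = entities in LDRS[s]:
        #   up-rate from i_e:   ASR[s-1] * mu_agg[s-1, n - i_e]
        #   down-rate from i_e: mu_act[s, i_e]
        p = [1.0]
        for ie in range(1, n + 1):
            up = asr_prev * mu_agg_prev[n - ie + 1]
            p.append(p[-1] * up / mu_act_s[ie])   # detailed balance
        z = sum(p)
        p = [x / z for x in p]                    # normalize to sum to 1
        # Throughput of the two-node network = new aggregated service rate.
        mu_agg[n] = sum(p[ie] * mu_act_s[ie] for ie in range(1, n + 1))
    return mu_agg, asr_prev * sr_s
```

Starting from mu_agg_prev = {n: n * mu_a0 for n in range(1, m_a + 1)} and asr_prev = 1.0, calling aggregate_step once per service reproduces the SAM loop described above.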
Variance Estimation
Initial experimentation indicated that this method produced results very close to
the ones obtained using simulation for traffic intensities of up to 70%. For higher
traffic intensities, the deviation from simulation results increased significantly.
The main reason was an overestimation of the variance of the total service time of
type-1 entities in the final aggregated node. When the variance of the service time
is overestimated, the average waiting time in queue and the average total time in
system are also overestimated. Therefore, in addition to the service rates, we also
need to estimate the variance of the service time for the aggregated node at each
step. The procedure that we use to estimate the variance in each aggregated node
is described later. This method significantly improves the estimation of variance,
but we need to note that it does not entirely eliminate the overestimation. The
following additional parameters, all related to type-1 entities, are used in the
variance estimation procedure:
ST a [s] = service time for part of service s in application server; for exponential
service time E [ST a [s]] = 1/μa [s]; Var [ST a [s]] = (1/μa [s])2
ST e [s] = service time for part of service s in external server; for exponential
service time E [ST e [s]] = 1/μe [s]; Var [ST e [s]] = (1/μe [s])2
RT[s] = total residence time of type-1 entities in service s; E [RT[s]] =
1/μa [s] + W1 [s]
ST[s] = total service time of type-1 entities up to service s; $ST[s] = \sum_{i=0}^{s} RT[i]$
VWq1 [s] = variance of waiting time in queue of type-1 entities in external server s
At step s of the SAM, we solve the queuing system with the node ALDRS[s]
and arrival rate λ1 using the one-dimensional CTMC above. The number of entities
in the system, i, defines the state of the system. There are Min(i, ma ) entities in
service and the remaining entities wait in queue. We derive the expressions for the
steady-state probabilities using Equations (5) and (6) and solve them numerically
limiting the number of type-1 entities in the application system to na .
$$p_1 = \frac{\lambda_1}{\mu_{agg}[s, 1]}\, p_0, \quad (5)$$

$$p_i = \frac{1}{\mu_{agg}[s, \min(i, m_a)]}\left[\left(\lambda_1 + \mu_{agg}[s, \min(i-1, m_a)]\right)p_{i-1} - \lambda_1\, p_{i-2}\right] \quad \text{for } i = 2, 3, \ldots, n_a. \quad (6)$$

The value of $p_0$ is calculated using the normalizing equation $\sum_{i=0}^{n_a} p_i = 1$,
and the average total time in service is calculated as $E[ST[s]] = \sum_{i=0}^{n_a} \min(i, m_a) \times p_i / \lambda_1$.
[Transition diagram for the ALDRS[s] queue: arrival rate λ1 in every state, with service rates μagg[s, 1], . . . , μagg[s, ma − 1], μagg[s, ma], capped at μagg[s, ma] beyond state ma.]
The total time is the sum of type-1 entities’ service time in ALDRS[s − 1]
and residence time in LDRS[s]. Therefore, the average residence time of entities
in LDRS[s] is estimated as E [RT[s]] = (E [ST[s]] − E [ST[s − 1]]) /VR[s]. The
residence time RT[s] is the sum of the service time inside the application server,
waiting time in external server queue, and the service time in the external server.
Assuming that the components of ST[s] are independent, the mean and variance
of the total service time can be calculated using Equations (7) and (8).
$$E[ST[s]] = E[ST[s-1]] + E[ST_a[s]] + W_{q1}[s] + E[ST_e[s]], \quad (7)$$
$$Var[ST[s]] = Var[ST[s-1]] + Var[ST_a[s]] + VW_{q1}[s] + Var[ST_e[s]]. \quad (8)$$
The only unknown component in Equation (8) is $VW_{q1}[s]$. From Gross
and Harris (1998), for an M/M/me [s] queue with arrival rate (λ1 [s] + λ2 [s])
and service rate μe [s], the waiting time in queue distribution is

$$W_{q1}(t) = 1 - \frac{r^c p_0}{c!(1-\rho)}\, e^{-(m_e[s]\mu_e[s] - (\lambda_1[s] + \lambda_2[s]))t},$$

where $c = m_e[s]$, $r = (\lambda_1[s] + \lambda_2[s])/\mu_e[s]$, and $\rho = r/c$. If we assume that
the relationship between the mean and the variance of the waiting time of type-1
entities in the external server queue is similar to the relationship between the mean
and the variance of an M/M/c queue, it follows that the variance of the waiting
time, $VW_{q1}[s]$, can be expressed as

$$VW_{q1}[s] = \frac{2W_{q1}[s]}{m_e[s]\mu_e[s] - (\lambda_1[s] + \lambda_2[s])} - (W_{q1}[s])^2. \quad (9)$$
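The variance bookkeeping of Equations (7) through (9) amounts to a few lines of code; the sketch below (a minimal version with function and argument names of our own choosing) updates the running mean and variance for one service step, given the mean queue wait Wq1[s]:

```python
def service_time_moments(prev_mean, prev_var, mu_a_s, mu_e_s, m_e_s,
                         lam1_s, lam2_s, wq1_s):
    """Update E[ST[s]] and Var[ST[s]] via Equations (7)-(9), assuming the
    components of ST[s] are independent and exponential where stated."""
    theta = m_e_s * mu_e_s - (lam1_s + lam2_s)       # queue drain rate
    vwq1 = 2.0 * wq1_s / theta - wq1_s ** 2          # Equation (9)
    mean = prev_mean + 1.0 / mu_a_s + wq1_s + 1.0 / mu_e_s              # Eq. (7)
    var = prev_var + (1.0 / mu_a_s) ** 2 + vwq1 + (1.0 / mu_e_s) ** 2   # Eq. (8)
    return mean, var
```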
Repeating this procedure Ne times we obtain the final aggregated node
ALDRS[Ne ] with mean E [ST [Ne ]] and variance of total service time Var [ST [Ne ]].
Figure 7 presents the variance estimation at step s of the SAM. Finally, with the es-
timated mean and variance of service time, the final aggregated node is modeled as
an M/G/ma queuing system with arrival rate λ1 , expected service time E [ST [Ne ]],
and variance of service time Var [ST [Ne ]].
Next, we need to evaluate the average waiting time in queue, WQ, and the
average time in system, W , for type-1 entities. If WQE and WQD are, respectively,
the waiting time of entities in queue of an M/M/ma and an M/D/ma queue with the
same service rate as the M/G/ma queue, then Boxma, Cohen, and Huffels (1979)
show that
$$WQ_D = 0.5\, WQ_E\left[1 + \frac{(1-\rho)(m_a-1)\left(\sqrt{4+5m_a}-2\right)}{16\,\rho\, m_a}\right], \quad (10)$$
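The recovered text breaks off before showing how WQE and WQD are combined. A common interpolation weights them by the squared coefficient of variation of the service time; the sketch below uses that rule as an explicit assumption rather than as the authors' exact formula, and the example values at the end are placeholders:

```python
import math

def erlang_c_wait(lam, mu, c):
    """Mean queue wait WQE in an M/M/c queue (Erlang C)."""
    r = lam / mu
    rho = r / c
    inv_p0 = sum(r ** k / math.factorial(k) for k in range(c))
    inv_p0 += r ** c / (math.factorial(c) * (1.0 - rho))
    p_wait = (r ** c / (math.factorial(c) * (1.0 - rho))) / inv_p0
    return p_wait / (c * mu - lam)

def mg_c_wait(lam, mean_st, var_st, c):
    """Approximate mean queue wait in an M/G/c queue: WQD from
    Equation (10), blended with WQE by the squared COV of the service
    time.  The blending rule is a standard interpolation and is an
    assumption here, since the paper's exact combination was not
    recovered."""
    mu = 1.0 / mean_st
    rho = lam / (c * mu)
    wqe = erlang_c_wait(lam, mu, c)
    wqd = 0.5 * wqe * (1.0 + (1.0 - rho) * (c - 1)
                       * (math.sqrt(4.0 + 5.0 * c) - 2.0) / (16.0 * rho * c))
    cov2 = var_st / mean_st ** 2          # squared COV of the service time
    return cov2 * wqe + (1.0 - cov2) * wqd

# Assumed example: lam1 = 40/s, E[ST] = 0.09 s, Var[ST] = 0.005, ma = 5.
print(mg_c_wait(40.0, 0.09, 0.005, 5))
```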
Comparing the results of our analytical model with simulation helps us to understand the loss of fidelity in the analytical solu-
tions. If the resulting gap from simulation is acceptable, the analytical model tends
to be more useful because it can produce results much faster than a simulation
model.
We tested our model for a variety of configurations with and without service
repetition. The baseline data that we have used for the experimentation were derived
from the real-world application that motivated our study. We collected data related
to one specific application related to account summary at the financial services
firm we were studying. The application-specific data were the types and number
of services required for different types of requests arriving at the application,
frequently used service routings, types and number of external servers, and service
times at the application and external servers. We collected actual data for a period
of 3 months and determined the distributions for service times. In cases where the
actual service-time distributions were not exponential, we assumed them to be
exponential with the same mean, because our analytical models handle only Poisson
arrivals and exponential service times. The service rates for the application and external
servers that we have used in our experimentation (100 per second and 200 per
second for the application servers and 100 per second for external servers) are the
approximate rates at which the two types of application servers dedicated to the
account summary application and the external servers were functioning at during
our data collection phase.
The main parameter that influences RTs is the traffic intensity at the appli-
cation and external servers. Traffic intensity intrinsically captures the capacity of
the servers and the demand. Hence, we used a simple experimental design to vary
the traffic intensity at the application and external servers. Our data reflected the
change in traffic patterns during typical daily operations and we used this as a
guide for varying traffic intensities at the servers in our experimentation. We also
used the information obtained from our contacts at the firm. Spikes in demand do
happen at times, and in order to study the effect of these increases, we allowed
the traffic to increase beyond levels observed normally during the data collection
phase. Specifically, during the 3-month data collection phase, we observed 10 in-
stances when the system completely crashed due to extremely high traffic intensity.
Thirteen percent of the observations occurred when traffic intensity was between
85% and 90%. These were the occasions that significantly affected RTs.
We also present stability analysis results and discuss the computational re-
quirements of our algorithm in comparison to simulation. We define the deviation
in RT of our model from the simulation as follows:
$$\%\text{Deviation} = \frac{\text{Model RT} - \text{Simulation RT}}{\text{Simulation RT}} \times 100\%.$$
After establishing the validity of our model, we discuss the use of the model
to develop managerial insights and to aid capacity planning and management.
Table 1: Parameters for the four problem sets.

Set | Na | λ1 | ma | μa[0] | Services (s) | μa[s] | λ2[s] | μe[s] | me[s] | ρe[s]
1 | 1 | 38, 46, 53, 58, 62, 65, 66.5 | 5 | 100 | 1–3 | 100 | 200 | 100 | 5 | 44–53%
2 | 5 | 200, 235, 265, 285, 300, 310, 315 | 5 | 100 | 1–3 | 100 | 500 | 100 | 10 | 70–82%
3 | 5 | 250, 300, 330, 350, 365, 375, 380 | 10 | 100 | 1–6 | 200 | 500 | 100 | 10 | 75–88%
4 | 5 | 150, 180, 205, 225, 235, 240, 245, 247 | 10 | 100 | 1–10 | 200 | 600 | 100 | 10 | 75–85%
[Figure 9: Percentage deviation in total RT between our model and simulation versus traffic intensity (0.5 to 1.0) for problem sets 1 through 4.]

The four problem sets span a range of system
sizes. Problem sets 1 and 2 require three services, and sets 3 and 4 require six and
ten services, respectively. In all cases, we adjust the arrival rate of type-1 entities
such that the application server traffic intensity varies from 50% to 95%. Table 1
summarizes the various parameters used for the four problem sets. Figure 9 shows
the deviation in total RT between our model and simulation.
The results indicate that the deviation of our model from the simulation stays
below 5% for traffic intensities up to 95%. The deviation in set 2 is higher than in set
1 due to an increase in the number of application servers. Problem sets 3 and 4, with
a larger number of services, have lower deviations than problem sets 1 and 2. As
the number of services increases, the squared COV of the aggregated total service
time becomes smaller, and as a result, the overestimation of variance (during the
sequential aggregation) will have a smaller impact on the overestimation of the total
RT. Even with a large number of services, the overestimation of variance and hence
the deviation of RT become magnified at high traffic intensities. Experience from
real-world instances indicates that the nonlinear increase in RT occurs at traffic
intensities below 95%.

[Table 2: Parameters for evaluation of the model for systems with repetitive services.]

Therefore, the decision to intervene and add capacity or
reroute requests has to happen at traffic intensities below 95% where our model
consistently produces accurate results. Also, utilizing our model as a quality control
tool to monitor for increasing trends in traffic helps managers to adaptively add
capacity and prevent the system from reaching extremely high traffic intensities
and encountering crashes.

[Figure 10: Percentage deviation in RT between our model and simulation versus traffic intensity (0.5 to 1.0) for sets 5 and 6.]
Stability Analysis
We next evaluate the robustness of our model for different configurations and
traffic intensities. We choose set 7 with six service steps and six services (i.e.,
no service repetition) and set 8 with 15 service steps and six services (i.e., with
service repetition). The following parameters are common to both sets: Na = 5,
ma = 10, μa [0] = 50. The load from type-1 entities is adjusted so that the traffic
intensity of application servers varies from 50% to 95%. For each set, we use three
subsets of problems based on the percentage of load on the external servers from
type-2 entities. These are low external load (LEL; 30% of external server capacity
from type-2 entities), medium external load (MEL; 50%), and high external load
(HEL; 70%). Table 3 presents the configurations and the resulting range of traffic
intensities on the external servers.
Figure 11 presents the deviation of our model from simulation.

[Figure 11: Percentage deviation from simulation versus traffic intensity (0.5 to 1.0) under LEL, MEL, and HEL external loads; left panel Set 7, right panel Set 8.]
The results indicate that when the load on the external servers is low to
medium (LEL and MEL) our model produces very good results even at high
application server traffic intensity. The deviation stays below 5% for application
server traffic intensities up to 95%. However, when external server load is high
(HEL) the deviation increases rapidly at around 80% to 85%.
We have experimented with a wide range of configurations and traffic in-
tensity levels and we have generally found that our model produces good results
(within 5% error) when external server traffic intensity does not exceed 90% and
the application server traffic intensity is at or below 95%. Also, the most significant
parameters that influence the deviation are the traffic intensities of the application
and external servers. Our model is not significantly affected by the capacity of the
application servers, the number of external servers, the capacity of the external
servers, the number of services, service steps, or service repetition.
Computational Effort
The accuracy of the results produced by our algorithm depends on the values
of na and ne [s] that determine the size of the state space when solving for the
steady-state probabilities. The computation time depends heavily on the choice of
ne [s] that is used for the calculation of the service rates for the LDRS[s]. If we
choose a small value for ne [s] when solving for the steady-state probabilities of a
truncated state space, we lose a large percentage of λ2 [s] (arrivals to the external
server) and as a result we overestimate the service rates of the LDRS[s]. Therefore,
to accurately estimate service rates for the LDRS[s] we use values for ne [s] such
that the percentage of lost arrivals, λ2 [s], is less than 0.1%. When external server
traffic intensity is high, we need to choose a large value for ne [s], which increases
the computation time.
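In our implementation we chose the truncation level adaptively; a sketch of that loop is below (solve_boundary_mass is a hypothetical stand-in for re-solving the CTMC of Equation (1) at a given truncation level and returning the probability mass on the type-2 boundary, where arrivals are lost):

```python
def pick_truncation(solve_boundary_mass, start=20, max_n=640):
    """Grow the truncation level n_e[s] until the fraction of type-2
    arrivals lost at the boundary falls below 0.1%.  Arrivals are lost
    only in states with j = n_e[s], so the lost fraction equals the
    steady-state probability mass on that boundary."""
    n = start
    while n <= max_n:
        if solve_boundary_mass(n) < 0.001:   # under 0.1% of lambda_2[s] lost
            return n
        n *= 2                               # double the level and re-solve
    raise RuntimeError("truncation did not stabilize; external load may be too high")
```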
In order to estimate the computation time required by the simulation model,
we run the simulation long enough to produce stable results; specifically the
simulation is terminated only when the half width of the 90% confidence interval
for the average time in system is approximately 2% of the average. We consider
four problem sets from Table 3 (Set 7-LEL, Set 7-MEL, Set 8-LEL, and Set 8-
MEL) and express the computation time required by our algorithm as a percentage
of the time required by simulation. For the HEL problem sets, the traffic
intensity at the external server goes over 100% and the external system becomes
unstable, which also affects the application server. We observed results to this
effect in Figure 11 as well. For the HEL scenarios, we obtained results using our
algorithm. Because we did not obtain stable results using simulation, we have
not included the HEL scenarios in the comparison. The results are presented in
Figure 12.
[Figure 12: Computation time of our algorithm as a percentage of the time required by simulation, versus traffic intensity (0.5 to 1.0), for Set 7-LEL, Set 7-MEL, Set 8-LEL, and Set 8-MEL.]
[Figure 13: (a) Deviation of our model from simulation and (b) deviation of the NLA model from simulation, for sets 9 through 12, versus traffic intensity (0.5 to 1.0).]

To assess the importance of modeling resource locking, we also construct an
approximation that ignores it. In this approximation, we model each external
server as an M/M/me queue and estimate the mean and variance of the total time
in the external server system. We then estimate the mean and variance of the first
part of the service to determine the total RT for that service. Finally, we add the
service times for all services to estimate the mean and variance of the time taken
to complete all services. We can now model the application server as an M/G/ma
queue to evaluate the average waiting and average RT within the application server.
The queue discipline is first-in, first-out in both models. However, when we ignore
resource locking, entities that return to the application server after processing do
not have a dedicated channel and hence, would have to wait with all other requests.
We refer to this model as the no-locking approximation (NLA) model and compare
the performance estimated by this model with our proposed model. We use RT as
the comparison criterion and calculate the deviation of both the NLA model and
our model from the simulation model. We measure RTs as the duration between the
start of service at the application server and the completion of all external services
and return to customer. We consider four different problem sets (sets 9, 10, 11,
and 12) with different system configurations. The common parameters used for
the problem sets are μa [0] = 100, μa [s] = 100, and μe [s] = 100. Table 4 presents
other parameters and the varying arrival rates and traffic intensities. Figures 13(a)
and (b) present the deviation of our model and the NLA model from simulation.
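For concreteness, a sketch of the NLA calculation is given below (a minimal Python version under our independence and parameter-layout assumptions; mg_c_wait refers to the M/G/c approximation sketched after Equation (10)):

```python
import math

def erlang_c_wait(lam, mu, c):
    """Mean queue wait in an M/M/c queue (Erlang C)."""
    r, rho = lam / mu, lam / (c * mu)
    inv_p0 = sum(r ** k / math.factorial(k) for k in range(c)) \
        + r ** c / (math.factorial(c) * (1.0 - rho))
    p_wait = (r ** c / (math.factorial(c) * (1.0 - rho))) / inv_p0
    return p_wait / (c * mu - lam)

def nla_response_time(lam1, services, m_a, mg_c_wait):
    """No-locking approximation: treat each external server as an open
    M/M/m_e queue, sum the sojourn moments over all services, then feed
    the totals into an M/G/m_a queue for the shared application server.
    `services` is a list of (mu_a_s, mu_e_s, m_e_s, lam2_s) tuples; the
    layout is our assumption, not the paper's notation."""
    mean = var = 0.0
    for mu_a_s, mu_e_s, m_e_s, lam2_s in services:
        lam_ext = lam1 + lam2_s                # assumes one visit per service
        wq = erlang_c_wait(lam_ext, mu_e_s, m_e_s)
        vwq = 2.0 * wq / (m_e_s * mu_e_s - lam_ext) - wq ** 2  # as in Eq. (9)
        mean += 1.0 / mu_a_s + wq + 1.0 / mu_e_s
        var += (1.0 / mu_a_s) ** 2 + vwq + (1.0 / mu_e_s) ** 2
    # RT = wait in the shared application-server queue + aggregate service time.
    return mg_c_wait(lam1, mean, var, m_a) + mean
```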
The results in Figure 13 indicate that the RT estimated by the NLA model is much
higher than that of our model. The results seem counterintuitive at first sight, but do
reflect the underlying situations realistically. When the model considers resource
locking, each arriving customer request gets a dedicated application server channel,
which is a true reflection of the real system. When the entity returns from an
external server after processing, the application server is ready to process it with
the dedicated server channel. Hence, the RT is a function of only the delays
and processing times at the various external servers. On the other hand, when
we choose to ignore resource locking, there are no dedicated application server
channels. So, when the entities come back to the application server after processing
at the external servers, they do not get a higher priority and are queued with several
other arriving entities. Now, the RT is a function of processing times and delays
at external servers, as well as delays and waits at the application server. At higher
traffic intensities, the problem is exacerbated and RTs become very large.
In fact, Figure 13 shows that the error of the NLA model is more than
double the error of our model at higher traffic intensities. This shows that it is
very critical to consider the effect of resource locking at higher traffic intensities.
Ignoring the effects of resource locking in capacity planning will lead
to maintaining far more servers than necessary for the desired QoS, which
would eventually result in substantial unused capacity and tied-up capital.
MANAGERIAL IMPLICATIONS
Managers of Web applications strive to maintain high levels of customer service and
satisfaction while minimizing operating expenses through efficient management
of resources. They generally try to keep utilization and system congestion within
a certain range (often at a predetermined RT) in order to minimize the probability
of server failure. Hence it is critical to understand the trade-offs between operating
cost, efficient capacity utilization, RT, and reliability, and use this knowledge
effectively to deliver high-quality and reliable service at a reasonable cost.
[Figure 14: Response time (ms) of entities versus arrival rate to the application server (20 to 90 per second) for Na = 3 to 8; left panel Set 13, right panel Set 14.]

Our model can be used to estimate the load at
which the nonlinear increase in RT will occur. This could give advance warning
about imminent server failures, if no action is taken. With an a priori analysis
like this, the system manager has the ability to react and/or plan in advance, and
connect additional application servers to the system or reroute the requests, in
order to avoid system overloading and failure.
To improve reliability, it is important to assess the traffic intensity the system
will be able to handle with the existing capacity. It is also important to estimate the
capacity requirement for a certain arrival rate. The model presented can be used
for both. For example, for set 14, the results presented in Figure 14 indicate that
if the target RT is below 1000 ms and the peak load is 60 arrivals per second, five
application servers are required. Figure 14 can also be used to determine the load
the system can handle with a given number of application servers.
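Operationally, this sizing question reduces to a one-dimensional search over Na. The helper below assumes a wrapper function rt_model(arrival_rate, na) around the analytical model; such a wrapper is a hypothetical interface, not a function defined in this article:

```python
def min_app_servers(rt_model, arrival_rate, target_rt_ms, max_na=16):
    """Smallest number of application servers whose predicted RT meets
    the target.  rt_model(arrival_rate, na) is assumed to return the
    predicted RT in ms (e.g., the curves of Figure 14)."""
    for na in range(1, max_na + 1):
        if rt_model(arrival_rate, na) <= target_rt_ms:
            return na
    return None  # no feasible configuration within max_na servers

# From the text: for the set 14 configuration, a target RT below 1000 ms at a
# peak of 60 arrivals per second should yield min_app_servers(..., 60, 1000) == 5.
```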
[Figure: Arrival rates (per second) over a 24-hour period for Systems 1 and 2, the hourly assignment of applications 1 through 15 to each system, and the resulting response times (ms) over 24 hours under different server allocations, including panels (c) dedicated servers (4 and 3) plus 3 hot servers and (d) dedicated servers (7 and 5).]

In such dynamic environments, managers must continually
evaluate the performance of their applications and react quickly. Our modeling
and solution methodology addresses this need and can provide very reliable results
relatively quickly.
CONCLUSIONS
In this article, we have presented a Markovian model for performance evalua-
tion and adaptive capacity planning of Web-based computer application systems.
Specifically, our model addresses the concept of resource locking directly. We use
RT as the performance metric and measure the deviation of our model from equiv-
alent simulation models. Our model produces excellent results and the deviation
from simulation stays below 5% when external server traffic intensity does not
exceed 90% and the application server traffic intensity is at or below 95%. These
conditions reflect real-world Web application scenarios very well and hence the
value and applicability of our model. Another advantage of our model is that it
produces results much faster than simulation, especially at high traffic intensity
levels. This is particularly important when attempting to quickly assign hot servers
to applications that are reaching crash thresholds. We showed how our model can
be used to determine how many servers are needed as well as when and where they
should be added to maintain target RTs across all applications in a firm.
While the model is quite versatile and useful in answering several managerial
questions, there are a few limitations. The proposed model deals with only one
class of type-1 entities. In Web-based computer application systems, there can be
multiple classes of type-1 entities and the routing can depend on entity classes as
well as on the current state. Extending our model to accommodate multiple classes
of type-1 entities and state-dependent routing is an appealing extension for further
study. Another extension is considering unequal capacities for the various appli-
cation servers. Finally, arriving entities might be assigned to specific application
servers based on certain rules such as least busy server or round robin. While we
are currently comparing the proposed adaptive capacity management method to the
static fixed capacity, it would be interesting to explore other capacity management
options and compare the performance of those methods to the proposed method.
We leave these as directions for future research.
REFERENCES
Almeida, V., & Menascé, D. (2002a). Capacity planning: An essential tool for
managing Web services. IEEE IT Professional, 2(2), 33–38.
Almeida, V., & Menascé, D. (2002b). Capacity planning for Web services: Metrics,
models and methods. Upper Saddle River, NJ: Prentice Hall.
Balsamo, S., Persone, V. D. N., & Onvural, R. (2001). Analysis of queueing
networks with blocking. Boston, MA: Kluwer Academic.
Boxma, O. J., Cohen, J. W., & Huffels, N. (1979). Approximations of mean wait-
ing time in an M/G/s queueing system. Operations Research, 27(6), 1115–
1122.
Bucholtz, C., & Wright, R. (2001). 20 tools you need to beat the economy, CRN
Magazine, accessed January 29, 2014, available at http://www.crn.com/
news/channel-programs/18823815/20-tools-you-need-to-beat-the-
economy.htm
Cao, J., Andersson, M., Nyberg, C., & Kihl, M. (2003). Web server performance
modeling using an M/G/1/K*PS queue. Proceedings of the 10th International
Conference on Telecommunications, Papeete, Tahiti: IEEE, 1501–1506.
Deslauriers, A., L’Ecuyer, P., Pichitlamken, J., Ingolfsson, A., & Avramidis, A.
(2007). Markov chain models of a telephone call center with call blending.
Computers & Operations Research, 34(6), 1616–1645.
Freund, D. J., & Bexfield, J. N. (1983). A new aggregation approximation procedure
for solving closed queueing networks with simultaneous resource possession.
Journal of the ACM, 25(1), 214–223.
Gross, D., & Harris, C. M. (1998). Fundamentals of queueing theory. New York,
NY: Wiley.
Jacobson, P. A., & Lazowska, E. D. (1982). Analyzing queueing networks with
simultaneous resource possession. Communications of the ACM, 25(2), 142–
151.
Lipsman, A. (2013). 2012 U.S. online holiday spending grows 14 percent vs. year
ago to $42.3 billion, accessed January 29, 2014, comScore, Inc., available
at http://www.comscore.com/Insights/Press_Releases/2013/1/2012_U.S._
Online_Holiday_Spending_Grows_14_Percent_vs_Year_Ago_to_42.3_
Billion
Liu, X., Heo, J., Sha, L., & Zhu, X. (2006). Adaptive control of multi-tiered Web ap-
plications using queueing predictor. Proceedings of the 10th IEEE/IFIP Net-
work Operations and Management Symposium, Vancouver, Canada: IEEE,
106–114.
Menascé, D. (2002). Trade-offs in designing Web clusters. IEEE Internet Comput-
ing, 6(5), 76–80.
Mohan, S., Printezis, A., & Alam, M. F. (2009). A framework for modeling Web-
based applications with resource locking. International Journal of Opera-
tional Research, 6(3), 289–303.
Neilson, J. E., Woodside, C. M., Petriu, D. C., & Majumdar, S. (1995). Soft-
ware bottlenecking in client-server systems and rendezvous networks. IEEE
Transactions on Software Engineering, 21(9), 776–782.
Omari, T., Franks, G., Woodside, M., & Pan, A. (2005). Solving layered queueing
networks of large client server systems with symmetric replication. Proceed-
ings of the 5th International Workshop on Software and Performance, Palma,
Illes Balears, Spain: ACM, 159–166.
Omari, T., Franks, G., Woodside, M., & Pan, A. (2006). Efficient per-
formance models for layered server systems with replicated servers
and parallel behaviour. Journal of Systems and Software, 80(4), 510–
527.
Ferdous Alam has been working in the Supply Chain Performance Analytics &
Optimization group at Nestle USA, Inc., as a supply chain operations research ana-
lyst, where he is engaged in modeling for master production schedule optimization,
network optimization, transportation forecasting, and other supply chain–related
problems. Prior to joining Nestle, he worked as an operations research analyst at
Aviation Logistics Center (ALC), United States Coast Guard (USCG) at Elizabeth
City, NC, where he was engaged in demand analysis and forecasting, modeling
and analysis for capacity planning, and resource allocation to support efficient
inventory control and supply chain management for aviation spare parts at ALC,
USCG. He has a PhD in industrial engineering with a major in operations research
from Arizona State University, Tempe, AZ. He is very interested in studying and
applying operations research techniques to solve real-life problems.
John Fowler is the Motorola Professor and Chair of the Supply Chain Manage-
ment department at Arizona State University (ASU). He is also a professor of
industrial engineering and was previously the program chair for IE at ASU. His
research interests include discrete event simulation, deterministic scheduling, and
multicriteria decision making. He has published over 100 journal articles and over
100 conference papers. He was the Program Chair for the 2008 Industrial Engi-
neering Research Conference, the 2008 Winter Simulation Conference (WSC),
and Co-Program Chair for the 2012 INFORMS National meeting. He is currently
serving as editor-in-chief for a new Institute of Industrial Engineers journal focused
on health care delivery systems entitled IIE Transactions on Healthcare Systems
Engineering. He is also an editor of the Journal of Simulation and an associate
editor for IEEE Transactions on Semiconductor Manufacturing. He is a Fellow
of the Institute of Industrial Engineers (IIE) and currently serves as the IIE Vice
President for Continuing Education, is a former INFORMS Vice President, and is
an SCS representative on the Winter Simulation Conference Board of Directors.