Dynamics La Management
Dynamics La Management
http://hdl.handle.net/10251/35748
García García, A.; Blanquer Espert, I.; Hernández García, V. (2014). SLA-driven dynamic
cloud resource management. Future Generation Computer Systems. 31:1-11.
doi:10.1016/j.future.2013.10.005.
http://dx.doi.org/10.1016/j.future.2013.10.005
Copyright
Elsevier
SLA-Driven Dynamic Cloud Resource Management
Andrés Garcı́a-Garcı́aa , Ignacio Blanquer Espert, Vicente Hernández Garcı́a
Institut d’Instrumentació per a Imatge Molecular (I3M). Universitat Politècnica de
València, Camino de Vera S/N, 46022 Valencia (Spain)
a
[email protected]
Abstract
As the size and complexity of Cloud systems increases, the manual man-
agement of these solutions becomes a challenging issue as more personnel,
resources and expertise are needed. Service Level Agreement (SLA)-aware
autonomic cloud solutions enable managing large scale infrastructure man-
agement meanwhile supporting multiple dynamic requirement from users.
This paper contributes to these topics by the introduction of Cloudcompaas,
a SLA-aware PaaS Cloud platform that manages the complete resource life-
cycle. This platform features an extension of the SLA specification WS-
Agreement, tailored to the specific needs of Cloud Computing. In partic-
ular, Cloudcompaas enables Cloud providers with a generic SLA model to
deal with higher-level metrics, closer to end-user perception, and with flexi-
ble composition of the requirements of multiple actors in the computational
scene. Moreover, Cloudcompaas provides a framework for general Cloud
computing applications that could be dynamically adapted to correct the
QoS violations by using the elasticity features of Cloud infrastructures. The
effectiveness of this solution is demonstrated in this paper through a simula-
tion that considers several realistic workload profiles, where Cloudcompaas
achieve minimum cost and maximum efficiency, under highly heterogeneous
utilization patterns.
Keywords: Service Level Agreement, Cloud Computing, Quality of
Service, monitoring
1. Introduction
Cloud computing is currently being used in different application domains,
such as industry, science, and government [1][2][3]. Therefore concepts re-
2
a specific good. In the context of Cloud computing, SLAs are considered to
be machine readable documents, which are automatically managed by the
provider’s platform.
The main objective of this paper is to advance on the vision of Cloud
computing as a utility, providing components for fulfilling the requirements
of applications that requires QoS guarantees. To this end, this paper presents
Cloudcompaas1 , a SLA-driven Cloud platform that manages the complete
lifecycle of the resources through the utilization of agreements. Cloudcom-
paas covers all the steps involved on the management of SLAs, from the
set-up of the agreement with the final user, feeding the agreement into the
Cloud provider and interacting with the manager that allocates the required
resources in the infrastructure, to the monitoring of the agreement and per-
forming the necessary actions, in order to maintain the quality levels specified
in the SLA.
Cloudcompaas is based on standards, such as the WS-Agreement [11]
specification, for defining the agreements, and on open-source initiatives,
such as the WSAG4J [12] framework, for implementing a proof-of-concept
prototype. In this paper, the WS-Agreement specification has been tailored
to meet the needs of Cloud computing, and the WSAG4J framework has been
extended and adapted to deal with the complete lifecycle of the agreement,
as well as with other requirements that are specific to the domain.
The main contributions of this paper are:
i) Proposing SLAs and the WS-Agreement specification as a mean for the
representation of Cloud resources. This methodology is illustrated with
a concrete and extensible example model of generic Cloud resources.
ii) Proposing a novel SLA-driven architecture for the automatic provision,
scheduling, allocation and dynamic management of Cloud resources.
This architecture is based in the WS-Agreement specification. It pro-
vides a multi-provider and multilevel framework that allows building
and deploying arbitrary Cloud services that span different levels of the
Cloud. This architecture allows defining arbitrary QoS rules as well as
arbitrary preemptive and corrective self-management actions.
iii) Demonstrating the capabilities of the proposed architecture by the reso-
lution of a use case by a prototype implementation. A set of experiments,
using real world load profiles show the improvement on performance in
1
http://www.grycap.upv.es/compaas/
3
terms of cost and number of failed user requests, of the proposed archi-
tecture using elasticity (upscaling and downscaling) rules.
The paper is organized as follows. Section 2 provides an overview about
related works on the topic of SLAs and Cloud. Section 3 includes a brief
description of the WS-Agreement specification and the WSAG4J framework.
The Cloudcompaas platform is introduced in Section 4, including a model for
resource representation, the Cloudcompaas architecture and its components.
Section 5 introduce the Cloud resources management cycle and operations,
as well as the major contributions introduced by Cloudcompaas to this field.
Section 6 introduces a use scenario on Cloudcompaas platform, completed
with a set of experiments and discussion about the experimental results.
Finally, Section 7 summarises the conclusions of the paper.
2. Related works
Earlier definitions propose SLAs as a mean for the definition of QoS
constraints in electronic services [13]. Other works [14] propose SLA as a
mean for the autonomic management of services. More recently these two
concepts were brought together and SLA is used both for the definition of
requirements of resources and the automatic management of the complete
lifecycle of such resources.
Several specifications exist targeting SLA definition and management,
with different levels of maturity and completeness. WS-Agreement is one of
the most important specifications, which provides a protocol for establish-
ing an agreement between two parties that is the basis for the agreement
definition language and SLA management cycle presented in this paper.
The Web Service Level Agreement [15] (WSLA) is a framework and a
specification developed by IBM for the definition and monitoring of SLAs in
a machine readable format within the domain of web services. The WSLA
language defines the parties involved in the agreement, the description of the
service that the provider delivers to the consumer and the obligations of the
agreement, where the guarantees and constrains of the SLA are defined. The
WSLA framework is a tool for the SLA-driven management of the lifecycle
of web services, using the WSLA specification. This framework integrates
the usage of WSLA with other web services standards such as WSDL.
Other proposals for the definition of SLA are the SLAng [16] and WSOL
[17] languages. Both proposals are XML-based languages whose aim is to de-
fine QoS constraints in the domain of web services, and therefore are tightly
4
related to the web services technologies and standards. Unlike the afore-
mentioned proposals, these two specifications only define a language for the
expression of QoS levels, but do not account for agreement lifecycle or as-
sessment.
Similarly, significant advances have been made in the development of
SLA-aware distributed computing systems. In particular, several innovative
projects have considered SLA-aware automatic resource management in the
last decade.
In [18] an architecture for the provisioning of on-demand virtualized ser-
vices based on SLA is proposed. The authors define it as “the first attempt
to combine SLA-based resource negotiations with virtualized resources in
terms of on-demand service provision”, and represents a first step in the line
of automated SLA-aware Clouds systems. Further works deal with specific
facets of SLA management, such as a system for the monitoring of low level
metrics in distributed environments and its transformation to high level SLA
parameters [19].
More recently [20]proposes an architecture for an SLA-oriented resource
provisioning model for Cloud Computing. This architecture is realized us-
ing the Aneka platform [21], a solution that enables QoS-driven resource
provisioning for scientific computations, and provides mechanisms for the
definition of deadline constraints and the incorporation of multiple Cloud
resources.
Several European projects in the last years are related at different de-
grees to the SLA-aware management of resources and other topics covered
by Cloudcompaas.
Reservoir [22] is a pioneering European project whose aim is to enable
providers to build their own virtualized Cloud infrastructures. Although
Reservoir does not cover SLAs and dynamic management of resources, a
number of spin-out technologies and derived projects aim to provide these
capabilities, such as Claudia [23], BonFIRE [8], Optimis [24] and 4CaaSt
[25].
The Claudia toolkit aims to provide dynamic provision and scalability of
services in IaaS Clouds. However this tool neither user SLAs to represent
resources or QoS rules, nor cover the general management of Cloud resources,
nor provide general mechanisms for the QoS assessment.
BonFIRE is a European project whose aim is to provide a platform for
the federation of Cloud deployments enabling developers to deploy and man-
age Cloud services in a unified environment, providing service metrics and
5
monitoring. However BonFIRE does not use SLA to represent resources.
It does not include QoS assessment, dynamic management of resources or
resource scheduling.
Optimis is a European project aimed to enable private Cloud to automat-
ically interact with public Cloud providers, optimizing the usage of resources
by means of Cloud federation, cloudbursting, live migration and autoscaling.
Optimis performs scheduling operations by deciding the best provider to host
resources. Optimis provides a domain-specific extension of WS-Agreement
to specify requirements at IaaS level and constraints in Cloud services.
4CaaSt is a European project aimed to provide a platform for the de-
ployment, management and trade of Cloud services. This platform pro-
vides developers with automatic scaling and management of resources. It
allows providers to federate their resources in a common marketplace and en-
ables users to compose services. However this platform does neither include
SLAs for the representation of resources, nor dynamic QoS management, nor
scheduling operations. Also it mainly focuses on the PaaS level of Cloud.
SLA@SOI [26] has among one of its aims the implementation of a frame-
work of tools and components that enables the creation of SLA-aware Service
Oriented Infrastructures (SOI). As a large scale projects, its developments
span all the facets of SLA such as a SLA definition language, negotiation,
monitoring, violation prediction and detection, etc. The SLA@SOI has devel-
oped a methodology for the SLA-aware management of infrastructures and
services, and encompasses activities such as dynamic service discovery and
composition, service monitoring and assessment, infrastructure planning and
optimization etc. However this project does not consider Cloud Computing
infrastructures as their target platform, and hence it does not account for
some specific needs of this field.
Cloud-TM [27] is a European project aimed to provide a data-centric
PaaS middleware for the development of distributed Cloud applications. Its
two major aims are to ease the development of Cloud applications by pro-
viding high level data management abstractions and to provide self-tuning
mechanisms that optimize data operations based on user QoS constraints.
The SLA system is based on SLA@SOI. However this project does not cover
the PaaS and SaaS levels of Cloud Computing, and is focused in data-centric
Cloud applications, instead of general purpose Cloud Computing.
Cloudscale [28] is a European project focused on offering a system to
automatically scale Cloud applications with minimal human interaction. Al-
though it covers extensively the upscaling and downscaling of application,
6
this project does not use SLAs for the representation of resources. It does
not account for the deployment and scheduling of resources and does not
provide a general mechanism for the dynamic QoS management of resources.
PaaSage [29] is a recent European project whose aim is to build an Inte-
grated development environment to enable designers and developers to au-
tomatically deploy and optimize Cloud services, provide runtime monitor-
ing and dynamic adaptation, intelligent metadata retrieval, multi provider
support, etc. Although this project covers several topics dealing with QoS
assessment and dynamic management of resources, it does not use SLAs for
the definition of resources and QoS rules nor cover all the levels of Cloud
Computing.
Finally Contrail [30] aimeds to federate Cloud resources by providing
unified interfaces for accessing resources. It covers all three levels of Cloud
by providing IaaS, PaaS and SaaS resources. It cover several topics regarding
SLAs that have been developed inside Cloudcompaas, such as architecture
for SLA management, dynamic QoS assessment, monitoring, accounting and
billing.
This section has presented an analysis of the state of the art in SLA man-
agement in Cloud computing environments. Although significant advances
has been achieved in this field, there are several issues that require further
developments. Particularly most of the presented solutions does not provide
a generic and standard representation of resources, the do not account for
the automatic provision and scheduling of resources and they do not account
for the QoS assessment of resources. These projects highlight the relevance
of these topics in the development of current and future Cloud platforms and
the efforts being made on provide comprehensive solutions for them.
This paper introduces Cloudcompaas, a SLA-driven PaaS Cloud platform
for the assessment of Cloud resources. The SLA representation as well as the
basic structure for the agreement monitoring and billing are based on the WS-
Agreement standard. Cloudcompaas performs static resource deployment
scheduling based on a resource definition model . Additionally Cloudcompaas
performs live evaluation of the QoS rules based on monitoring information
retrieved from Cloud resources. A self-management module processes this
information and performs corrective actions to avoid, or at least to reduce
the number of SLA violations. Finally, Cloudcompaas provides a framework
for general Cloud computing applications. The extensive and generic nature
of Cloudcompaas enables the definition of SLAs for a variety of application
requirements and Cloud resources.
7
3. WS-Agreement and WSAG4J
This section briefly introduces two basic foundations of the Cloudcom-
pass platform: the WS-Agreement specification and the WSAG4J framework.
WS-Agreement defines a Web Services protocol for establishing agreements
between two parties using an extensible XML language. The specification
consists of three parts: a schema for specifying agreements, a schema for
specifying agreement templates and a set of operations for managing the
agreement lifecycle. Although the specification is defined as a Web Services
protocol, WS-Agreement is the language of choice for the agreement specifi-
cation in many grid and Cloud SLA projects [31] due to its extensible nature.
However, despite of its wide adoption, the specification fails to address some
particular needs encountered in the Cloud Computing field. The main draw-
back is the lack of a definition for a self-management module, even though
it includes monitoring, guarantee assessment and billing.
WS-Agreement defines the contents of the agreements in two major sec-
tions, the Service Description Terms and the Guarantee Terms. These two
sections can be used to describe the static and the dynamic requirements of
the user, respectively.
Service Description Terms (SDTs) describe the assets that the consumer
requests to the provider, such as physical resources, computational services or
others. WS-Agreement does not specify how these assets are described, since
the content of each service term depends on each domain. WS-Agreement
defines the content of the service terms in an extensible manner, describing
it using a Domain Specific Language (DSL).
Guarantee terms express an assurance on the quality of the delivery of the
assets defined on the service terms, for instance the response time of a ser-
vice. This guarantee is expressed in terms of Service Level Objectives (SLO).
A Service Level Objective is an assertion involving service attributes, exter-
nal factors - such as date or time - or an expression over Key Performance
Indicators (KPIs), which are domain-specific variables that hold runtime in-
formation about the assets, such as the response time or the availability. A
guarantee term also includes a qualifying condition and a business value list.
Qualifying conditions express the conditions under which the guarantee term
is valid. For instance, a guarantee over the response time of a service may
only be valid if the number of requests is under a certain threshold. The
business value list defines different value aspects of the objective, such as
importance, rewards for successfully fulfilling the guarantee or penalties for
8
not delivering the expected service quality.
3.1. WSAG4J
WSAG4J is a framework developed as a realization of the WS-Agreement
specification. It supports the specification of agreements, supporting opera-
tions such as the creation and the validation of agreements, and evaluation
and accounting of guarantees. However, similarly to the WS-Agreement spec-
ification, the majority of the operations that depend on the back-end (such
as, reservation of resources) has to be implemented for the specific domain.
Figure 1 presents the flow of operations within WSAG4J for the dy-
namic management of agreements. The monitoring process starts with a
MonitorableAgreement instance (1). This class instance is created for each
agreement registered in the system. The MonitorableAgreement schedules
an AgreementMonitorJob (2) for execution. For every monitoring cycle the
AgreementMonitorJob calls the AgreementMonitor (3), the class responsible
of updating the agreement state. A monitoring cycle is a fixed lapse of time
defined by the provider, and represents the grain of the monitoring process.
Smaller monitoring cycles provide more accurate information about the re-
sources, but also increases the load of the system which on the other side can
9
degrade performance. Larger monitoring cycles provide less precise informa-
tion about resources, but in turn decrease the load of the system. Discussion
about the balance between the accuracy of the information and the system
load is presented in [32].
The AgreementMonitor uses the ServiceTermStateMonitor (4) to update
the SDT state, calling every IServiceTermMonitoringHandle (5) available in
the current MonitoringContext. These classes are domain-specific, and they
update the state of the different SDTs. After the SDT states have been
updated, the AgreementMonitor calls the IGuaranteeEvaluator (6), which
determines the state of each guarantee using the SDT information, and it
uses the IAccountingSystem (7) to issue rewards or penalties, as specified by
the agreement.
WSAG4J framework provides a set of basic operations, such as monitoring
and accounting, which can be used as the foundations for the development
of a SLA-aware platform. However, it lacks several key features for Cloud
Computing environments, such as an autonomic management mechanism.
Section 5 describes an extension of the WSAG4J framework to include a
self-management protocol.
4. Cloudcompaas
This section introduces Cloudcompaas, an SLA-driven PaaS Cloud plat-
form for the dynamic management of resources. The section begins intro-
ducing how Cloudcompaas uses WS-Agreement as a standard for the repre-
sentation of generic Cloud resources and QoS rules. The section continues
introducing the architecture of Cloudcompaas for the SLA-driven automatic
provision, allocation and scheduling of resources, and ends introducing the
architecture of Cloudcompaas for the SLA-driven dynamic management of
Cloud resources, extending the WSAG4J framework with autonomic decision
making capabilities.
10
4.1.1. IaaS
The basic building block of the IaaS level is the Virtual Machine. A
Virtual Machine (VM) is the aggregation of hardware resources, or simply a
hardware configuration.
This model has the following requirements.
• Several Physical Resources of the same kind can exist, each one with a
different value.
• The system can store Physical Resources not related to any VM.
• The model comprises at least one resource from one of the following
types: Cores, Memory, Network and Architecture.
The restrictions that our resource model set on a VM are the following.
• A VM must have exactly one resource of the following types: Cores,
Memory, Network and Architecture.
4.1.2. PaaS
The basic building block of the PaaS level in Cloudcompaas is the Virtual
Container [33]. A Virtual Container is a software stack composed by a hierar-
chy of components. The four level hierarchy model has been designed based
on the requirements of the use cases of the platform. An unbounded hierar-
chy model provides more flexibility in the definition of Virtual Containers,
but introduces more complexity in the management of the model.
This model has the following requirements.
• A hierarchy of software dependencies.
11
The restriction of the model is that a Virtual Container cannot have two
Virtual Runtimes, Software Resources or Software Add-ons of the same type.
Finally, a Virtual Container must have at least one Virtual Runtime.
4.1.3. SaaS
The building block of the SaaS level is the service, which has been re-
named as Virtual Service in the data model. This name keeps coherent the
nomenclature used through this section, and it also helps to discern the spe-
cific services running inside a Cloudcompaas deployment from generic Web
Services or Cloud Services.
This model has the requirement that a Virtual Service can have multiple
versions.
The restrictions of the model are.
4.1.4. Users
This simple model is utilized to represent users in the system. The ele-
ment Organization is introduced to offer the possibility of representing simple
user associations, similar to the Grid concept of Virtual Organizations.
In this model a User must belong at least to one Organization. A User
can blong to several Organizations.
12
Figure 2: Cloudcompaas architecture.
and register a new agreement. The four basic operations supported by this
component are searching, creating, querying and deleting.
The searching operation enable users to retrieve agreement templates
stored on the platform according to different criteria. The creating oper-
ation sends the SLA Manager an agreement offer. The component checks
that the offer complies with the agreement template. If this operation fails,
the offer is rejected. If the offer is well defined, the SLA Manager sends it
to the Orchestrator to schedule its deployment. If this operation fails, for
instance because no free resources are available, the offer is rejected. After
an agreement has been accepted and its resources have been allocated, the
SLA Manager registers the agreement in the Monitor component. The query
operation enable users to retrieve the state of agreements that they have
sent to the platform (including the rejected ones) and to delete an active
agreement. The deleting operation provide users with a mean to deallocate
the resources associated with an agreement and stop its monitoring. The
SLA Manager checks if the appointed agreement is currently active and if
the user has rights to delete it before interacting with the Orchestrator and
the Monitor to delete the agreement.
4.2.2. Orchestrator
The Orchestrator is the central component of the platform, it has a com-
plete view of the status of all the elements and acts as a global coordinator.
When a new SLA is accepted by the SLA Manager, a deployment request is
sent to the Orchestrator. This component keeps a global view of all the avail-
able Cloud backends, and performs the scheduling of the resource allocation
based on the SLA requirements and the available resources. The Orchestra-
13
tor manages the deployment process by delegating the allocation operations
to a set of IaaS, PaaS and SaaS connectors. Hence, the scheduling procedure
selects the Cloud backend that will deploy the resources. The Orchestrator
performs a sequential process for the allocation of resources for each level
of the Cloud. It communicates with the Infrastructure Connector, Platform
Connector and Service Connector to deploy IaaS, PaaS and SaaS resources as
required on this order, and feeds each level with information retrieved from
the previous one. Using this procedure, the Orchestrator can deploy Cloud
resources from one level on top of Cloud resources from the lower levels of
the hierarchy.
14
4.2.4. Platform connector
The behavior of this component is essentially the same as the Infras-
tructure Connector, but applied to PaaS Clouds back-ends. The Platform
Connector receives the SLA representation of a Virtual Container and trans-
lates it to the specific representation of the PaaS backend. Then it deploys
and configures the Virtual Container. Once the resources are allocated, the
endpoint reference of the Virtual Containers and other relevant information
is retrieved. As a particular case, the SLA can specify that the Virtual Con-
tainers must be hosted in the Virtual Machines deployed previously. In this
case, the plug-in must retrieve the endpoint references of the Virtual Ma-
chines and contextualize them with the Virtual Container software stack if
needed. From there on the operations will proceed as usual.
4.2.6. Catalog
The Catalog implements an Information System, by means of a dis-
tributed and replicated database accessible through a RESTful API. The
other components use this Information System for retrieving and storing a
variety of information, such as agreements, agreement templates, and run-
time and monitoring information.
15
• The SLA is completed. This condition is met if the SLA is defined for
a certain period of time, or when it is defined on the basis of particular
objectives (e.g. associated to an individual experiment or execution),
which must be completed.
• The SLA is terminated by the consumer.
• The SLA is rejected by the provider. An accepted SLA can be rejected
by the platform at any time, although this form of termination may
involve penalties to the service provider.
The update operation consists on the retrieval of the status of the SLA
terms from the Catalog component. Derived or external values, such as
the current time and date, are also computed on this step. The check step
uses the values retrieved in the previous step to determine the status of
the guarantee terms. The monitor evaluates the formulas of the guarantee
terms and sets the value of the guarantees to either Fulfilled or Violated.
The self-management step performs the assessment operations based on the
outcome of the guarantee check. For each guarantee evaluated as Fulfilled,
the Monitor performs the billing operation, charging the user for the service.
For each guarantee evaluated as Violated, the Monitor performs corrective
actions aimed to restore the proper functioning of the service. Corrective
actions are domain-specific functions that act on the configuration of the
resources. This step effectively implements the QoS assessment capabilities
of the platform. The WSAG4J framework has been used as the basis for the
development of the SLA Manager component. This open-source framework
has been extended with new components and operations to fulfil the needs of
Cloudcompaas. Figure 3 shows the structure of the modules that have been
modified in WSAG4J to perform the dynamic management of the SLA.
The first major modification to WSAG4J is that the guarantee evaluation
and the SDT monitoring have been decoupled. In the original implementa-
tion of WSAG4J, SDT monitoring is scheduled at fixed periods of time, with
the guarantee evaluation done right after it. However, this behaviour lacks
from the flexibility needed to implement the WS-Agreement specification,
which allows service providers to define the monitoring cycle for each guar-
antee term. Therefore, the monitoring of the state and the evaluation of the
guarantees must be performed independently.
In order to perform these operations, the CloudcompaasMonitor (1) sched-
ules the execution of a ServiceTermJob (2a) and a GuaranteeTermJob (2b)
16
Figure 3: SLA Manager architecture
for each SLA. These two classes are in charge of updating the monitoring in-
formation and evaluating the QoS rules for an agreement, respectively. They
exchange information by storing and retrieving the monitoring information
in a shared SLA instance, and using synchronization mechanisms to avoid
race conditions.
The ServiceTermJob executes a TermMonitor (3a) at each monitoring
cycle, which updates the state of the SDT. The TermMonitor updates the
state of each individual SDT by executing each one of the available Ser-
viceTermMonitoringHandler (4a), which are individual handlers designed to
update the different SDT. For instance, the VirtualMachineMonitoringHan-
dler is the handler that gathers the monitoring information that concerns to
the virtual machines, and decides the SDT state, accordingly.
The GuaranteeTermJob (2b) executes an IGuaranteeEvaluator (3b), which
updates the state of each guarantee term of the agreement. Similarly to the
original implementation of WSAG4J, the IGuaranteeEvaluator uses an IAc-
countingSystem (4b) to issue rewards and penalties.
One of the major limitations of the original WSAG4J approach, where
SDT monitoring is scheduled at predefined times, is that in WSAG4J the
same guarantee term cannot define different monitoring cycles for different
17
business values (for instance, a 5 min interval for reward, and a 1 min inter-
val for penalty). Cloudcompaas defines a group of GuaranteeTermJobs per
agreement, using a possibly different monitoring cycles per job defined by
the templates.
The second major modification to WSAG4J is the introduction of a self-
management component in the monitoring process. In the original imple-
mentation, the IGuaranteeEvaluator issues penalties or rewards by means of
the IAccountingSystem. However, a capability of autonomic decision making
based on the agreement terms is needed, in order to achieve a SLA-aware
self-managed platform.
In Cloudcompaas this capability is introduced by the ISelfManagement
(5b) component. This component is instantiated when the IGuaranteeEvalu-
ator evaluates that a guarantee state is violated, and it performs the required
operations to restore the state of the violated guarantee.
Each ISelfManagement groups a set of ICorrectiveAction (6b), and in-
terfaces with the matchmaking system. The ICorrectiveAction provides the
domain-specific implementation of the actions required to restore a violated
guarantee. These actions may range from very generic, general purpose ac-
tions to application specific actions. In [34] the authors discuss the dynamic
adaptation of Cloud resources, defining a hierarchy of actions to perform for
different typical Cloud scenarios. Several important contributions have been
made to this field in the recent years, such as the usage of a knowledge system
to decide the corrective actions to execute [35].
Implementing the WS-Agreement specification requires that the guaran-
tee states be checked at each monitoring cycle, issuing rewards and penalties,
and applying corrective actions (when needed). In this paper, we describe
this problem as the issue of the accumulation of corrective actions, as the fact
that the corrective action could take longer to execute than the monitoring
cycle of the associated guarantee. For instance one can consider an auto-
scaling guarantee that is assessed every 2 minutes, and a corrective action
that consists on the deployment of new virtual machines. In many cases, the
deployment could take more than 2 minutes to complete, and therefore, by
the time the guarantee is checked again, it will still be evaluated as Violated,
and another corrective action will be issued. This problem may also appear
when the correction of different guarantees lead to the same corrective action.
In this way, the issue of the accumulation of corrective actions has a nega-
tive impact in a SLA-mediated system, because introduces an unnecessary
overload and may cause violations of additional guarantees.
18
In Cloudcompaas, this issue has been addressed by using of a cooldown
approach. To this end, the ISelfManagement defines a period of time for each
corrective action to be issued again (namely a cooldown). Every time a new
request is received, the corrective action is registered as ‘ongoing’. When the
system receives a request for a corrective action, which is currently ongoing,
then it ignores the request. Additionally, after the cooldown expires, the
ongoing action is deleted from the registry.
Beside these two major modifications, other smaller changes have been
made in order to increase the flexibility of the WSAG4J framework and to
adapt it to the particular needs of a multi-tenant environment. For example,
Cloudcompaas has modified the language used for expressing conditions in
WSAG4J. One of the significant contributions to the language is the intro-
duction of array variables and math operations as possible values for the KPI
of the SLO of the guarantees. Using these variables, complex values can be
expressed, such as the mean CPU % usage of all deployed virtual machines.
Another example is the possibility of using expressions instead of literals in
the value of rewards and penalties. This change allows SLAs expressing dy-
namic prices for the resources. For instance, in Cloudcompaas, it is possible
to express the price as a function of the number of running Virtual Machines.
6. Case study
This section presents a set of experiments designed for different real-life
scenarios for Cloud resource provisioning. The aim of the experiments is to
simulate the user load over different Cloud deployments. The performance of
each experiment is measured in terms of the number of failed user requests
and the economic cost of the deployed machines. Discussion is made compar-
ing the performance of fixed deployment approaches to the Cloudcompaas
SLA-driven dynamic resource management based on in situ monitoring in-
formation.
19
variable size, defined by the user. This service is utilized to reflect user
requests served by an application demanding intensive CPU and memory.
The VirtualContainer is the software container that enables the execution
of virtual services. In the experimental setup, the virtual container is the Java
runtime.
The Virtual Machine defines the virtual hardware that hosts the virtual
containers and services. In the experimental setup, the virtual machine is a
‘small’ predefined instance, which has 512 MB of RAM and 1 CPU core.
In the experiments, these assets are deployed by the SLA Manager on an
IaaS Cloud deployment. The IaaS backend used for the experiments is an
OpenNebula deployment over a cluster composed by 8 nodes running Ubuntu
Server 10.04 (x86-64) on Intel c Xeon c Processor L5430 (12M Cache, 2.66
GHz, 1333 MHz FSB). The nodes have 16 GB DDR3 (1333 MHz) of RAM
memory and a Hard Drive Disk SATA II (7200 RPM).
20
instance of the services has a cost of 10 credits for each monitoring cycle.
As a reference, the cost a running a single Virtual Machine for 24 hours is
172,800 credits.
However these metrics does not enable measuring the tradeoff between
price and failures made between different configurations. Usually the elastic
configuration provide a lower value at the cost of increasing the number of
failed user requests. Whether the tradeoff is positive or negative depends
on the expected revenue produced by each user, or conversely the profit lost
for not serving users compared with the money saved by reducing the price.
In order to properly compare different configurations, derived metrics are
calculated for each scenario.
The average expected revenue per user r is defined as the total revenue
divided by the number of users served.
total revenue
r= (1)
total users − f ailed users
This value represents the average virtual money the service provider ex-
pects to obtain out of each individual user. Using this derived metric, it is
possible to calculate the profit made by different configurations with total
users t and failed users f .
(t − f1 ) ∗ Be − p1 = (t − f2 ) ∗ Be − p2 (3)
p1 − p2
Be = (4)
f2 − f1
For f1 and f2 the failed requests for the static and elastic scenarios respec-
tively and p1 and p2 the price for the static and elastic scenarios respectively.
This value indicates that for r < Be the elastic configuration outperforms
the static one. The higher the value, the wider the range of profits for which
21
the elastic configuration dominates. However, Be is an absolute value, and
hence it does not provide enough information to properly compare both con-
figurations.
The second derived metric is the ratio of overperformance Or. This value
represents the ratio of profit (i.e. difference between revenue and infrastruc-
ture cost) for which the elastic configuration provide better performance than
the static one. In order to derive Or from Be, the value of r for which the
profit is 0, r1 , is needed.
p1
r1 = (5)
t − f1
Be
Or = − r1 (6)
r1
Therefore, whenever the ratio of profit of the static scenario is lower than
Or, the elastic scenario will outperform the static one.
The third derived metric is the ratio of profitability P r. It might happen
that, since the price of the elastic configuration is lower, a value for r that
is not profitable in the static configuration is indeed profitable in the elastic
one. The profitability ratio represents the percentage of values of r that are
profitable for the elastic scenario but they are not for the static one. The
higher the value, the wider the range of revenues for which the elastic scenario
outperforms the static one. In order to calculate the ratio, the values of r, r1
and r2 , for which the profit of the static and elastic configuration is 0 need
to be calculated.
p2
r2 = (7)
t − f2
r2
Pr = 1 − (8)
r1
Moreover, the sign of Be and P r give information about the general
behaviour of the elastic configuration respect the static one. If both values
are positive, the elastic configuration outperforms the static one for r < Be.
If Be is negative and P r is positive, the elastic configuration outperforms
the static one for all values of r. If both values are negative, the elastic
configuration will never outperform the static one.
22
Total Failed
Scenario Config. Price Be Or Pr
requests requests
Static 40 1,205,830
Chemistry 247,971 987.78 19,823% 70.42%
Elastic 901 355,350
High-Energy Static 122 1,225,840
565,259 857.22 3,930% 2.51%
Physics Elastic 158 1,194,980
Static 150 1,206,610
Fusion 174,467 20,169.5 290,693% 66.86%
Elastic 190 399,830
23
Figure 4: Experiment for the Chemistry scenario.
2
Random failures correspond to failed requests due to network, computing or other
errors independent from server capacity. Experiments suggests a random failure margin
of 0∼200 for the presented scenarios.
24
for the downscale of the infrastructure when the load is not at its peak value.
Since the load is at sub-peak values for most of the experiment, by turning
off unused machines great savings are achieved.
Even though intuitively a reduction of a third of the price at the cost of
20 times more failures may intuitively seem a bad trade, the derived metrics
show that the elastic configuration performs well. The Be is 987.78, with a
Or of 19,823% and a P r of 70,42%. These values indicate that even though
the failures increase considerably, the reduction in price makes up for the
revenue lost by not serving the clients. Assuming that all the clients are
equally valuable, a service would need a profit higher than 19,823% the price
of the Cloud deployment in order for the static configuration to outperform
the elastic one. Also the P r of 70,42% shows that there is a wide range
of services that are not profitable under the static configuration, but are
profitable under the elastic one.
25
machines during the complete duration of the experiment, up until the end
when the load falls to 0. The elastic scenario therefore achieves a small saving
in the cost while providing virtually the same number of failures, since both
experiment are within the random failure margin.
This fact is reflected on the derived metrics. The elasticity configuration
has a negligible P r of 2.51%, since it essentially behaves as the static config-
uration. The Be is 857.22 and the Or 3,930%, which is sensibly lower than
the Chemistry scenario. Nevertheless the value of Or indicate that the little
savings obtained by the elasticity scenario for the same number of failures
produces a positive performance.
Figure 6 depicts the experimental results for the Fusion scenario. Fusion
has an execution profile that exhibits a low load for most of the time, with
a peak of very high load on the second half of the profile. This peak last
for a small fraction of the total duration, and it has the steepest variation
of all the experiments. The fixed configuration produces 150 failed requests
with a total price of 1,206,610 credits, while the elastic scenario produces 190
failures for a price of 399,830 credits.
The most notable comparison is that the number of failures is very sim-
ilar in both cases. This fact is due to the ability of the platform to detect
an increase in the load and react quickly deploying new replicas when the
load peek occurs, with a small number of failures happening as load goes up.
26
The static configuration of this scenario is a clear case of overprovisioning,
since the resources are idle for most of the execution. Overprovisioning sce-
narios are favourable for an elastic configuration, since great savings could
be achieved by turning off idle resources. The experimental metrics back off
this intuitive appreciation, as the price of the elastic scenario is roughly one
third of the static one.
Derived metrics indeed show that this is the scenario for which the elastic
configuration achieves the best performance. With a Be of 20,169.5 and a Or
of 290,693%, that is, around 17 times higher than the Chemistry scenario.
This is due to the fact that the elastic configuration yields almost the same
number of failures than the static configuration, but provides a substantial
saving in price. On the other hand the P r is 66.86%, comparable to the value
obtained for the Chemistry scenario.
Some general conclusions can be drawn from the experimental results
obtained from the different scenarios. First, the experimental results show
the general tendency of elastic configurations to yield a higher number of
failures, due to the delay in the deployment of new resources in response to
load increases, and lower price, due to the deployment of less resources when
the load is low. Second, for services that serve a large number of users and
rely on economies of scale to provide profits, the tradeoff between failures
and cost is positive, since the opportunity cost of not serving users is lower
than the money saved in the infrastructure. Third, the performance of the
elastic configuration depends on the load profile of the service. For a load
with little variance, the elastic configuration provides little advantage. For an
average variable load, the elastic configuration provides a large performance
improvement. The best scenario occurs for very irregular loads with large
peaks of load, where the elastic configuration vastly outperforms the static
one by avoiding overprovisioning of resources. Fourth, the P r values for the
variable load profiles show that there is a wide range of services that not
being profitable under an static configuration might indeed be profitable by
adopting an elastic configuration.
7. Conclusions
Currently, the integration of SLA in Cloud Computing systems is an
important field of research, as the increasing number of projects focused on
this area certifies. Service Level Agreements, as well as all its concerning
facets such as SLA definition language, negotiation, monitoring, etc. have
27
been subject of research for years, but the advent of Cloud Computing, and
the need of means for defining and ensuring QoS levels have greatly increased
the interest on these developments.
The automation of large scale systems management is another promis-
ing feature of the SLA-aware autonomic Cloud solutions. As the size and
complexity of Cloud systems increases, the manual management of these so-
lutions becomes a challenging issue. [37] identifies four key topics and trends
in Cloud Computing, two of which are specifically addressed by means of
SLA, namely the Cloud Markets and the Cloud Resource Management.
This paper contributes to these topics by the introduction of Cloudcom-
paas, a SLA-aware PaaS Cloud platform that manages the complete resource
lifecycle. This platform features an extension of the SLA specification WS-
Agreement, tailored to the specific needs of Cloud Computing applied to the
e-science domain, addressing issues not considered by other research projects.
Finally, a complete working prototype of the proposed platform has been
presented as a proof of concept of the platform, showcasing an elastic de-
ployment. A set of experiments for different load profiles and configurations
measure key metrics that quantify the performance improvement provided
by using an elastic configuration.
The integral approach followed by the Cloudcompaas project features
some open problems that originates a wide variety of research lines. These
research lines include, but are not limited to a negotiation protocol for the
establishment of SLA, the design of monitoring systems, a decision making
system for the election of the corresponding corrective action for each type
of violation, distributed SLA monitoring and a disaster recovery protocol.
Acknowledgements
This work has been developed under the support of the program For-
mación de Personal Investigador de Carácter Predoctoral grant number BFPI/
2009/103, from the Conselleria d’Educació of the Generalitat Valenciana.
Also, the authors wish to thank the financial support received from The
Spanish Ministry of Education and Science to develop the project ‘Code-
Cloud’, with reference TIN2010-17804.
References
[1] The VENUS-C consortium, Virtual multidisciplinary environments us-
ing clouds (2012).
28
URL www.venus-c.eu
[6] R. Zhang, L. Liu, Security models and requirements for healthcare appli-
cation clouds, in: Cloud Computing (CLOUD), 2010 IEEE 3rd Interna-
tional Conference on, 2010, pp. 268 –275. doi:10.1109/CLOUD.2010.62.
[8] The BonFIRE Consortium, Building service test beds on FIRE (2012).
URL http://www.bonfire-project.eu/
[9] S. Phillips, V. Engen, J. Papay, Snow white clouds and the seven
dwarfs, in: Cloud Computing Technology and Science (CloudCom),
2011 IEEE Third International Conference on, 2011, pp. 738 –745.
doi:10.1109/CloudCom.2011.114.
29
[10] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, I. Brandic, Cloud com-
puting and emerging IT platforms: Vision, hype, and reality for deliv-
ering computing as the 5th utility, Future Gener. Comput. Syst. 25 (6)
(2009) 599–616. doi:10.1016/j.future.2008.12.001.
URL http://dx.doi.org/10.1016/j.future.2008.12.001
30
[18] A. Kertesz, G. Kecskemeti, I. Brandic, An SLA-based resource virtual-
ization approach for on-demand service provision, in: Proceedings of the
3rd international workshop on Virtualization technologies in distributed
computing, VTDC ’09, ACM, New York, NY, USA, 2009, pp. 27–34.
doi:10.1145/1555336.1555341.
URL http://doi.acm.org/10.1145/1555336.1555341
[25] The 4CaaSt Consortium, Building the PaaS Cloud of the Future (2012).
URL http://4caast.morfeo-project.org/
31
[26] The SLA@SOI consortium, SLA@SOI Empowering the service economy
with SLA-aware infrastructures (2011).
URL http://sla-at-soi.eu/
[29] The PaaSage Consortium, PaaSage: Model Based Cloud Platform Up-
perware (2012).
URL http://www.paasage.eu/
[30] The Contrail Consortium, Open computing infrastructure for elastic ser-
vices (2012).
URL http://contrail-project.eu/
32
(SERVICES-1), 2010 6th World Congress on, 2010, pp. 527 –534.
doi:10.1109/SERVICES.2010.26.
33