Software Architecture Issues On Cloud
1
Platforms & Applications Architecture
2
Building Quality Into Cloud Apps
Challenges
- Failures
- Interference
- Security
- …
Tactics
Build QAs into apps:
- Scalability
- Availability
- Reliability
- Security …
Exploit/mitigate platform characteristics
e.g. software abstraction of hardware
3
Cloud vs. Non-cloud
Cloud vendors address some issues faced by non-cloud
data centers
Provisioning of server resources
Physical security
Hiring/training data center personnel
Several other issues still remain the same
Network intrusion threats from outside
Isolating and managing production/test
environments
Installation of updates/patches
Some new or now-changed issues have arisen
These are the issues that we bring out
4
Key Issues Specific To Cloud
Security/privacy
Multi-tenancy
Access keys/credentials
Dependency on geographic/legal jurisdiction
Failure
Failure of VM instances
Data consistency failures
Software upgrade error
5
Key Issues Specific To Cloud
Performance
Network latency
How fast can you provision
Elasticity: over/under provisioning
Performance interference due to VM co-location
6
Details about Key Issues
7
Failure Defined
A system failure occurs when the delivered service no
longer complies with the specifications, the latter
being an agreed description of the system's expected
function and/or service
Related terms:
Faults (defects or bugs)
Errors (expected and actual behavior differ)
Faults and errors may lead to failure
8
Failure Rates Get Amplified In Cloud
Failure rates
Servers experience 2-4% annual failure rates (AFR)
Disk drives have about 4-6% AFR
For a server AFR of 3%, MTBF is about 292,000 hours
More than 30 years
In a data center with 64,000 servers, each having 2 disks
Daily, more than 5 servers and about 15 disks can fail
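The figures above can be checked with a short calculation. This is a minimal sketch that assumes a constant failure rate, so that MTBF is simply the hours in a year divided by the AFR; the function names are illustrative and not from the slides.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours


def mtbf_hours(afr: float) -> float:
    """Approximate MTBF in hours from an annual failure rate (0.03 means 3%)."""
    return HOURS_PER_YEAR / afr


def expected_daily_failures(units: int, afr: float) -> float:
    """Expected number of failing units per day in a fleet of the given size."""
    return units * afr / 365


servers = 64_000
disks = servers * 2  # 2 disks per server, as on the slide

server_mtbf = mtbf_hours(0.03)
print("server MTBF (hours):", round(server_mtbf))
print("server MTBF (years):", round(server_mtbf / HOURS_PER_YEAR, 1))
print("servers failing per day:", round(expected_daily_failures(servers, 0.03), 1))
print("disks failing per day at 4% AFR:", round(expected_daily_failures(disks, 0.04), 1))
print("disks failing per day at 6% AFR:", round(expected_daily_failures(disks, 0.06), 1))
```

Under these assumptions a 3% AFR gives roughly 5.3 failing servers per day, and the 4-6% disk AFR range gives between about 14 and 21 failing disks per day, which brackets the slide's figure of 15.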
9
Handling Failure In Cloud
Before you handle a failure you must detect it
Heartbeat remains the key tactic for identifying
failures
Must have a monitor that watches aliveness of
VMs
A monitor can be in: infrastructure, client or part of the
application
A VM must periodically show its aliveness to the
monitor
By responding to some query
Sending some message
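The heartbeat tactic can be sketched as a monitor that records when each VM last reported in and flags any VM that has been silent for longer than a timeout. The class name, the timeout value and the injectable clock are assumptions made for illustration; a real monitor would receive heartbeats over the network.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per VM and reports VMs that went silent."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the sketch can be tested
        self.last_seen = {}         # vm_id -> time of most recent heartbeat

    def beat(self, vm_id: str) -> None:
        """Called whenever a VM shows its aliveness."""
        self.last_seen[vm_id] = self.clock()

    def failed_vms(self) -> list:
        """VMs whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            vm for vm, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Usage with a fake clock: vm-b stops sending heartbeats.
fake_time = [0.0]
monitor = HeartbeatMonitor(timeout_s=5.0, clock=lambda: fake_time[0])
monitor.beat("vm-a")
monitor.beat("vm-b")
fake_time[0] = 4.0
monitor.beat("vm-a")
fake_time[0] = 7.0
print(monitor.failed_vms())
```

The timeout trades detection speed against false alarms: a short timeout finds failures sooner but may flag a VM that is merely slow.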
10
Stateful vs. Stateless Instances
11
Recovery For Stateful Instance
Multiple tactics are possible depending upon
Whether application stored state in the VM itself or on
external device
Application’s tolerance for loss of computation
Availability of VM check-pointing
Basic tactic is to:
Keep saving the most recent state data somewhere safe
Restore back to last saved state on detecting failure
12
Stateful Recovery | Tactic 1
Assumes that application keeps state in VM itself
No state dependency on external device
Works for cloud only
Checkpoint VM state at regular intervals
On detection of failure, restore and start from the last
check-pointed VM image
Requests arriving between the time of failure and the restore to
the last checkpoint get lost
[Figure: Client, Internet, App, Storage, Monitor, VM check-pointing (in cloud)]
14
Stateful Recovery | Tactic 2
Application saves recent state on external device
Done at regular intervals
On detection of failure:
Check for recent state saved on external device
Resume from state restored from external device
Requests arriving between time of failure until
restoring to last saved state get lost
Recovery mechanism needs to be coded into
application itself
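A minimal sketch of Tactic 2, assuming the "external device" is a file path on storage that outlives the VM and that the application saves state after every few requests rather than on a timer. The class name, the JSON state layout and the `save_every` parameter are hypothetical.

```python
import json
import os


class CheckpointedCounter:
    """Toy stateful app that periodically saves its state to external storage."""

    def __init__(self, path: str, save_every: int = 3):
        self.path = path
        self.save_every = save_every
        self.state = self._restore()

    def _restore(self) -> dict:
        # On start (or restart after failure), resume from the last saved state.
        if os.path.exists(self.path):
            with open(self.path) as f:
                return json.load(f)
        return {"processed": 0, "total": 0}

    def _save(self) -> None:
        # Write to a temp file first so a crash mid-write cannot corrupt the state.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.path)

    def handle(self, amount: int) -> None:
        self.state["processed"] += 1
        self.state["total"] += amount
        if self.state["processed"] % self.save_every == 0:
            self._save()
```

If the instance handles four requests and then fails, a replacement instance constructed with the same path resumes from the state saved after the third request; the fourth request is lost, which matches the limitation noted above.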
15
Stateful Recovery | Tactic 2
[Figure: Client, Internet, App in VM, Storage, Monitor (in cloud)]
16
Avoiding Lost Requests In Tactic 2
17
Stateless Applications | Requests Flow
18
Routing Requests To Instances
19
Architecture Of Push Based LB
1. Client sends request
2. Request arrives at LB
3. LB forwards to one of the VMs, e.g. in round robin
20
Role Of Monitor
Observes the VMs
Resource (CPU, memory etc.) utilization
Request load
Quality of service violations etc.
Sends VM stats to load balancer
Includes failure info
Decides, based on some rules, when more resources
are required
21
Working Of Push LB Pattern
Monitor detects VM failure
E.g. when VM becomes non-responding
LB gets to know this and stops sending requests to the
failed VM
Current in-progress requests handled by VM are lost
Client needs to detect this possibly via timeout
Client should resend the lost requests
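The push pattern can be sketched as a round-robin balancer that skips VMs reported as failed, plus a client-side helper that resends a request after a timeout. All names here are illustrative; `send` stands in for whatever transport the client actually uses and is assumed to raise `TimeoutError` when a VM does not respond.

```python
import itertools


class PushLoadBalancer:
    """Round-robin LB that stops routing to VMs the monitor reports as failed."""

    def __init__(self, vms):
        self.vms = list(vms)
        self.healthy = set(self.vms)
        self._cycle = itertools.cycle(self.vms)

    def mark_failed(self, vm) -> None:
        """Called when the monitor reports a non-responding VM."""
        self.healthy.discard(vm)

    def pick(self):
        if not self.healthy:
            raise RuntimeError("no healthy VM available")
        for vm in self._cycle:      # continues from the previous position
            if vm in self.healthy:
                return vm


def send_with_retry(lb, send, request, attempts=3):
    """Client side: detect a lost request via timeout and resend it."""
    for _ in range(attempts):
        vm = lb.pick()
        try:
            return send(vm, request)
        except TimeoutError:
            continue                # lost in-progress request, try again
    raise RuntimeError("request failed after retries")
```

Because the client may resend a request that the failed VM had partly processed, the application still has to tolerate repeated requests.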
22
Architecture Of Pull Based LB
1. Client sends request
2. LB places request in app specific queue
3. VM pulls the next available request from LB queue
4. VM tells LB to remove successfully processed request from queue
23
Role Of Monitor
Watch the application specific queues on LB
Waiting time for requests in a queue
Length of the queue
Infer load on a VM from queue stats
Decides, based on some rules, when more resources
are required
24
Working Of Pull LB Pattern
LB knows when a request has been processed by a VM
Requests that remain unhandled until a time limit get
reassigned by LB
A failed VM won’t pick requests from its queue
This automatically takes a failed VM out of service
Requests trapped in failed VMs
Can get processed when VM recovers
Application must handle duplicate processing scenario
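A minimal sketch of the pull pattern, assuming the LB queue hides a pulled request until a time limit passes and then offers it again, and that the application remembers which request IDs it has already processed. The class, the `process_once` helper and the explicit `now` argument are hypothetical, chosen to keep the example deterministic.

```python
class PullQueue:
    """Toy LB queue: a pulled request is hidden until its time limit passes."""

    def __init__(self, time_limit_s: float):
        self.time_limit_s = time_limit_s
        self.pending = {}        # request_id -> payload, in arrival order
        self.hidden_until = {}   # request_id -> time when it may be reassigned

    def put(self, request_id, payload) -> None:
        self.pending[request_id] = payload

    def pull(self, now: float):
        """Return the next available (request_id, payload), or None."""
        for request_id, payload in self.pending.items():
            if self.hidden_until.get(request_id, now) <= now:
                self.hidden_until[request_id] = now + self.time_limit_s
                return request_id, payload
        return None

    def ack(self, request_id) -> None:
        """VM tells the LB the request was processed successfully."""
        self.pending.pop(request_id, None)
        self.hidden_until.pop(request_id, None)


def process_once(done: set, request_id, payload: str, results: list) -> bool:
    """Idempotent handler: skip requests that were already processed."""
    if request_id in done:
        return False
    results.append(payload.upper())
    done.add(request_id)
    return True
```

If a VM pulls a request and then fails, the request becomes available to another VM once the time limit expires. When the failed VM later recovers and finishes the same request, `process_once` returns `False`, so the duplicate has no effect.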
25
VM Cleanup On IaaS Cloud
When a VM fails:
It is not automatically de-allocated
It needs to be de-allocated by consumer
Cloud provider continues to charge until the VM is
de-allocated
On de-allocation of a VM:
Its public and private IP addresses become available for
reassignment
Infrastructure can be told to assign released public IP
address to replacement VM
26
Summary
LB and monitor are key components
LB policies take failures into account
LB is augmented by a monitor component
Tactic has two flavors:
Push: LB decides which instance gets to serve a request
Pull: Instances pull the requests from a queue
maintained by the LB
27
Data storage issues on cloud
28
Data Storage
Provides the ability to persist digitized information
e.g. plain text files, binary files containing photos
Retains data for an interval of time
Length of time depends on storage type
e.g. RAM contents don’t survive machine restart, whereas hard
disk contents can
29
Data Storage Components
Can fall into different categories, for example:
Raw files on a file system (FS) on the operating system
Data stores such as RDBMS engines, key-value stores etc.
Characteristics/behavior of each category can
vary significantly
An application can use multiple types of data storage
components
E.g. can write to both the raw FS and a database table
30
Data Storage And Cloud
Mainly we are concerned about data storage on:
IaaS cloud
PaaS cloud
IaaS cloud because:
Cloud user has to manage the resources including
storage
PaaS cloud because:
As a developer you need to handle data storage from
within the applications you write
SaaS cloud case is not interesting
Because as a user you don’t write any software
here
31
Data Storage On Cloud
IaaS cloud vendors offer two data storage types:
Ephemeral
Persistent
Ephemeral storage doesn’t survive instance failure
Typically available as a block device attached to the VM
instance
Persistent storage is long lived
Cloud vendor automatically replicates
Geographically distributed replicas
Storage failures can lead to data consistency issues
Application needs to be prepared for this
32
Data Consistency In Applications
Consistency
Disallow multiple values of same piece of data when
seen by different clients at the same point in time
34
Data Consistency | CAP Theorem
[Figure: CAP theorem diagram. Labels: Partition Tolerance, System works, Data Model (relational, tabular/column-oriented, key-value store). A+C+P is impossible!]
36
Why Is It Important?
Because users need to be happy
“500 Internal Server Error” is the last thing I want to
see after punching in my credit card details
Next time I’ll shop elsewhere
It impacts* businesses too
An extra 0.1s in response time costs Amazon 1% in sales
Google found that a 0.5s jump in latency leads to a 20%
drop in traffic
* http://highscalability.com/latency-everywhere-and-it-costs-you-sales-how-crush-it
37
How To Address It
You need to choose any two from among:
Availability, consistency and partition tolerance
Drop availability
Services are unavailable until data is consistent on all
nodes
Drop partition tolerance
Prevent partitions from happening
Drop consistency
Expect that data becomes consistent eventually
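The difference between dropping availability and dropping consistency can be illustrated with a toy two-replica register. This is a deliberately simplified sketch: the class, the "CP"/"AP" mode names and the version-based reconciliation are assumptions for illustration, and conflicting writes on both sides of a partition are not handled.

```python
class ReplicatedRegister:
    """Two replicas of one value, with a switch that simulates a partition."""

    def __init__(self, mode: str):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False
        self.replicas = {"a": (0, None), "b": (0, None)}  # node -> (version, value)

    def write(self, node: str, value) -> None:
        if self.partitioned and self.mode == "CP":
            # Drop availability: refuse service until replicas can agree.
            raise RuntimeError("unavailable until all replicas are reachable")
        version = self.replicas[node][0] + 1
        # Drop consistency (AP): during a partition only the local replica is updated.
        targets = [node] if self.partitioned else list(self.replicas)
        for name in targets:
            self.replicas[name] = (version, value)

    def read(self, node: str):
        return self.replicas[node][1]

    def heal(self) -> None:
        """Partition ends: replicas converge on the newest version."""
        self.partitioned = False
        newest = max(self.replicas.values(), key=lambda pair: pair[0])
        for name in self.replicas:
            self.replicas[name] = newest
```

In "AP" mode a write during a partition succeeds but a read on the other replica returns the stale value until `heal()` runs, which is the "eventually consistent" case. In "CP" mode the same write raises an error, so clients never see diverging values but the service is unavailable for the duration.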
38
Summary
It is important to understand characteristics of data
storage mechanisms on cloud
Different storage services may behave differently
Data consistency in applications
Cannot achieve all three at the same time:
Consistency of data
Availability of data/services
Partition tolerance
39
Software Upgrade Induced Failures
40
An Example Scenario
[Figure: timeline between client/web browser and server-side VM. t1) S/W upgrade starts; t2) client initiates request]
42
A Solution Idea
Treat each application version as a separate
destination for requests
Client knows the latest version number that it has
interacted with
Load balancer routes requests based on version
number present in request header
A version xxx in the header is routed to instance whose
version is ≥ xxx
Route normally if no version number found in request
header
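The routing rule can be sketched as a function the load balancer applies to each request. The header name `X-App-Version`, the integer version numbers, the choice of the lowest eligible version, and the "first instance" fallback for normal routing are all assumptions made for this illustration.

```python
def route(headers: dict, instances: list, header_name: str = "X-App-Version") -> str:
    """Pick an instance for a request.

    instances is a list of (instance_id, version) pairs with integer versions.
    """
    requested = headers.get(header_name)
    if requested is None:
        # No version in the header: route normally (here, simply the first instance).
        return instances[0][0]
    minimum = int(requested)
    eligible = [inst for inst in instances if inst[1] >= minimum]
    if not eligible:
        raise LookupError("no instance with version >= %d" % minimum)
    # Prefer the lowest version that still satisfies the client.
    return min(eligible, key=lambda inst: inst[1])[0]


instances = [("vm-old", 1), ("vm-new", 2)]
print(route({"X-App-Version": "2"}, instances))
print(route({}, instances))
```

A client that has already talked to version 2 is never sent back to a version 1 instance during a rolling upgrade, while clients that carry no version are unaffected.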
43
Security and privacy
44
Information Security
“Information security means protecting
information and information systems from
unauthorized access, use, disclosure, disruption,
modification, perusal, inspection, recording or
destruction”
45
Core Artifacts: Credentials and Keys
47
Security Scenario | In-house vs. Cloud
[Figure: two setups, in-house data center and on cloud. In each, a Client reaches some service (e.g. storage or an application) over HTTPS, and a Developer is shown; annotation: needs access to private key for SSL/HTTPS]
48
Important Security Aspects On Cloud
Central issue:
Data and applications being in 3rd party (cloud vendor)
custody
Lack of trust in the cloud service provider
Access credentials and encryption keys
Management of keys and credentials
Privacy and security in multi-tenant
environment
Dependency on legal/geographical
jurisdiction
Local laws governing a cloud service provider
49
Bank vs. At-home | A Metaphor
[Figure: Bank vs. At-home. Labels: User (outside of cloud), HTTPS, Application in cloud (handles data unencrypted), Data store (unencrypted)]
51
Issues With Elementary Data Security
52
Credentials Management For Cloud
Important considerations:
Who has/needs access to credentials?
Will you need to change credentials? When?
Storage and automated provisioning of credentials
Some options for providing credentials in cloud
Build into the VM image
Supply as parameter during instance launch
Keep in some persistent storage
Send from client every time a new instance starts
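Two of these options, supplying a credential at instance launch and keeping it in persistent storage, can be combined in a small lookup helper. This is a sketch under stated assumptions: the launch parameter is assumed to surface as an environment variable, and the `/mnt/secrets` directory is a hypothetical mount point for persistent storage. Nothing is built into the VM image.

```python
import os


def load_credential(name: str, env=os.environ, secrets_dir: str = "/mnt/secrets") -> str:
    """Look up a credential without baking it into the VM image.

    Order: value supplied at launch (environment), then a file on
    persistent storage named after the credential.
    """
    value = env.get(name)
    if value:
        return value
    path = os.path.join(secrets_dir, name)
    if os.path.isfile(path):
        with open(path) as f:
            return f.read().strip()
    raise LookupError("credential %r not found" % name)
```

Keeping credentials out of the image means they can be changed without rebuilding it, and that a copied image does not leak them.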
53
Summary
Several security aspects remain the same on cloud as they
were in-house
e.g. from tightening of firewall rules to input validation
in your application
A few things have changed
Data/apps now live in 3rd-party custody
It is the cloud consumer’s responsibility to protect
its sensitive data
Can use encryption technologies to achieve this
54
Geo/Legal Jurisdiction Dependency
55
Impact On Cloud Consumers
Awareness of cloud vendor data center locations
Some vendors do not disclose the locations of their data centers
Backup locations may be chosen by the vendor
For optimization of its resources etc.
Ability for users to control data locations
Based on type of the data
Not every vendor offers this
56
How To Deal With It?
Use anonymization techniques for securing PI
E.g. replace PI with tokens and keep the PI ↔ token mapping
locally
Example:
Original data:
Shiva Kumar {Sensitive data}, e.g., bank account number
Anonymization tokens (kept locally)
Shiva Kumar {Token}, e.g., a number
Data stored in cloud:
Token
Sensitive data
Restore original data by taking join of token table and cloud
data table
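The token scheme above can be sketched as follows. The class name and the use of random hexadecimal tokens are assumptions; the slides only require that the PI ↔ token mapping stays local and that only the token travels to the cloud with the sensitive data.

```python
import secrets


class Tokenizer:
    """Keeps the PI <-> token mapping locally; only tokens go to the cloud."""

    def __init__(self):
        self.pi_to_token = {}
        self.token_to_pi = {}

    def tokenize(self, pi: str) -> str:
        """Return the token for a piece of PI, creating one if needed."""
        if pi not in self.pi_to_token:
            token = secrets.token_hex(8)   # random, reveals nothing about the PI
            self.pi_to_token[pi] = token
            self.token_to_pi[token] = pi
        return self.pi_to_token[pi]

    def restore(self, cloud_rows):
        """Join cloud rows (token, data) with the local token table."""
        return [(self.token_to_pi[token], data) for token, data in cloud_rows]


tokenizer = Tokenizer()
token = tokenizer.tokenize("Shiva Kumar")
cloud_rows = [(token, "ACC-0001")]          # what is stored in the cloud
print(tokenizer.restore(cloud_rows))        # original data, rebuilt locally
```

The account number `ACC-0001` is a made-up placeholder. The cloud copy links the sensitive value only to an opaque token, so it cannot be tied to a person without the locally held mapping.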
57
Performance
58
Performance Of A System
59
Achieving Better Performance
System should do:
more work
at a faster rate
by consuming less computing resources
Time-proven design principles apply on cloud as well
Exploit parallelism
Pooling of shared resources
Put processing near the data/resources it needs
Minimizing round trips
… and rest of the good stuff
60
Key Points For Cloud
Consolidation of computing resources
Improves overall utilization in a data center
Use virtualization to achieve this
Rapid elasticity
On-demand provisioning of resources
Fast scaling of applications
Latencies can still be an issue
Multi-tenancy
May result in performance interference
61
Elasticity On Cloud Platforms
Elasticity: provision computing resources on demand
Consumers can define auto-scaling rules
E.g. on an existing VM when CPU usage remains above
80% for 10 minutes, launch a new VM
Eliminates manual intervention for provisioning
Auto-scaling strategies
A matter of research
Provisioning a VM and starting the apps on it takes
time
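The example rule (CPU above 80% for 10 minutes) can be sketched as a check over recent utilization samples. The function name, the sample format and the requirement of a full window of history are assumptions; real platforms express such rules in their own configuration formats.

```python
def should_scale_out(samples, threshold: float = 80.0, window_s: int = 600) -> bool:
    """Decide whether to launch a new VM.

    samples is a list of (timestamp_s, cpu_percent) pairs, oldest first.
    Returns True only if CPU stayed above the threshold for the whole window.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    if latest - samples[0][0] < window_s:
        return False                      # not enough history yet
    recent = [cpu for ts, cpu in samples if ts >= latest - window_s]
    return all(cpu > threshold for cpu in recent)


# One sample per minute for 10 minutes, all at 85% CPU.
busy = [(t, 85.0) for t in range(0, 601, 60)]
print(should_scale_out(busy))
```

Requiring the condition to hold for the whole window avoids reacting to short spikes, but it also adds to the delay before new capacity arrives, on top of the VM provisioning time noted above.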
62
Provisioning Latency: An Important Issue
Small Instance
1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit),
1.7 GB of memory, 160 GB of instance storage, 32-bit
platform with a base install of CentOS 5.3 AMI
Can take 5 to 6 minutes in us-east-1c from launch to
availability
Large Instance
4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute
Units each), 7.5 GB of memory, 850 GB of instance storage,
64-bit platform with a base install of CentOS 5.3 AMI
Can take 11 to 18 minutes in us-east-1c
[http://www.philchen.com/2009/04/21/how-long-does-it-take-to-launch-an-amazon-ec2-instance]
63
Addressing Provisioning Latency Issue
64
Multi-tenancy On Cloud
Different consumers’ applications and data are hosted
on shared infrastructure
e.g. Single physical disk holding data from different
consumers
Needed for optimizing resource utilization
Allows cloud provider to leverage economies of
scale
Consumers can assume they are sandboxed
i.e. their apps and data are isolated from others’
65
Multi-tenancy On Cloud
[Figure: VMs for customers 1, 2 and 3 on a shared physical host]
66
Issues Arising Due To Multi-tenancy
Performance interference
VMs co-located on same physical host may affect each
other’s performance
E.g. simply forcing expensive cache invalidation on the host
CPU
Potential for one VM breaking into another
Bugs in virtualization software
Encrypting data by user can help
68
Performance Interference (Throughput)
[Figure: throughput under increasing load type vs. steady load type]
Sodhi & Prabhakar, “Performance Characteristics of Virtualized Platforms from Applications Perspective” in GLOBE 2012 (Springer LNCS)
69
Performance Interference (CPU)
70
Performance Interference (Memory)
71
Summary
Performance best practices from non-cloud
environments continue to apply on cloud
But you can leverage cloud specific
characteristics
Elasticity of computing resources
Virtual nature of resources
On-demand provisioning
Performance remains a design issue on
cloud as well
One has to design performance into the application
It does not come automatically!
72
Impact Of Platform Characteristics
73
Platform Characteristics
A computing platform (?aaS) carries characteristics
unique to it
Chars. = {Functional + non-functional attributes of
platform}
For example:
Software abstraction of hardware resources (V12N)
Coarse-grained multi-tenancy (IaaS/PaaS)
Limited control of underlying infrastructure (PaaS)
Location transparency (XaaS)
They impact guest application architecture
E.g. the ability of a guest app to achieve certain QAs
74
Finding Impact on Application Architecture
Examine platform characteristics in light of architecture
knowledge and reverse-engineer the tactics and patterns
QA (possible tactic)            Remarks
Reliability (State checkpoint)  On finding a failure, the system can be restored to a prior check-pointed consistent state.
Performance (Add concurrency)   Processing different sets of tasks in parallel by creating additional threads.
Security (Limit exposure)       Services are deployed on hosts in a manner that reduces overall damage when the host is compromised.

Characteristic                                     Impacted QA
Software abstraction of hardware                   DR, Efficiency (+)
Programmatic self-serviced provisioning            Deployability, Scalability (+)
VM/resource check-pointing and snapshots           Availability, DR, Reliability (+)
Lack of computing assets custody                   Privacy, Security (-)
Relative anonymity behind subscription and usage   Security (-)
78
Tier Co-location: Logical View
79
Tier Co-location: Example
80
Selective Computation Transfer
Problem context:
An application functionality is such that:
Different requests may take different amounts of resources to
process
Some requests may require execution of privileged
components
Majority of requests can be served from normal deployment
environment
You want to maintain certain QoS for ALL requests
81
Selective Computation Transfer
Architecture
82
Selective Computation Transfer
Example implementation
83
Aspect-orientation at Platform-level
Problem context:
There are software design and implementation concerns
which apply at the coarse grained computing
environment/platform level, and they cut across the
applications
E.g. monitoring and reacting to platform level events
84
Aspect-orientation at Platform-level
85
Aspect-orientation at Platform-level
Example implementation
86
Summary
It is important to determine and understand platform
characteristics
Cloud and virtualization based platforms have several
unique characteristics
Devise solutions to application scenarios
Leverage platform characteristics
87