
ICT723 Virtualisation and Cloud Computing

Lecture 8: Cloud Hardware and Software

Cloud infrastructure
• Cloud service providers exploit the latest computing, communication, and
software technologies to offer a highly available, easy to use, and efficient
cloud computing infrastructure.
• Cloud infrastructure is built with inexpensive off-the-shelf components to
deliver cheap computing cycles.
• Virtual machines (VMs) and containers are key components of the cloud
infrastructure.
• Typical cloud workloads include:
• Coarse-grained batch applications, and
• Fine-grained, long-running applications with strict timing constraints.
Virtualization
• Virtualization is a critical element of the cloud infrastructure.
• Virtualization simulates the interface to a physical object by:
• Multiplexing: creates multiple virtual objects from one instance of a
physical object. Example - a processor is multiplexed among a number
of processes or threads.
• Aggregation: creates one virtual object from multiple physical objects.
Example - a number of physical disks are aggregated into a RAID disk.
• Emulation: constructs a virtual object from a different type of physical
object. Example - a physical disk emulates Random Access Memory
(RAM).
• Multiplexing combined with emulation. Example - virtual memory with paging
multiplexes real memory and disk, while a virtual address emulates a real
address.
Virtual machines
• Processor virtualization by multiplexing is beneficial for users & CSPs.
• Users appreciate virtualization because it provides better isolation of applications from
one another than the traditional process-sharing model. An application developer can
choose to develop the application in a familiar environment and under the OS of her
choice.
• CSPs enjoy larger profits due to the lower cost of providing cloud services.
• Running multiple VMs on the same server allows applications to better
share the server resources and leads to higher processor utilization.
• Virtualization also provides more freedom for the system resource
management because VMs can be easily migrated.
• VM migration steps: a VM is stopped, its state is saved as a file, the file is
transported to another server, and the VM is restarted.
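A minimal sketch of these stop-and-copy migration steps, assuming a libvirt-managed host with the virsh CLI available; the domain name, state-file path, and target host are hypothetical placeholders.

import subprocess

def migrate_vm(domain: str, target_host: str, state_file: str = "/tmp/vm.state") -> None:
    # 1. Stop the VM and save its state to a file.
    subprocess.run(["virsh", "save", domain, state_file], check=True)
    # 2. Transport the state file to another server.
    subprocess.run(["scp", state_file, f"{target_host}:{state_file}"], check=True)
    # 3. Restart the VM on the target server from the saved state.
    subprocess.run(["ssh", target_host, "virsh", "restore", state_file], check=True)

# Example (hypothetical names): migrate_vm("web-vm-01", "server-2.example.com")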
Containers
• Containers
• Are based on OS-level virtualization rather than hardware virtualization
• Isolate applications running inside a container from applications running in a different
container
• Isolate applications from the physical system where they run
• Resources used by a container can be limited (see the sketch after this list)
• Benefits
• Ease the creation and deployment of applications.
• Application container images are created at build time, rather than deployment time.
• Support portability; containers run independently of the environment.
• Benefit from application-centric management.
• Suit a deployment philosophy in which applications are broken into smaller, independent pieces
that can be managed dynamically.
• Support higher resource utilization.
• Lead to predictable application performance.
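A minimal sketch of the resource-limiting point above, assuming Docker is installed; the image name and the limits are illustrative.

import subprocess

# Run a short-lived container with capped memory and CPU cycles.
subprocess.run([
    "docker", "run", "--rm",
    "--memory", "512m",   # cap the RAM available to the container
    "--cpus", "1.5",      # cap the CPU to 1.5 cores' worth of cycles
    "python:3.12-slim",
    "python", "-c", "print('isolated, resource-limited workload')",
], check=True)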
Warehouse Scale Computers
• WSCs form the backbone of the cloud infrastructure.
• A WSC has 50,000 - 100,000 processors.
• A hierarchy of networks connects servers, racks, and cells/arrays.
• A rack consists of 48 servers interconnected by a 48-port, 10 Gbps
Ethernet (GE) switch. In addition to the 48 ports, the GE switch has two
to eight uplink ports connecting a rack to a cell.
• A cell/array consists of a number of racks. The racks in a cell are
connected by an array switch.
• The cost of a WSC is of the order of $150 million.
• Cost-performance is what makes WSCs appealing.
WSC processors
• Two basic groups of multicore processors:
• brawny - single-core performance is impressive, but so is the
power dissipation.
• wimpy - less powerful but consume less power.
• When running on wimpy cores a task needs to spawn a larger number of
threads. Major implications:
• It complicates the software development process, as it requires explicit parallelization of
the application, thus increasing the cost of application development.
• Running a larger number of threads increases the response time. Very often all threads have
to finish before the next step of an algorithm - the well-known problem posed by barrier
synchronization.
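A toy Python illustration of the barrier-synchronization problem above: every step of the algorithm waits for the slowest thread, so spawning more (wimpy-core) threads stretches the response time.

import random
import threading
import time

N_THREADS = 8
barrier = threading.Barrier(N_THREADS)

def worker(tid: int) -> None:
    for step in range(3):
        time.sleep(random.uniform(0.01, 0.05))  # uneven per-thread work
        barrier.wait()                          # no thread proceeds until all arrive
        if tid == 0:
            print(f"step {step}: all {N_THREADS} threads finished")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()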
WSC Storage – latency, bandwidth, and capacity
• The memory hierarchy of a WSC is characterized by the latency (in
microseconds), the bandwidth (in MB/sec), and the capacity (in GB) of each level.
WSC performance
• The WSC workload is diverse; there are no "killer" applications that would drive
the design decisions and, at the same time, guarantee optimal performance
for such workloads.
• Solution - profile realistic workloads and analyze data collected during
production runs.
• Google-Wide-Profiling (GWP) - a low-overhead monitoring tool used to gather
the data through random sampling.
• Only data for C++ codes was analyzed, because C++ codes dominate the
CPU cycle consumption, even though the majority of codes are written in Java,
Python, and Go.
• Data was collected from some 20,000 servers built with Intel Ivy Bridge
processors.
Insights from WSC performance analysis
• Cloud workloads display access patterns involving bursts of
computations intermixed with bursts of stall cycles.
• Processors supporting a higher level of simultaneous multithreading
(SMT) are better equipped than 2-wide SMT processors to hide the
latency by overlapping stall cycles.
• Large working sets of the codes are responsible for the high rate of
instruction cache misses.
• MPKI (misses per kilo instructions) are particularly high for L2 caches.
• Larger caches would alleviate this problem, but at the cost of higher
cache latency.
• Separate cache policies which give priority to instructions over data or
separate L2 caches for instructions and data could help.
Virtualization → user benefits versus concerns
• Users operate in environments they are familiar with, rather than being
forced into idiosyncratic ones.
• Applications can migrate from one platform to another.
• Supports performance isolation, important for application optimization and
QoS (Quality of Service) assurance.
• Adds overhead and increases the execution time. The hypervisor is
invoked by the OS when applications make system calls.
Virtualization → user benefits versus concerns
◼ Simplifies the development and management of services offered
by a CSP.
◼ Allows isolation of services running on the same hardware.
• Important for load balancing. The state of a virtual machine (VM) running
under a hypervisor can be saved and migrated to another server to balance
the load.
• Increases the size of the software stack.
• Complicates software maintenance. Saved VMs are not updated when OS
and other system software patches are applied.
Hypervisors – CPU and memory virtualization
• A hypervisor:
• Traps the privileged instructions executed by a guest OS and
enforces the correctness and safety of the operation.
• Traps interrupts and dispatches them to the individual guest
operating systems.
• Controls the virtual memory management.
Hypervisors – CPU and memory virtualization
• A hypervisor:
• Maintains a shadow page table for each guest OS and replicates
any modification made by the guest OS in its own shadow page
table. This shadow page table points to the actual page frame
and it is used by the Memory Management Unit (MMU) for
dynamic address translation.
• Monitors the system performance and takes corrective actions to
avoid performance degradation. For example, the VMM may
swap out a Virtual Machine to avoid thrashing.
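A toy sketch of the shadow-page-table mechanism described above: the hypervisor composes the guest's page table (guest-virtual to guest-"physical") with its own map (guest-"physical" to host-physical) and replicates every guest update. Page numbers are illustrative.

class ShadowPageTable:
    def __init__(self, guest_pt: dict[int, int], host_map: dict[int, int]):
        self.guest_pt = guest_pt   # maintained by the guest OS
        self.host_map = host_map   # maintained by the hypervisor
        # The shadow table points to the actual page frames and is used by the MMU.
        self.shadow = {gv: host_map[gp] for gv, gp in guest_pt.items()}

    def guest_updates_entry(self, gv_page: int, gp_page: int) -> None:
        """Trap a guest page-table write and replicate it in the shadow table."""
        self.guest_pt[gv_page] = gp_page
        self.shadow[gv_page] = self.host_map[gp_page]

# Guest virtual page 0 -> guest "physical" 2 -> host physical 7, etc.
spt = ShadowPageTable(guest_pt={0: 2}, host_map={2: 7, 3: 9})
spt.guest_updates_entry(1, 3)
print(spt.shadow)   # {0: 7, 1: 9}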
Cluster resource management
• There are two sides of cluster management:
• One reflects the views of application developers who need simple means to locate
resources for an application and then to control the use of resources;
• The other is the view of service providers concerned with system availability,
reliability, and resource utilization.

• New concepts
• Framework ➔ a large consumer of CPU cycles; a widely-used software system such
as Hadoop or MPI (Message Passing Interface, a standardized and portable
message-passing system used by the parallel computing community since the 1990s).
• Resource offer ➔ abstraction for a bundle of resources a framework can allocate on a
cluster node to run its tasks.
• Scheduling for large clusters is a hard problem, due to the system scale
combined with the workload diversity.
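A toy sketch of the resource-offer abstraction above (in the spirit of offer-based managers such as Mesos, not their actual API): the cluster manager offers bundles of a node's resources, and a framework accepts only the offers its tasks can use. All names and quantities are illustrative.

from dataclasses import dataclass

@dataclass
class ResourceOffer:
    node: str
    cpus: float
    mem_gb: float

class Framework:
    """A consumer of CPU cycles, e.g., a Hadoop- or MPI-like system."""
    def __init__(self, task_cpus: float, task_mem_gb: float):
        self.task_cpus = task_cpus
        self.task_mem_gb = task_mem_gb

    def accepts(self, offer: ResourceOffer) -> bool:
        # Accept an offer only if one of this framework's tasks fits in the bundle.
        return offer.cpus >= self.task_cpus and offer.mem_gb >= self.task_mem_gb

framework = Framework(task_cpus=2.0, task_mem_gb=4.0)
offers = [ResourceOffer("node-1", 1.0, 8.0), ResourceOffer("node-2", 4.0, 16.0)]
print([o.node for o in offers if framework.accepts(o)])   # ['node-2']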
Cluster Management with Borg
• Borg is a cluster management software developed at Google.
• Design goals:
• Effectively manage workloads distributed across a large number of machines and be highly
reliable and available.
• Hide the details of resource management and failure handling, thus allowing users to focus on
application development. This is important as the machines of a cluster differ in terms of
processor type and performance, number of cores per processor, RAM, secondary storage,
network interface, and other capabilities.
• Support a range of long-running, highly-dependable production jobs and non-production,
batch jobs.
Borg organization
• Borg cluster ➔ tens of thousands of machines co-located and interconnected
by a data center-scale network fabric.

• Cell ➔ a cluster managed by Borg.

• Borg architecture:
• BorgMaster ➔ a logically centralized controller.
• Borglets ➔ processes running on each machine in the cell.
Borg organization
• The BorgMaster is replicated (5 replicas)
• Each replica maintains an in-memory copy of the state of the cell.
• The state of a cell is also recorded in a Paxos-based store on local disks of each replica.
• An elected master serves as Paxos leader and handles operations that change the state of
a cell, e.g., submit a job or terminate a task.

• Borglets start, stop, and restart failing tasks, manipulate the OS kernel setting
to manage local resources, and report the local state to the BorgMaster.
Borg organization
[Figure: Borg architecture - users submit work via a BorgConfig configuration file, command-line tools, or a web browser; a Borg cell contains the BorgMaster (with its scheduler, persistent store, and communication manager) talking to the Borglets running on each machine.]
Borg scheduler
• The scheduler periodically scans a priority queue of pending tasks in round-robin order.
• A feasibility component locates machines where a task could run.
• A scoring component identifies the machine(s) that will actually run the task (a toy
sketch of these two stages follows below).

• Allocs and alloc sets reserve resources on a single server, or on multiple servers.



• Jobs have priorities.
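A toy sketch of the two scheduling stages named above, feasibility checking followed by scoring; the scoring rule here (keep the most CPU headroom) is purely illustrative, not Borg's actual policy.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Machine:
    name: str
    free_cpus: float
    free_ram_gb: float

@dataclass
class Task:
    priority: int
    cpus: float
    ram_gb: float

def schedule(task: Task, machines: list[Machine]) -> Optional[Machine]:
    # Feasibility: machines with enough free resources to run the task.
    feasible = [m for m in machines
                if m.free_cpus >= task.cpus and m.free_ram_gb >= task.ram_gb]
    if not feasible:
        return None
    # Scoring: pick the machine that keeps the most CPU headroom after placement.
    return max(feasible, key=lambda m: m.free_cpus - task.cpus)

cell = [Machine("m1", 4, 16), Machine("m2", 16, 64), Machine("m3", 1, 2)]
print(schedule(Task(priority=200, cpus=2, ram_gb=8), cell).name)   # m2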
Borg scheduler
• Distinct priority bands are defined for activities such as monitoring,
production, batch, and testing.

• A quota system for job scheduling uses a vector including the quantity of
resources such as CPU, RAM, disk for specified periods of time.

• Higher-priority quotas cost more than lower-priority ones.

• To manage large cells, the scheduler spawns several concurrent
processes to interact with the BorgMaster and Borglets. These
processes operate on cached copies of the cell state.
Borg scheduler
• To avoid determining the feasibility of each pending task on every
machine, Borg computes feasibility and scoring per equivalence class of
tasks (tasks with similar requirements). This evaluation is not done for
every machine in the cell, but on random machines until enough suitable
machines have been found.

• Production jobs are allocated about 70% of CPU resources and 55% of
the total memory.
Borg scheduler
• A Borg job could have multiple tasks and runs in a single cell. The majority
of jobs do not run inside a VM.

• Tasks map to Linux processes running in containers.

• Results collected for a 12,000 server cluster at Google show:


• Aggregate CPU utilization of 25-35%.
• Aggregate memory utilization of 40%.
• A reservation system raises these figures to 75% and 60%, respectively.
Resource isolation
• A cluster management system must perform well for a mix of applications
and deliver the performance promised by the strict Service Level Objectives
(SLOs) for each workload.

• The dominant components of this application mix are:


• Latency-critical (LC) workloads, e.g., web search.
• Best-effort (BE) batch workloads, e.g., Hadoop.
Resource isolation
• The two types of workloads share the servers and compete with one another
for their resources.

• Previous resource management systems act at the level of a cluster, but
cannot be very effective at the level of individual servers or processors.
• They cannot have accurate information simply because the state of processors changes
rapidly and communication delays prohibit a timely reaction to these changes.
• A centralized, or even a distributed system for fine-grained server-level resource tuning
would not be scalable.
System resources and isolation
• Physical cores, cache, DRAM, power supplied to the processor, and
network bandwidth are all resources that affect the ability of an LC
workload to satisfy the SLO constraints.

• Individual resource isolation is not sufficient; cross-resource interactions
deserve close scrutiny. For example:
• Contention for cache affects DRAM bandwidth;
• A large network bandwidth allocated to query processing affects CPU utilization, as
communication protocols consume a large number of CPU cycles.
System resources and isolation
• Processor cores are the engine delivering CPU cycles and an obvious
target of dynamic rather than static allocation for co-located workloads.

• Hyper-threading (HT) takes advantage of the superscalar architecture and
increases the number of independent instructions in the pipeline. For
each physical core the OS uses two virtual cores and shares the
workload between them when possible. This sharing interferes with
instruction execution, shared caches, and TLB operations.
Controlling processor resources
• Dynamic frequency scaling → technique for adjusting the clock
rate of cores sharing a socket. The higher the frequency, the
more instructions are executed per unit of time by each core, and
the larger the processor's power consumption. The clock frequency
is related to the operating voltage of the processor.
• Dynamic voltage scaling → a power conservation technique often
used together with frequency scaling, thus the name dynamic
voltage and frequency scaling (DVFS).
• Overclocking → techniques based on DVFS opportunistically
increase the clock frequency of processor cores above the
nominal rate when the workload increases.
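A minimal sketch of inspecting a core's DVFS state through the Linux cpufreq sysfs interface; the paths assume a cpufreq driver is loaded, and changing the settings (e.g., selecting the "performance" governor) normally requires root privileges, so it is omitted here.

from pathlib import Path

CPUFREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq")

def read(entry: str) -> str:
    return (CPUFREQ / entry).read_text().strip()

print("governor:          ", read("scaling_governor"))   # e.g. "ondemand"
print("current freq (kHz):", read("scaling_cur_freq"))
print("max freq (kHz):    ", read("scaling_max_freq"))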
Latency critical (LC) & best-effort (BE) workloads
• Google solution for mixing LC and BE workloads:
• Each server should react to changing demands and dynamically alter
the balance of resources used by co-located workloads.
• A feedback system implements an iso-latency policy: supply just enough
resources so that the SLOs are met (a toy sketch follows below).
• Allow LC workloads to expand their resource portfolio at the expense of
co-located best-effort workloads.
• The LC load at any given time is unpredictable; therefore latency
constraints are unlikely to be satisfied at times of peak
demand unless special precautions are taken.
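A toy sketch of the iso-latency feedback idea referenced above: measure the LC workload's latency and shift cores between the LC and BE workloads so the SLO is just met. The threshold values and the simulated latency samples are hypothetical.

SLO_MS = 50.0        # latency target for the LC workload
TOTAL_CORES = 32

def rebalance(lc_latency_ms: float, lc_cores: int) -> int:
    """Return the new number of cores reserved for the LC workload."""
    if lc_latency_ms > SLO_MS and lc_cores < TOTAL_CORES:
        return lc_cores + 1   # SLO missed: take a core away from BE work
    if lc_latency_ms < 0.7 * SLO_MS and lc_cores > 1:
        return lc_cores - 1   # comfortable slack: give a core back to BE work
    return lc_cores

lc_cores = 8
for latency in [40.0, 62.0, 58.0, 30.0]:   # simulated latency samples
    lc_cores = rebalance(latency, lc_cores)
    print(f"latency={latency} ms -> LC cores={lc_cores}, BE cores={TOTAL_CORES - lc_cores}")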
Latency critical (LC) & best-effort (BE) workloads
• Resource reservation at the level needed for the peak demand of LC
workloads is wasteful; it leads to low or extremely low resource
utilization, hence the need for better alternatives.

• Isolate → prevent the BE workload from interfering with the SLO of
the latency-critical workload.
Latency-critical workloads at Google
• websearch → the query component of the web search service.
• Every query has a large fan-out to thousands of leaf nodes, each one of them
processing the query on a shard of the search index stored in DRAM.
• Each leaf node has a strict SLO of tens of milliseconds. The task is compute-
intensive, as it has to rank search hits; it has a small working set of
instructions, a large memory footprint, and moderate DRAM bandwidth.

• memkeyval → in-memory key-value store used by back-end web
services.
• SLO latency is of hundreds of microseconds.
• The high request rate makes this service compute intensive mostly due to the
CPU cycles needed for network protocol processing.
Latency-critical workloads at Google
• ml_cluster → standalone service using machine-learning for assigning
a snippet of text to a cluster. Its SLO is of tens of milliseconds.
• Slightly less CPU intensive, requires a larger memory bandwidth and lower
network bandwidth than memkeyval.
• Each request for this service has a small cache footprint, but the high rate of
pending requests puts pressure on the cache and DRAM.
In-memory cluster computing for Big Data
• It is unrealistic to assume that very large clusters could
accommodate in-memory storage of petabytes or more in the
foreseeable future. Even if storage costs decline dramatically,
the intensive communication among the servers will limit the
performance.
• Iterative and other classes of Big Data applications → a stable
subset of the input data is used repeatedly. In such cases
performance improvements can be expected if a working set of
input data is identified, loaded in memory, and kept for future
use.
In-memory cluster computing for Big Data
• A distributed shared-memory (DSM) is the obvious solution to in-
memory data reuse. It has advantages and drawbacks:
• Allows fine-grained operations.
• Access to individual data elements is not particularly useful for several
classes of applications.
• Does not support effective fault-recovery and data distribution.
• Does not lead to significant performance improvements.
A data sharing abstraction
• Resilient Distributed Dataset (RDD) → data sharing abstraction
for fault-tolerant, parallel data structures.
• RDD allows a user to keep intermediate results and optimizes
their placement in the memory of a large cluster.
• Data storage in memory significantly improves performance.
• The write bandwidth of both hard disks and solid-state disks
is three orders of magnitude lower than the memory bandwidth.
• Although the random-access latency of solid-state disks is much lower than
the latency of hard disks, their sequential I/O bandwidth is not larger.
A data sharing abstraction
• The RDD user interface exposes:
1. Partitions, atomic pieces of the dataset.
2. Dependencies on parent RDD.
3. A function for constructing the dataset.
4. Metadata about data location.
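A toy sketch of the four-part interface listed above; this illustrates the abstraction only and is not Spark's internal implementation.

from dataclasses import dataclass, field
from typing import Any, Callable, Iterator

@dataclass
class ToyRDD:
    partitions: list[Any]                                         # 1. atomic pieces of the dataset
    dependencies: list["ToyRDD"] = field(default_factory=list)    # 2. parent RDDs
    compute: Callable[[Any], Iterator[Any]] = iter                # 3. builds a partition's records
    locations: dict[int, str] = field(default_factory=dict)       # 4. data-location metadata

    def map(self, f: Callable[[Any], Any]) -> "ToyRDD":
        # A map keeps its parent's partitioning and preferred locations.
        return ToyRDD(
            partitions=self.partitions,
            dependencies=[self],
            compute=lambda part: (f(x) for x in self.compute(part)),
            locations=self.locations,
        )

base = ToyRDD(partitions=[[1, 2], [3, 4]])
doubled = base.map(lambda x: 2 * x)
print([list(doubled.compute(p)) for p in doubled.partitions])   # [[2, 4], [6, 8]]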
Spark
• The Spark system developed at U. C. Berkeley provides a set of
operators to efficiently manipulate RDDs using a set of coarse-
grained operations such as map, union, sample and join.
• Map - creates an RDD with the same partitions and preferred
locations as its parent, but applies the function passed as an
argument to the records produced by the iterator method over the
parent's records.
• Union - applied to two RDDs returns an RDD whose partitions are
the union of the partitions of the two parents.
Spark
• Sampling - is similar to map, but the RDD stores for each partition
a random number generator to deterministically sample parent
records.
• Join - creates an RDD with either two narrow, two wide or mixed
dependencies.
• Spark and RDDs are restricted to I/O intensive applications
performing bulk writing.
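A short PySpark sketch exercising the coarse-grained operators named above (map, union, sample, join); it assumes a local Spark installation and the pyspark package.

from pyspark import SparkContext

sc = SparkContext("local[2]", "rdd-operators-demo")

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5, 6])

doubled = a.map(lambda x: 2 * x)            # same partitioning as its parent
both = a.union(b)                           # partitions of both parents combined
sampled = both.sample(False, 0.5, seed=42)  # deterministic per-partition sampling

pairs_a = a.map(lambda x: (x % 2, x))       # key-value pairs keyed by parity
pairs_b = b.map(lambda x: (x % 2, x))
joined = pairs_a.join(pairs_b)              # join by key

print(doubled.collect(), sampled.collect(), joined.take(3))
sc.stop()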
Spark performance
• Spark is up to 20 times faster than Hadoop for iterative
applications, speeds up a data analytics report by 40 times, and can
be used interactively to scan a 1 TB dataset with 5-7 seconds
latency.

• Only 200 lines of Spark code implement the HaLoop model for
MapReduce applications.
Questions:
