Evaluating and Improving the Performance and Scheduling of HPC Applications in Cloud
Abstract—Cloud computing is emerging as a promising alternative to supercomputers for some high-performance computing (HPC)
applications. With cloud as an additional deployment option, HPC users and providers are faced with the challenges of dealing with
highly heterogeneous resources, where the variability spans across a wide range of processor configurations, interconnects,
virtualization environments, and pricing models. In this paper, we take a holistic viewpoint to answer the question—why and who
should choose cloud for HPC, for what applications, and how should cloud be used for HPC? To this end, we perform comprehensive
performance and cost evaluation and analysis of running a set of HPC applications on a range of platforms, varying from
supercomputers to clouds. Further, we improve the performance of HPC applications in cloud by optimizing the applications' characteristics for cloud and the cloud virtualization mechanisms for HPC. Finally, we present novel heuristics for online application-aware
job scheduling in multi-platform environments. Experimental results and simulations using CloudSim show that current clouds cannot
substitute supercomputers but can effectively complement them. Significant improvement in average turnaround time (up to 2X)
and throughput (up to 6X) can be attained using our intelligent application-aware dynamic scheduling heuristics compared to
single-platform or application-agnostic scheduling.
Index Terms—HPC, cloud, performance analysis, economics, job scheduling, application-awareness, characterization
1 INTRODUCTION
- ... and observed performance, identifying what applications are suitable for cloud (Section 3, Section 4).
- We also evaluate the performance when running the same benchmarks on exactly the same hardware, without and with different virtualization technologies, thus providing a detailed analysis of the isolated impact of virtualization on HPC applications (Section 6).
- To bridge the divide between HPC and clouds, we present the complementary approach of (1) making HPC applications cloud-aware by optimizing an application's computational granularity and problem size for cloud, and (2) making clouds HPC-aware using thin hypervisors, OS-level containers, and hypervisor- and application-level CPU affinity, addressing how to use cloud for HPC (Section 5, Section 6).
- We investigate the economic aspects of running in cloud and discuss why it is challenging or rewarding for cloud providers to operate a business for HPC compared to traditional cloud applications. We also show that small/medium-scale users are the likely candidates who can benefit from an HPC-cloud (Section 7).
- Instead of considering cloud as a substitute for a supercomputer, we investigate the co-existence of multiple platforms: supercomputer, cluster, and cloud. We research novel heuristics for application-aware scheduling of jobs in this multi-platform scenario, significantly improving average job turnaround time (up to 2X) and job throughput (up to 6X) compared to running all jobs on the supercomputer (Section 8).

The insights from performance evaluation, characterization, and multi-platform scheduling are useful for both HPC users and cloud providers. Users can better quantify the benefits of moving to a cloud. Cloud providers can optimize the allocation of applications to their infrastructure to maximize utilization and turnaround times.

2 EVALUATION METHODOLOGY

In this section, we describe the platforms which we compared and the application suite used in this study.

2.1 Experimental Testbed
We selected platforms with different interconnects, operating systems, and virtualization support to cover the dominant classes of infrastructures available today to an HPC user. Table 1 shows the details of each platform. In the case of cloud, a node refers to a virtual machine and a core refers to a virtual core. For example, "2 QEMU Virtual CPU @2.67 GHz" means that each VM has two virtual cores.

TABLE 1
Testbed

Platform         Processors in a Node             Memory   Network                         OS
Ranger           16 AMD Opteron QC @2.3 GHz       32 GB    Infiniband (1 GB/s)             Linux
Taub             12 Intel Xeon X5650 @2.67 GHz    48 GB    QDR Infiniband                  Sci. Linux
Open Cirrus      4 Intel Xeon E5450 @3.00 GHz     48 GB    10GigE internal, 1GigE x-rack   Ubuntu 10.04
Private Cloud    2 QEMU Virtual CPU @2.67 GHz     6 GB     Emulated 1GigE                  Ubuntu 10.04
Public Cloud     4 QEMU Virtual CPU @2.67 GHz     16 GB    Emulated 1GigE                  Ubuntu 10.10
EC2-CC Cloud     16 Xen HVM VCPU @2.6 GHz         60 GB    Emulated 10GigE                 Ubuntu 12.04

Ranger [10] at TACC was a supercomputer (decommissioned in Feb. 2013), and Taub at UIUC is an HPC-optimized cluster. Both use Infiniband as interconnect. Moreover, Taub uses Scientific Linux as OS and has QDR Infiniband with a bandwidth of 40 Gbps. We used physical nodes with commodity interconnect at the Open Cirrus testbed at the HP Labs site [11]. The next two platforms are clouds: a private cloud set up using Eucalyptus [12], and a public cloud. We use KVM [13] for virtualization since it has been shown to be a good candidate for HPC virtualization [14]. Finally, we also used an HPC-optimized cloud, the Amazon EC2 Cluster Compute Cloud [7] in the US West (Oregon) zone, with cc2.8xlarge instances under Xen HVM virtualization launched in the same placement group for best networking performance [7].

Another dedicated physical cluster at the HP Labs Singapore (HPLS) site is used for controlled tests of the effects of virtualization (see Table 2). This cluster is connected with a Gigabit Ethernet network on a single switch. Every server has two CPU sockets, each populated with a six-core CPU, resulting in 12 physical cores per node. The experiments on the HPLS cluster involved benchmarking on four configurations: physical machines (bare), LXC containers [15], VMs configured with the default emulated network (plain VM), and VMs with pass-through networking, enabled through the input/output memory management unit (IOMMU) on the Linux hosts so that VMs can directly access the Ethernet hardware (thin VM) [16]. Both the plain VM and the thin VM run atop KVM. This testbed is designed to test the isolated impact of virtualization, which is impossible to do on public clouds due to the lack of direct access to the public cloud's hardware.

TABLE 2
Virtualization Testbed

Virtualization     Processors in a Node/VM          Memory   Network          OS
Phy., Container    12 Intel Xeon X5650 @2.67 GHz    120 GB   1GigE            Ubuntu 11.04
Thin VM            12 QEMU Virtual CPU @2.67 GHz    100 GB   1GigE            Ubuntu 11.04
Plain VM           12 QEMU Virtual CPU @2.67 GHz    100 GB   Emulated 1GigE   Ubuntu 11.04
Fig. 1. Time in seconds (y-axis) versus core count (x-axis) for different applications (strong scaling except Sweep3D). All applications scale well on
supercomputers and most scale moderately well on Open Cirrus. On clouds, some applications scale well (e.g., EP), some scale till a point (e.g.,
ChaNGa) whereas some do not scale (e.g., IS).
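Fig. 1 reports raw execution time under strong scaling for all codes except Sweep3D, which is run in weak-scaling mode. For reference, the conventional definitions behind these terms can be written as follows (the symbols T(p), S(p), and E(p) are introduced here only for this summary and do not appear in the original figures):

\[
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p},
\]

where strong scaling keeps the total problem size fixed as the core count p grows (ideally T(p) falls as 1/p), while weak scaling keeps the work per core fixed (as in the Sweep3D runs), so the ideal behavior is a flat T(p).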
2.2 Benchmarks and Applications
To gain thorough insights into the performance of the selected platforms, we chose benchmarks and applications from different scientific domains and ones which differ in the nature, amount, and pattern of inter-processor communication. Similarly to previous work [2], [3], [4], [5], we used the NAS parallel benchmarks (NPB) class B [17] (the MPI version, NPB3.3-MPI), which exhibit a good variety of computation and communication requirements. Moreover, we chose additional benchmarks and real-world applications, written in two different parallel programming environments, MPI [18] and CHARM++ [19]:

- Jacobi2D: a five-point stencil kernel, which averages values in a 2D grid and is common in scientific simulations, numerical linear algebra, solutions of partial differential equations, and image processing (a minimal sketch of its communication structure appears after this list).
- NAMD [20]: a highly scalable molecular dynamics application representative of a complex real-world application used ubiquitously on supercomputers. We used the ApoA1 input (92k atoms).
- ChaNGa [21]: a cosmological simulation application which performs collisionless N-body interactions using a Barnes-Hut tree for calculating forces. We used a 300,000-particle system.
- Sweep3D [22]: a particle transport code widely used for evaluating HPC architectures. Sweep3D exploits parallelism via a wavefront process. We ran the MPI-Fortran77 code in weak scaling mode, maintaining 5 × 5 × 400 cells per processor.
- NQueens: a backtracking state space search implemented as a tree-structured computation. The goal is to place N queens on an N × N chessboard (N = 18 in our runs) so that no two queens attack each other. Communication happens only for load balancing.
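As referenced in the Jacobi2D item above, the following is a minimal MPI sketch of a five-point stencil with a one-dimensional row decomposition and ghost-row exchange. It is our own illustration of the kernel's structure, not the benchmark code used in this paper; the grid size and iteration count are placeholders.

    // jacobi2d.cpp: minimal five-point stencil with 1-D row decomposition.
    // Illustrative sketch only (not the paper's benchmark); compile with mpicxx.
    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      const int N = 4096;            // global N x N grid (placeholder size)
      const int iters = 100;         // fixed iteration count (placeholder)
      const int rows = N / nprocs;   // rows owned by this rank (assume divisible)
      const int up = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
      const int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

      // Local block plus one ghost row above and one below.
      std::vector<double> cur((rows + 2) * N, 0.0), next((rows + 2) * N, 0.0);
      auto at = [N](std::vector<double>& a, int i, int j) -> double& { return a[i * N + j]; };

      for (int it = 0; it < iters; ++it) {
        // Exchange ghost rows with neighbors (communication is O(N) per rank).
        MPI_Sendrecv(&at(cur, 1, 0), N, MPI_DOUBLE, up, 0,
                     &at(cur, rows + 1, 0), N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&at(cur, rows, 0), N, MPI_DOUBLE, down, 1,
                     &at(cur, 0, 0), N, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Five-point average over interior points (computation is O(rows * N)).
        for (int i = 1; i <= rows; ++i)
          for (int j = 1; j < N - 1; ++j)
            at(next, i, j) = 0.2 * (at(cur, i, j) + at(cur, i - 1, j) + at(cur, i + 1, j) +
                                    at(cur, i, j - 1) + at(cur, i, j + 1));
        cur.swap(next);
      }
      if (rank == 0) std::printf("done: %d iterations on %d ranks\n", iters, nprocs);
      MPI_Finalize();
      return 0;
    }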
On Ranger and Taub, we used MVAPICH2 for MPI and the ibverbs layer of CHARM++. On the remaining platforms we installed Open MPI and the net layer of CHARM++.

3 BENCHMARKING HPC PERFORMANCE

Fig. 1 shows the scaling behavior of our testbeds for the selected applications. These results are averaged across multiple runs (five executions) performed at different times. We show strong scaling results for all applications except Sweep3D, for which we chose to perform weak scaling runs. For NPB, we present results for only the embarrassingly parallel (EP), LU solver (LU), and integer sort (IS) benchmarks due to space constraints. The first observation is the difference in sequential performance: Ranger takes almost twice as long as the other platforms, primarily because of its older and slower processors. The slope of each curve shows how the applications scale on the different platforms. Despite the poor sequential speed, Ranger's performance crosses that of Open Cirrus, the private cloud, and the public cloud for some applications at around 32 cores, yielding much more linearly scalable parallel performance. We investigated the reasons for the better scalability of these applications on Ranger using application profiling, performance tools, and microbenchmarking, and found that network performance is a dominant factor (see Section 4).
We observed three different patterns for applications on these platforms. First, some applications, such as EP, Jacobi2D, and NQueens, scale well on all the platforms up to 128-256 cores. The second pattern is that some applications, such as LU, NAMD, and ChaNGa, scale on the private cloud till
TABLE 3
Application Communication Characteristics: MPI Collectives in Parentheses, Bottlenecks in Bold

Fig. 3. (a) CPU utilization for Jacobi2D on 32 two-core VMs of private cloud. White portion: idle time, colored portions: application functions. (b) Network performance on private and public clouds is off by almost two orders of magnitude compared to supercomputers. EC2-CC provides high bandwidth but poor latency.
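The latency and bandwidth gaps summarized in Fig. 3b, and the ping-pong validation mentioned later in Section 6.1, are the kind of numbers obtained with a simple two-rank ping-pong test. The sketch below is our own minimal version, not the authors' measurement harness; message sizes and repetition counts are placeholders.

    // pingpong.cpp: two-rank MPI ping-pong to estimate latency and bandwidth.
    // Illustrative sketch only; run with exactly 2 ranks, e.g. mpirun -np 2 ./pingpong
    #include <mpi.h>
    #include <vector>
    #include <cstdio>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int reps = 1000;
      for (int bytes : {8, 1024, 65536, 1 << 20}) {   // message sizes (placeholders)
        std::vector<char> buf(bytes, 0);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; ++r) {
          if (rank == 0) {
            MPI_Send(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          } else if (rank == 1) {
            MPI_Recv(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf.data(), bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
          }
        }
        double t = MPI_Wtime() - t0;
        if (rank == 0) {
          double one_way_us = 1e6 * t / (2.0 * reps);        // half of the round-trip time
          double bw_MBps = (2.0 * reps * bytes) / t / 1e6;   // bytes moved per second
          std::printf("%8d B  latency %8.2f us  bandwidth %8.1f MB/s\n",
                      bytes, one_way_us, bw_MBps);
        }
      }
      MPI_Finalize();
      return 0;
    }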
Fig. 5. Optimizing HPC for cloud: Effect of grain size and problem size.
5.2 Problem Sizes and Runtime Retuning
Fig. 6 shows the effect of problem size on the performance (speedup) of different applications on the private cloud and the supercomputer (Taub). With increasing problem sizes (A→B→C), applications scale better, and the gap between cloud and supercomputer reduces. Fig. 5c reaffirms the positive impact of problem size. For Jacobi, we denote class A as a 1k × 1k, class B as a 4k × 4k, and class C as a 16k × 16k grid. As the problem size increases (say by a factor of X) with a fixed number of processors, for most scalable HPC applications the increase in communication (e.g., Θ(X) for Jacobi2D) is less than the increase in computation (Θ(X²) for Jacobi2D). Hence, the communication-to-computation ratio decreases with increasing problem size, which results in a reduced performance penalty for execution on a platform with a poor interconnect. Thus, problem sizes large enough that the communication-to-computation ratio is adequately small can be run more effectively in cloud. Furthermore, applying our cost analysis methodology (Section 7), Fig. 5c can be used to estimate the cross-over points of the problem size where it would be cost-effective to run on a supercomputer versus the cloud.

Fig. 6. Effect of problem size class on attained speedup on supercomputer (Taub) versus private cloud.
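The scaling argument above can be made concrete for Jacobi2D. For an n × n grid partitioned across P processors (assuming a 2D block decomposition; a 1D decomposition changes the P-dependence but not the trend), the per-iteration costs behave roughly as

\[
T_{comp} = \Theta\!\left(\frac{n^{2}}{P}\right), \qquad
T_{comm} = \Theta\!\left(\frac{n}{\sqrt{P}}\right), \qquad
\frac{T_{comm}}{T_{comp}} = \Theta\!\left(\frac{\sqrt{P}}{n}\right).
\]

Scaling the grid dimension by a factor X therefore multiplies computation by X² but communication only by X, shrinking the ratio by 1/X. This is a back-of-the-envelope sketch of the standard analysis, not a measurement from the paper.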
While performing experiments, we learned that parallel runtime systems have been tuned to exploit fast HPC networks. For best performance on cloud, some of the network parameters need to be re-tuned for the commodity cloud network. For example, in the case of CHARM++, increasing the maximum datagram size from 1,400 to 9,000, reducing the window size from 32 to 8, and increasing the acknowledgement delay from 5 to 18 ms resulted in 10-50 percent performance improvements for our applications.
6 OPTIMIZING CLOUD FOR HPC

Cloud-aware HPC execution reduces the penalty caused by the underlying slow physical network in clouds, but it does not address the overhead of network virtualization. Next, we explore optimizations to mitigate the virtualization overhead, hence making clouds HPC-aware.

6.1 Lightweight Virtualization
We consider two lightweight virtualization techniques: thin VMs configured with PCI pass-through for I/O, and containers, that is, OS-level virtualization. Lightweight virtualization reduces the overhead of network virtualization by granting VMs native access to physical network interfaces. Using a thin VM with IOMMU, a physical network interface is allocated exclusively to a VM, preventing the interface from being shared by the sibling VMs and the hypervisor. This may lead to under-utilization when the thin VM generates insufficient network load. Containers such as LXC [15] share the physical network interface with their sibling containers and their host. However, containers must run the same OS as their underlying host. Thus, there is a trade-off between resource multiplexing and the flexibility offered by a VM.
The first five columns of Table 4 validate that network virtualization is the primary bottleneck of the cloud. These experiments were conducted on the virtualization testbed described earlier (Table 2). The plain VM attains poor scaling, but on the thin VM, NAMD execution times closely track those on the physical machine even as multiple nodes are used (i.e., 16 cores onwards). The performance trend of containers also resembles that of the physical machine. This demonstrates that thin VMs and containers significantly lower the communication overhead. This low overhead was further validated by the ping-pong test.

6.2 Impact of CPU Affinity
CPU affinity instructs the OS to bind a process (or thread) to a specific CPU core. This prevents the OS from inadvertently migrating a process. If all important processes have non-overlapping affinity, it practically prevents multiple processes from sharing a core. In addition, cache locality can be improved by processes remaining on the same core
throughout their execution. In the cloud, CPU affinity can be enforced at the application level, which refers to binding processes to the virtual CPUs of a VM, and at the hypervisor level, which refers to binding virtual CPUs to physical CPUs.
Fig. 7 presents the results of our micro-benchmarks with various CPU affinity settings on different types of virtual environments. In this experiment, we executed 12 processes on a single 12-core virtual or physical machine. Each process runs 500 iterations, where each iteration executes 200 million y = y + rand()/c operations, with c being a constant. Without CPU affinity (Fig. 7a), we observe wide fluctuation in the process execution times, up to twice the minimum execution time (i.e., 2.7 s). This clearly demonstrates that frequently two or more of our benchmark processes are scheduled to the same core. The impact of CPU affinity is even more profound on virtual machines. Fig. 7b shows the minimum and maximum execution times of the 12 processes with CPU affinity enabled on the physical machine, while only application-level affinity (appAFF) is enabled on the thin VM. We observe that the gap between minimum and maximum execution times is narrowed in this case. However, on the thin VM, we still notice frequent spikes, which we attribute to the absence of hypervisor-level affinity (hyperAFF). Even though each process is pinned to a specific virtual core, multiple virtual cores may still be mapped to the same physical core. With hypervisor-level affinity enabled, execution times across virtual cores stabilize close to those of the physical machine (Fig. 7c).
In conducting these experiments, we also learned some lessons. First, virtualization introduces a small amount of computation overhead: execution times on containers, thin VM, and plain VM are higher by 1-5 percent (Fig. 7c). Second, for best performance, it is crucial to minimize I/O operations unrelated to applications. Even on the physical machine, execution time increased by 3-5 percent due to disk I/O generated by the launcher shell script and its stdout/stderr redirection. The spikes on the physical machine in Fig. 7c are caused by short ssh sessions which simulate scenarios where users log in to check the job progress. Thus, minimizing unrelated I/O is an important issue for HPC cloud providers.
Table 4 also shows the positive impact of enabling hypervisor-level affinity, application-level affinity, or both (dualAFF). Significant benefits are obtained for the thinVM-dualAFF case compared to the case with no affinity.
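As an illustration of the application-level affinity evaluated here, the sketch below pins the calling process to a given core with sched_setaffinity and then times a CPU-bound loop in the spirit of the y = y + rand()/c micro-benchmark. It is our own Linux-specific example, not the code used for Fig. 7 or Table 4; the core id, constant, and operation count are placeholders. Hypervisor-level affinity (pinning virtual cores to physical cores) is configured on the host through the hypervisor's vCPU-pinning facilities and is not shown.

    // affinity_demo.cpp: pin this process to one core, then time a CPU-bound loop.
    // Illustrative sketch (Linux-specific, application-level affinity only); build with g++ -O2.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <sched.h>
    #include <cstdio>
    #include <cstdlib>
    #include <chrono>

    int main(int argc, char** argv) {
      // Core to pin to, e.g. a rank id in a real launcher (placeholder).
      int core = (argc > 1) ? std::atoi(argv[1]) : 0;

      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(core, &mask);
      if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  // pid 0 = calling process
        std::perror("sched_setaffinity");
        return 1;
      }

      const double c = 3.0;          // arbitrary constant, as in the micro-benchmark
      const long ops = 200000000L;   // 200 million operations (one "iteration")
      double y = 0.0;
      auto t0 = std::chrono::steady_clock::now();
      for (long i = 0; i < ops; ++i) y = y + std::rand() / c;
      auto t1 = std::chrono::steady_clock::now();

      double secs = std::chrono::duration<double>(t1 - t0).count();
      std::printf("core %d: %.3f s (y=%g)\n", core, secs, y);
      return 0;
    }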
TABLE 4
Impact of Virtualization and CPU Affinity Settings on NAMD's Performance
(Execution time per step, in seconds, for each virtualization and affinity setting; columns are core counts)

Setting             1       2       4       8       16      32      64      128
bare                1.479   0.744   0.385   0.230   0.259   0.115   0.088   0.067
container           1.473   0.756   0.388   0.208   0.267   0.140   0.116   0.088
plainVM             1.590   0.823   0.428   0.231   0.206   0.174   0.166   0.145
thinVM              1.586   0.823   0.469   0.355   0.160   0.108   0.090   0.077
bare-appAFF         1.460   0.755   0.388   0.202   0.168   0.079   0.079   0.062
container-appAFF    1.477   0.752   0.385   0.203   0.197   0.082   0.071   0.056
plainVM-hyperAFF    1.584   0.823   0.469   0.355   0.227   0.164   0.150   0.128
plainVM-dualAFF     1.486   0.785   0.422   0.226   0.166   0.141   0.184   0.154
plainVM-appAFF      1.500   0.789   0.429   0.228   0.189   0.154   0.195   0.166
thinVM-hyperAFF     1.630   0.859   0.450   0.354   0.186   0.106   0.089   0.074
thinVM-dualAFF      1.490   0.854   0.449   0.244   0.122   0.079   0.066   0.051
thinVM-appAFF       1.586   0.823   0.469   0.355   0.160   0.108   0.090   0.077

6.3 Network Link Aggregation
Even though network virtualization cannot improve raw network performance, one approach to reducing network latency using commodity Ethernet hardware is to implement link aggregation and a better network topology. Experiments from [24] show that using four to six aggregated Ethernet links in a torus topology can provide up to a 650 percent improvement in overall HPC performance. This would allow cloud infrastructure using commodity hardware to improve raw network performance. Software defined networking (SDN) based on open standards such as OpenFlow, or similar concepts embedded in the cloud software stack, can be used to orchestrate the link aggregation and VLAN isolation necessary to achieve such complex network topologies on an on-demand basis. The use of SDN for controlling link aggregation is applicable to both bare-metal and virtualized compute instances. Moreover, in a virtualized environment, SDN can be integrated into network virtualization to provide link aggregation to VMs transparently.
Fig. 8. Cost ratio of running in the cloud versus a dedicated supercomputer for different scales (cores) and cost ratios (1x-5x). Ratios > 1 imply savings from running in the cloud; ratios < 1 favor supercomputer execution.
7 HPC ECONOMICS IN THE CLOUD

There are several reasons why many commercial and web applications are migrating to public clouds from fully owned resources or private clouds: variable usage in time resulting in lower utilization, trading capital expenditure (CAPEX) for operating expenditure (OPEX), and the shift towards a delivery model of Software as a Service. These arguments apply both to cloud providers and cloud users. Cloud users benefit from running in the cloud when their applications fit the profile we described, e.g., variable utilization. Cloud providers benefit if the aggregated resource utilization of all their tenants can sustain a profitable pricing model when compared to the substantial upfront investments required to offer computing and storage resources through a cloud interface.
Why not cloud for HPC: HPC is, however, quite different from typical web and service-based applications. (1) Utilization of the computing resources is typically quite high on HPC systems. This conflicts with the desirable property of low average utilization that makes the cloud business model viable. (2) Clouds achieve improved utilization through consolidation enabled by virtualization, a foundational technology for the cloud. However, as evident from our analysis, the overhead and noise caused by virtualization and multi-tenancy can significantly affect HPC applications' performance and scalability. For a cloud provider, this means that the multi-tenancy opportunities are limited and the pricing has to be increased to be able to profitably rent a dedicated computing resource to a single tenant. (3) Many HPC applications rely on optimized interconnect hardware to attain best performance, as shown by our experimental evaluation. This is in contrast with the commodity Ethernet network (1 Gbps today, moving to 10 Gbps) typically deployed in most cloud infrastructures to keep costs small. When networking performance is important, we quickly reach diminishing returns when scaling out a cloud deployment to meet a certain performance target. If too many VMs are required to meet performance, the cloud deployment quickly becomes uneconomical. (4) The CAPEX/OPEX argument is less clear for HPC users. Publicly funded supercomputing centers typically have CAPEX in the form of grants, and OPEX budgets may actually be tighter and almost fully consumed by the support and administration of the supercomputer, with little headroom for cloud bursting. (5) Software-as-a-Service offerings are also rare in HPC to date.
Why cloud for HPC: So, what are the conditions that can make HPC in the cloud a viable business model for both HPC users and cloud providers? Unlike large supercomputing centers, HPC users in small-medium enterprises are much more sensitive to the CAPEX/OPEX argument. These include startups with nascent HPC requirements (e.g., simulation or modeling) and small-medium enterprises with growing business and an existing HPC infrastructure. Both of them may prefer the pay-as-you-go approach in clouds versus establishing or growing on-premise resources in volatile markets. Moreover, the ability to leverage a large variety of heterogeneous architectures in clouds can result in better utilization at global scale, compared to the limited choices available in any individual organization. Running applications on the most economical architecture while meeting the performance needs can result in savings for consumers.
7.1 Quantifiable Analysis
To illustrate a few possible HPC-in-the-cloud scenarios, we collected and compared cost and price data of supercomputer installations and typical cloud offerings. Based on our survey of cloud prices, known financial situations of cloud operators, published supercomputing costs, and a variety of internal and external data sources [25], we estimate that a cost ratio between 2x and 3x is a reasonable approximate range capturing the differences between a cloud deployment and on-premise supercomputing resources today. In our terminology, 2x indicates the case where one supercomputer core-hour is twice as expensive as one cloud core-hour. Since these values can fluctuate, we expand the range to [1x-5x] to capture different future, possibly unforeseen scenarios.
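Using this terminology, the comparison behind Fig. 8 can be written compactly. Let T_c(n) and T_s(n) denote the measured execution times on n cores in the cloud and on the supercomputer, and let r in [1, 5] be the assumed supercomputer-to-cloud per core-hour price ratio; these symbols are introduced here only for illustration and do not appear in the original analysis:

\[
\frac{\mathrm{cost}_{super}}{\mathrm{cost}_{cloud}}
  = \frac{r \cdot n \cdot T_{s}(n)}{n \cdot T_{c}(n)}
  = r\,\frac{T_{s}(n)}{T_{c}(n)},
\]

so the cloud is the cheaper option whenever the application's cloud slowdown T_c(n)/T_s(n) at that scale stays below the price ratio r, which is consistent with the values greater than 1 plotted in Fig. 8.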
Using the performance evaluations for the different applications (Fig. 1), we calculated the cost differences of running an application in the public cloud versus running it on a dedicated supercomputer (Ranger), assuming different per core-hour cost ratios from 1x to 5x. Fig. 8 shows the cost differences for three applications, where values > 1 indicate savings from running in the cloud and values < 1 an advantage of running on a dedicated supercomputer. We can see that for each application there is a scale, in terms of the number of cores, up to which it is more cost-effective to execute in the cloud versus on a supercomputer. For example, for Sweep3D, NAMD, and ChaNGa, this scale is higher than 4, 8, and 16 cores, respectively. This break-even point is a function of the application scalability and the cost ratio. However, our observation is that there is little sensitivity to the cost ratio and it is relatively
Fig. 11. The Adaptive heuristic significantly improves makespan and throughput when the system is reasonably loaded.
TABLE 6
Findings and Our Approach to Address the Research Questions on HPC in Cloud

Question  Answers
Who       (1) Small and medium scale organizations, startups, or growing businesses, which can benefit from the pay-as-you-go model.
          (2) Users with applications which result in the best performance/cost ratio in cloud versus other platforms.
What      (1) Applications with less-intensive communication patterns and less sensitivity to interference.
          (2) Applications with performance needs that can be met at small to medium scale execution (in terms of number of cores).
Why       (1) Small-medium enterprises benefit from the pay-as-you-go model since they are highly sensitive to the CAPEX/OPEX argument.
          (2) Clouds enable multiple organizations to access a large variety of shared architectures, leading to improved utilization.
How       (1) Technical approaches: (a) making HPC cloud-aware, e.g., tuning computational granularity and problem sizes, and (b) making clouds HPC-aware, e.g., providing lightweight virtualization and enabling CPU affinity.
          (2) Business models: hybrid supercomputer-cloud approach with application-aware scheduling and cloud bursting.
maps IS at 32 cores to the cluster even if the supercomputer is free. To benefit from multiple platforms, we need to (a) consider both the application characteristics and the scale at which it will be run, and (b) dynamically adapt to the platform loads. The Adaptive heuristic meets these two goals.
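A minimal sketch of the platform-selection step implied by (a) and (b) is shown below: for a given application and core count, it picks the platform with the smallest estimated completion time (a crude wait-time estimate from current load plus an application- and platform-specific run-time estimate). This is our own illustration of the idea; the estimate table, load model, and all names are placeholders, not the heuristic implemented in the paper or in the CloudSim experiments.

    // select_platform.cpp: toy application-aware platform selection.
    // Illustrative sketch only; estimates and the load model are placeholders.
    #include <map>
    #include <string>
    #include <vector>
    #include <limits>
    #include <cstdio>

    struct Platform {
      std::string name;
      double queued_core_hours;  // rough proxy for current load
      int free_cores;
    };

    // Estimated run time (hours) of an application at a given core count on a platform,
    // e.g. derived from profiles like Fig. 1. Placeholder model.
    double estimated_runtime(const std::string& app, int cores, const std::string& platform,
                             const std::map<std::string, double>& slowdown) {
      (void)app;  // a real table would be keyed by application as well
      double base = 64.0 / cores;                   // toy perfectly-scalable baseline
      auto it = slowdown.find(platform);
      return base * (it != slowdown.end() ? it->second : 1.0);
    }

    std::string select_platform(const std::string& app, int cores,
                                const std::vector<Platform>& platforms,
                                const std::map<std::string, double>& slowdown) {
      std::string best;
      double best_completion = std::numeric_limits<double>::max();
      for (const auto& p : platforms) {
        if (p.free_cores < cores) continue;              // cannot start here right now
        double wait = p.queued_core_hours / p.free_cores;   // crude wait-time estimate
        double completion = wait + estimated_runtime(app, cores, p.name, slowdown);
        if (completion < best_completion) { best_completion = completion; best = p.name; }
      }
      return best.empty() ? "defer" : best;
    }

    int main() {
      // Per-platform slowdown of this application relative to the supercomputer (placeholder).
      std::map<std::string, double> is_slowdown = {{"supercomputer", 1.0}, {"cluster", 1.6}, {"cloud", 4.0}};
      std::vector<Platform> platforms = {{"supercomputer", 120.0, 0},
                                         {"cluster", 10.0, 64},
                                         {"cloud", 1.0, 256}};
      std::printf("IS at 32 cores -> %s\n", select_platform("IS", 32, platforms, is_slowdown).c_str());
      return 0;
    }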
9 RELATED WORK

In this section, we summarize the related research on HPC in cloud, including performance evaluation studies.

9.1 Performance and Cost Studies of HPC on Cloud
Walker [5], followed by several others [1], [2], [3], [4], [6], [8], [9], [27], [30], conducted studies on HPC in cloud using benchmarks such as NPB and real applications. Their conclusions can be summarized as follows:

- Primary challenges for HPC in cloud are insufficient network and I/O performance in cloud, resource heterogeneity, and unpredictable interference arising from other VMs [1], [2], [3], [4].
- Bringing cost into the equation results in interesting trade-offs; execution on clouds may be more economical for some HPC applications, compared to supercomputers [4], [27], [31], [32].
- For large-scale HPC or for centers with a large user base, cloud cannot compete with supercomputers based on the metric $/GFLOPS [1], [9].

In this paper, we explored some of the same questions from the perspective of smaller scale HPC users, such as small companies and research groups who have limited access to supercomputer resources and varying demand over time. We also considered the perspective of cloud providers who want to expand their offerings to cover the aggregate of these smaller scale HPC users.
Furthermore, our work explored additional dimensions: (1) with a holistic viewpoint, we considered all the different aspects of running in cloud (performance, cost, and business models), and (2) we explored techniques for bridging the gap between HPC and clouds. We improved HPC performance in cloud (a) by improving the execution time of HPC in cloud and (b) by improving the turnaround time with intelligent scheduling in cloud.

9.2 Bridging the Gap between HPC and Cloud
The approaches towards reducing the gap between traditional cloud offerings and HPC demands can be classified into two categories: (1) those which make clouds HPC-aware, and (2) those which make HPC cloud-aware. In this paper, we presented techniques for both. For (1), we explored techniques in low-overhead virtualization, and quantified how close we can get to the physical machine's performance for HPC workloads. There are other recent efforts on HPC-optimized hypervisors [33], [34]. Other examples of (1) include HPC-optimized clouds such as Amazon Cluster Compute [7] and DoE's Magellan [1], and hardware- and HPC-aware cloud schedulers (VM placement algorithms) [35], [36].
The latter approach (2) has been relatively less explored, but has shown tremendous promise. Cloud-aware load balancers for HPC applications [37] and topology-aware deployment of scientific applications in cloud [38] have shown encouraging results. In this paper, we demonstrated how we can tune the HPC runtime and applications to clouds to achieve improved performance.

9.3 HPC Characterization, Mapping, and Scheduling
There are several tools for scheduling HPC jobs on clusters, such as ALPS, OpenPBS, SLURM, TORQUE, and Condor. They are all job schedulers or resource management systems which aim to utilize system resources in an efficient manner. They differ from our work on scheduling since we perform application-aware scheduling and provide a solution for the multi-platform case. The GrADS [39] project addressed the problem of scheduling, monitoring, and adapting applications to heterogeneous and dynamic grid environments. Our focus is on clouds, and hence we address additional challenges such as virtualization, cost, and pricing models for HPC in cloud.
Kim et al. [40] presented three usage models for hybrid HPC grid and cloud computing: acceleration, conservation, and resilience. However, they use cloud for sequential tasks and do not consider execution of parallel applications. Inspired by that work, we evaluate models for HPC-clouds: substitute, complement, and burst.

10 CONCLUSIONS, LESSONS, FUTURE WORK

Through a performance, economic, and scheduling analysis of HPC applications on a range of platforms, we have
shown that different applications exhibit different characteristics that determine their suitability towards a cloud environment. Table 6 presents our conclusions. Next, we summarize the lessons learned from this research and the emerging future research directions.
Clouds can successfully complement supercomputers, but using clouds to substitute supercomputers is infeasible. Bursting to cloud is also promising. We have shown that by performing multi-platform dynamic application-aware scheduling, a hybrid cloud-supercomputer platform environment can actually outperform its individual constituents. By using an underutilized resource which is "good enough" to get the job done sooner, it is possible to get a better turnaround time for a job (user perspective) and improved throughput (provider perspective). Another potential model for HPC in cloud is to use cloud only when there is high demand (cloud burst). Our evaluation showed that application-agnostic cloud bursting (e.g., the BestFirst heuristic) is unrewarding, but application-aware bursting is a promising research direction. More work is needed to consider other factors in multi-platform scheduling: job quality of service (QoS) contracts, deadlines, priorities, and security. Also, future research is required in cloud pricing in multi-platform environments. Market mechanisms and equilibrium factors in game theory can help automate such decisions.
For efficient HPC in cloud, HPC needs to be cloud-aware and clouds need to be HPC-aware. HPC applications and runtimes must adapt to minimize the impact of slow networks, heterogeneity, and multi-tenancy in clouds. Simultaneously, clouds should minimize overheads for HPC using techniques such as lightweight virtualization and link aggregation with HPC-optimized network topologies. With low-overhead virtualization, web-oriented cloud infrastructure can be reused for HPC. We envisage hybrid clouds that support both HPC and commercial workloads through tuning or VM re-provisioning.
Application characterization for analysis of the performance-cost tradeoffs of complex HPC applications is a non-trivial task, but the economic benefits are substantial. More research is necessary to quickly identify important traits for complex applications with dynamic and irregular communication patterns. A future direction is to evaluate and characterize applications with irregular parallelism [41] and dynamic datasets. For example, challenging data sets arise from 4D CT imaging, 3D moving meshes, and computational fluid dynamics (CFD). The dynamic and irregular nature of such applications makes their characterization even more challenging compared to the regular iterative scientific applications considered in this paper. However, their asynchronous nature, i.e., the lack of fine-grained barrier synchronizations, makes them promising candidates for heterogeneous and multi-tenant clouds.

REFERENCES
[1] K. Yelick, S. Coghlan, B. Draney, R. S. Canon, L. Ramakrishnan, A. Scovel, I. Sakrejda, A. Liu, S. Campbell, P. T. Zbiegiel, T. Declerck, and P. Rich, "The Magellan report on cloud computing for science," U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR), Dec. 2011.
[2] P. Mehrotra, J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff, S. Saini, and R. Biswas, "Performance evaluation of Amazon EC2 for NASA HPC applications," in Proc. 3rd Workshop Scientific Cloud Comput., 2012, pp. 41-50.
[3] C. Evangelinos and C. N. Hill, "Cloud computing for parallel scientific HPC applications: Feasibility of running coupled atmosphere-ocean climate models on Amazon's EC2," in Proc. IEEE Cloud Comput. Appl., Oct. 2008, pp. 2-34.
[4] A. Gupta and D. Milojicic, "Evaluation of HPC applications on cloud," in Proc. Open Cirrus Summit (Best Student Paper), Atlanta, GA, USA, Oct. 2011, pp. 22-26.
[5] E. Walker, "Benchmarking Amazon EC2 for high-performance scientific computing," LOGIN, vol. 33, pp. 18-23, 2008.
[6] A. Gupta, L. V. Kale, D. S. Milojicic, P. Faraboschi, R. Kaufmann, V. March, F. Gioachin, C. H. Suen, and B.-S. Lee, "The who, what, why, and how of HPC applications in the cloud," in Proc. 5th IEEE Int. Conf. Cloud Comput. Technol. Sci. (Best Paper), 2013, pp. 306-314.
[7] High Performance Computing (HPC) on AWS. [Online]. Available: http://aws.amazon.com/hpc-applications
[8] A. Iosup, S. Ostermann, M. N. Yigitbasi, R. Prodan, T. Fahringer, and D. H. J. Epema, "Performance analysis of cloud computing services for many-tasks scientific computing," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 6, pp. 931-945, Jun. 2011.
[9] J. Napper and P. Bientinesi, "Can cloud computing reach the Top500?" in Proc. Combined Workshops on UnConventional High Performance Computing Workshop Plus Memory Access Workshop, New York, NY, USA: ACM, 2009.
[10] Ranger User Guide. [Online]. Available: http://services.tacc.utexas.edu/index.php/ranger-user-guide
[11] A. I. Avetisyan, R. Campbell, I. Gupta, M. T. Heath, S. Y. Ko, G. R. Ganger, M. A. Kozuch, D. O'Hallaron, M. Kunze, T. T. Kwan, et al., "Open Cirrus: A global cloud computing testbed," Computer, vol. 43, no. 4, pp. 35-43, Apr. 2010.
[12] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The Eucalyptus open-source cloud-computing system," in Proc. 9th IEEE/ACM Int. Symp. Cluster Comput. Grid (CCGRID), 2009, pp. 124-131.
[13] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "kvm: The Linux virtual machine monitor," in Proc. Linux Symp., vol. 1, 2007, pp. 225-230.
[14] A. J. Younge, R. Henschel, J. T. Brown, G. von Laszewski, J. Qiu, and G. C. Fox, "Analysis of virtualization technologies for high performance computing environments," in Proc. IEEE Int. Conf. Cloud Comput., 2011, pp. 9-16.
[15] D. Schauer et al., "Linux containers, version 0.7.0," Jun. 2010. [Online]. Available: http://lxc.sourceforge.net/
[16] Intel Corporation, "Intel Virtualization Technology for Directed I/O," Tech. Rep. D51397-006, Feb. 2011. [Online]. Available: http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf
[17] NAS Parallel Benchmarks (NPB). [Online]. Available: http://nas.nasa.gov/publications/npb.html
[18] "MPI: A message passing interface standard," MPI Forum, 1994.
[19] L. Kale and S. Krishnan, "CHARM++: A portable concurrent object oriented system based on C++," in Proc. 8th Annu. Conf. Object-Oriented Program. Syst., Languages, Appl., 1993, pp. 91-108.
[20] A. Bhatele, S. Kumar, C. Mei, J. C. Phillips, G. Zheng, and L. V. Kale, "Overcoming scaling challenges in biomolecular simulations across multiple platforms," in Proc. IEEE Int. Symp. Parallel Distrib. Process., 2008, pp. 1-12.
[21] P. Jetley, F. Gioachin, C. Mendes, L. V. Kale, and T. R. Quinn, "Massively parallel cosmological simulations with ChaNGa," in Proc. IEEE Int. Symp. Parallel Distrib. Process., 2008, pp. 1-12.
[22] The ASCI Sweep3D code. [Online]. Available: http://wwwc3.lanl.gov/pal/software/sweep3d
[23] O. Zaki, E. Lusk, W. Gropp, and D. Swider, "Toward scalable performance visualization with Jumpshot," Int. J. High Perform. Comput. Appl., vol. 13, no. 3, pp. 277-288, Fall 1999.
[24] T. Watanabe, M. Nakao, T. Hiroyasu, T. Otsuka, and M. Koibuchi, "Impact of topology and link aggregation on a PC cluster with Ethernet," in Proc. IEEE CLUSTER, Sep./Oct. 2008, pp. 280-285.
[25] C. Bischof, D. an Mey, and C. Iwainsky, "Brainware for green HPC," Comput. Sci. Res. Develop., 2011, pp. 1-7. [Online]. Available: http://dx.doi.org/10.1007/s00450-011-0198-5
[26] NVIDIA GRID. [Online]. Available: http://nvidia.com/object/virtual-gpus.html
[27] A. Gupta, L. V. Kale, D. S. Milojicic, P. Faraboschi, R. Kaufmann, V. March, F. Gioachin, C. H. Suen, and B.-S. Lee, "Exploring the performance and mapping of HPC applications to platforms in the cloud," in Proc. 21st Int. Symp. High-Perform. Parallel Distrib. Comput., 2012, pp. 121-122.
[28] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. De Rose, and R. Buyya, "CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms," Softw. Pract. Exper., vol. 41, no. 1, pp. 23-50, Jan. 2011.
[29] Parallel Workloads Archive. [Online]. Available: http://www.cs.huji.ac.il/labs/parallel/workload/
[30] J. Ekanayake, X. Qiu, T. Gunarathne, S. Beason, and G. C. Fox, "High performance parallel computing with clouds and cloud technologies," in Cloud Computing. New York, NY, USA: Springer, 2010.
[31] E. Roloff, M. Diener, A. Carissimi, and P. Navaux, "High performance computing in the cloud: Deployment, performance and cost efficiency," in Proc. CloudCom, 2012, pp. 371-378.
[32] A. Marathe, R. Harris, D. K. Lowenthal, B. R. de Supinski, B. Rountree, M. Schulz, and X. Yuan, "A comparative study of high-performance computing on the cloud," in Proc. Int. Symp. High-Perform. Parallel Distrib. Comput., 2013, pp. 239-250.
[33] J. Lange, K. Pedretti, T. Hudson, P. Dinda, Z. Cui, L. Xia, P. Bridges, A. Gocke, S. Jaconette, M. Levenhagen, and R. Brightwell, "Palacios and Kitten: New high performance operating systems for scalable virtualized and native supercomputing," in Proc. IEEE Int. Symp. Parallel Distrib. Process., 2010, pp. 1-12.
[34] B. Kocoloski, J. Ouyang, and J. Lange, "A case for dual stack virtualization: Consolidating HPC and commodity applications in the cloud," in Proc. 3rd ACM Symp. Cloud Comput., New York, NY, USA, 2012, pp. 23:1-23:7.
[35] Heterogeneous Architecture Scheduler. [Online]. Available: http://wiki.openstack.org/HeterogeneousArchitectureScheduler
[36] A. Gupta, L. Kale, D. Milojicic, P. Faraboschi, and S. Balle, "HPC-aware VM placement in infrastructure clouds," in Proc. IEEE Int. Conf. Cloud Eng., Mar. 2013, pp. 11-20.
[37] A. Gupta, O. Sarood, L. Kale, and D. Milojicic, "Improving HPC application performance in cloud through dynamic load balancing," in Proc. 13th IEEE/ACM Int. Symp. Cluster, Cloud, Grid Comput., 2013, pp. 402-409.
[38] P. Fan, Z. Chen, J. Wang, Z. Zheng, and M. R. Lyu, "Topology-aware deployment of scientific applications in cloud computing," in Proc. IEEE 5th Int. Conf. Cloud Comput., 2012, pp. 319-326.
[39] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, J. Mellor-Crummey, et al., "The GrADS project: Software support for high-level grid application development," Int. J. High Perform. Comput. Appl., vol. 15, pp. 327-344, 2001.
[40] H. Kim, Y. el Khamra, I. Rodero, S. Jha, and M. Parashar, "Autonomic management of application workflows on hybrid computing infrastructure," Sci. Program., vol. 19, no. 2, pp. 75-89, Jan. 2011.
[41] M. Kulkarni, M. Burtscher, R. Inkulu, K. Pingali, and C. Casçaval, "How much parallelism is there in irregular applications?" SIGPLAN Not., vol. 44, no. 4, pp. 3-14, Feb. 2009.

Abhishek Gupta received the BTech degree in computer science and engineering from the Indian Institute of Technology (IIT), Roorkee, India, in 2008, and the MS and PhD degrees in computer science from the University of Illinois at Urbana-Champaign (UIUC) in 2011 and 2014, respectively. He is currently a cloud security architect at Intel Corp. His current research interests include parallel programming, HPC, scheduling, and cloud computing.

Paolo Faraboschi received the PhD degree in electrical engineering and computer science from the University of Genoa, Italy. He is currently a distinguished technologist at HP Labs. His current research interests include the intersection of architecture and software. From 2004 to 2009, he led the HPL research activity on system-level simulation. From 1995 to 2003, he was the principal architect of the Lx/ST200 family of VLIW cores. He is a fellow of the IEEE and an active member of the computer architecture community.

Filippo Gioachin received the Laurea degree in computer science and engineering from the University of Padova, and the PhD degree in computer science from the University of Illinois at Urbana-Champaign. He is currently a research manager and senior researcher at HP Labs Singapore, where he is contributing to innovation in cloud computing. His main focus is on integrated online software development.

Laxmikant V. Kale received the BTech degree in electronics engineering from Banaras Hindu University, India, in 1977, the ME degree in computer science from the Indian Institute of Science, Bangalore, India, in 1979, and the PhD degree in computer science from the State University of New York, Stony Brook, in 1985. He is currently a full professor at the University of Illinois at Urbana-Champaign. His current research interests include parallel computing. He is a fellow of the IEEE.

Richard Kaufmann received the BA degree from the University of California at San Diego in 1978, where he was a member of the UCSD Pascal Project. He is currently the VP of the Cloud Lab at Samsung Data Systems. Previously, he was chief technologist of HP's Cloud Services Group and HP's cloud and high-performance computing server groups.

Bu Sung Lee received the BSc (Hons.) and PhD degrees from the Department of Electrical and Electronics Engineering, Loughborough University of Technology, United Kingdom, in 1982 and 1987, respectively. He is currently an associate professor with the School of Computer Engineering, Nanyang Technological University. He held a joint position as the director (Research) of HP Labs Singapore from 2010 to 2012. His current research interests include mobile and pervasive networks, distributed systems, and cloud computing.

Verdi March received the BSc degree from the Faculty of Computer Science, University of Indonesia, in 2000, and the PhD degree from the Department of Computer Science, National University of Singapore, in 2007. He is currently a lead research scientist with Visa Labs. Prior to working with Visa Labs, he held various research or engineering positions with HP Labs, Sun Microsystems Inc., and the National University of Singapore.
Dejan Milojicic received the BSc and MSc degrees from Belgrade University, Belgrade, Serbia, in 1983 and 1986, respectively, and the PhD degree from the University of Kaiserslautern, Kaiserslautern, Germany, in 1993. He is currently a senior researcher at HP Labs, Palo Alto, CA. He is the IEEE Computer Society 2014 president and the founding editor-in-chief of IEEE Computing Now. He was at the OSF Research Institute, Cambridge, MA, from 1994 to 1998, and at the Institute Mihajlo Pupin, Belgrade, Serbia, from 1983 to 1991. He is a fellow of the IEEE.

ChunHui Suen received the BEng (First Class Hons.) degree from the National University of Singapore, Singapore, and the MSc and PhD degrees from Technische Universitaet Muenchen, Munich, Germany. He is currently an engineer and researcher with a keen interest in the fields of IT security, virtualization, and cloud, with research experience in TPM-related technologies, various hypervisors (Xen, KVM, VMware), Linux kernel development, and various cloud stacks (OpenStack, AWS).
" For more information on this or any other computing topic,
please visit our Digital Library at www.computer.org/publications/dlib.
Authorized licensed use limited to: Nirma University Institute of Technology. Downloaded on October 03,2024 at 07:17:34 UTC from IEEE Xplore. Restrictions apply.