
IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 4, NO. 3, JULY-SEPTEMBER 2016

Evaluating and Improving the Performance and Scheduling of HPC Applications in Cloud
Abhishek Gupta, Paolo Faraboschi, Fellow, IEEE, Filippo Gioachin, Laxmikant V. Kale, Fellow, IEEE,
Richard Kaufmann, Bu-Sung Lee, Verdi March, Dejan Milojicic, Fellow, IEEE, and Chun Hui Suen

Abstract—Cloud computing is emerging as a promising alternative to supercomputers for some high-performance computing (HPC)
applications. With cloud as an additional deployment option, HPC users and providers are faced with the challenges of dealing with
highly heterogeneous resources, where the variability spans across a wide range of processor configurations, interconnects,
virtualization environments, and pricing models. In this paper, we take a holistic viewpoint to answer the question—why and who
should choose cloud for HPC, for what applications, and how should cloud be used for HPC? To this end, we perform comprehensive
performance and cost evaluation and analysis of running a set of HPC applications on a range of platforms, varying from
supercomputers to clouds. Further, we improve performance of HPC applications in cloud by optimizing HPC applications’
characteristics for cloud and cloud virtualization mechanisms for HPC. Finally, we present novel heuristics for online application-aware
job scheduling in multi-platform environments. Experimental results and simulations using CloudSim show that current clouds cannot
substitute supercomputers but can effectively complement them. Significant improvement in average turnaround time (up to 2X)
and throughput (up to 6X) can be attained using our intelligent application-aware dynamic scheduling heuristics compared to
single-platform or application-agnostic scheduling.

Index Terms—HPC, cloud, performance analysis, economics, job scheduling, application-awareness, characterization

1 INTRODUCTION

Increasingly, some academic and commercial HPC users are looking at clouds as a cost-effective alternative to dedicated HPC clusters [1], [2], [3], [4]. Renting rather than owning a cluster avoids the up-front and operating expenses associated with a dedicated infrastructure. Clouds offer the additional advantages of a) elasticity—on-demand provisioning, and b) virtualization-enabled flexibility, customization, and resource control.

Despite these advantages, it still remains unclear whether, and when, clouds can become a feasible substitute or complement to supercomputers. HPC is performance-oriented, whereas clouds are cost and resource-utilization oriented. Furthermore, clouds have traditionally been designed to run business and web applications. Previous studies have shown that commodity interconnects and the overhead of virtualization on network and storage performance are major performance barriers to the adoption of cloud for HPC [1], [2], [3], [4], [5], [6]. While the outcome of these studies paints a rather pessimistic view of HPC clouds, recent efforts towards HPC-optimized clouds, such as Magellan [1] and Amazon's EC2 Cluster Compute [7], point to a promising direction to overcome some of the fundamental inhibitors.

HPC clouds rapidly expand the application user base and the available platform choices to run HPC workloads: from in-house dedicated supercomputers, to commodity clusters with and without HPC-optimized interconnects and operating systems, to resources with different degrees of virtualization (full, CPU-only, none), to hybrid configurations that offload part of the work to the cloud. HPC users and cloud providers are faced with the challenge of choosing the optimal platform based upon a limited knowledge of application characteristics, platform capabilities, and the target metrics such as cost.

This trend results in a potential mismatch between the required and selected resources for an HPC application. One possible undesirable scenario can result in part of the infrastructure being overloaded and another being idle, which in turn yields large wait times and reduced overall throughput. Existing HPC scheduling systems are not designed to deal with these issues. Hence, novel scheduling algorithms and heuristics need to be explored to perform well in such scenarios.

Unlike previous works [1], [2], [3], [4], [5], [8], [9] on benchmarking clouds for science, we take a more holistic and practical viewpoint. Rather than limiting ourselves to the problem—what is the performance achieved on cloud versus supercomputer—we address the bigger and more important question—why and who should choose (or not choose) cloud for HPC, for what applications, and how should cloud be used for HPC? While addressing this research problem, we make the following contributions.

- We evaluate the performance of HPC applications on a range of platforms varying from supercomputer to cloud. Also, we analyze bottlenecks and the correlation between application characteristics and observed performance, identifying what applications are suitable for cloud. (Sections 3, 4)
- We also evaluate the performance when running the same benchmarks on exactly the same hardware, without and with different virtualization technologies, thus providing a detailed analysis of the isolated impact of virtualization on HPC applications. (Section 6)
- To bridge the divide between HPC and clouds, we present the complementary approach of (1) making HPC applications cloud-aware by optimizing an application's computational granularity and problem size for cloud, and (2) making clouds HPC-aware using thin hypervisors, OS-level containers, and hypervisor- and application-level CPU affinity, addressing how to use cloud for HPC. (Sections 5, 6)
- We investigate the economic aspects of running in cloud and discuss why it is challenging or rewarding for cloud providers to operate a business for HPC compared to traditional cloud applications. We also show that small/medium-scale users are the likely candidates who can benefit from an HPC-cloud. (Section 7)
- Instead of considering cloud as a substitute of supercomputer, we investigate the co-existence of multiple platforms—supercomputer, cluster, and cloud. We research novel heuristics for application-aware scheduling of jobs in this multi-platform scenario, significantly improving average job turnaround time (up to 2X) and job throughput (up to 6X), compared to running all jobs on supercomputer. (Section 8)

The insights from performance evaluation, characterization, and multi-platform scheduling are useful for both HPC users and cloud providers. Users can better quantify the benefits of moving to a cloud. Cloud providers can optimize the allocation of applications to their infrastructure to maximize utilization and turnaround times.

A. Gupta and L.V. Kale are with the Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801. E-mail: {charm, kale}@illinois.edu. P. Faraboschi, F. Gioachin, R. Kaufmann, B.-S. Lee, V. March, D. Milojicic, and C.H. Suen are with Hewlett Packard Labs, Palo Alto, CA 94304. Manuscript received 3 Feb. 2014; revised 1 July 2014; accepted 2 July 2014. Date of publication 17 July 2014; date of current version 7 Sept. 2016. Digital Object Identifier no. 10.1109/TCC.2014.2339858.

2 EVALUATION METHODOLOGY
In this section, we describe the platforms which we compared and the application suite used in this study.

2.1 Experimental Testbed
We selected platforms with different interconnects, operating systems, and virtualization support to cover the dominant classes of infrastructures available today to an HPC user. Table 1 shows the details of each platform. In the case of cloud, a node refers to a virtual machine and a core refers to a virtual core. For example, "2 × QEMU Virtual CPU @2.67 GHz" means each VM has two virtual cores. Ranger [10] at TACC was a supercomputer (decommissioned in Feb. 2013), and Taub at UIUC is an HPC-optimized cluster. Both use Infiniband as interconnect. Moreover, Taub uses Scientific Linux as OS and has QDR Infiniband with a bandwidth of 40 Gbps. We used physical nodes with commodity interconnect at the Open Cirrus testbed at the HP Labs site [11]. The next two platforms are clouds—a private cloud set up using Eucalyptus [12], and a public cloud. We use KVM [13] for virtualization since it has been shown to be a good candidate for HPC virtualization [14]. Finally, we also used an HPC-optimized cloud—Amazon EC2 Cluster Compute Cloud [7] of the US West (Oregon) zone, cc2.8xlarge instances with Xen HVM virtualization launched in the same placement group for best networking performance [7].

TABLE 1
Testbed

Platform      | Processors in a Node            | Memory | Network                       | OS
Ranger        | 16 × AMD Opteron QC @2.3 GHz    | 32 GB  | Infiniband (1 GB/s)           | Linux
Taub          | 12 × Intel Xeon X5650 @2.67 GHz | 48 GB  | QDR Infiniband                | Sci. Linux
Open Cirrus   | 4 × Intel Xeon E5450 @3.00 GHz  | 48 GB  | 10GigE internal, 1GigE x-rack | Ubuntu 10.04
Private Cloud | 2 × QEMU Virtual CPU @2.67 GHz  | 6 GB   | Emulated 1GigE                | Ubuntu 10.04
Public Cloud  | 4 × QEMU Virtual CPU @2.67 GHz  | 16 GB  | Emulated 1GigE                | Ubuntu 10.10
EC2-CC Cloud  | 16 × Xen HVM VCPU @2.6 GHz      | 60 GB  | Emulated 10GigE               | Ubuntu 12.04

Another dedicated physical cluster at HP Labs Singapore (HPLS) is used for controlled tests of the effects of virtualization (see Table 2). This cluster is connected with a Gigabit Ethernet network on a single switch. Every server has two CPU sockets, each populated with a six-core CPU, resulting in 12 physical cores per node. The experiment on the HPLS cluster involved benchmarking on four configurations: physical machines (bare), LXC containers [15], VMs configured with the default emulated network (plain VM), and VMs with pass-through networking, enabled by configuring the input/output memory management unit (IOMMU) on the Linux hosts to allow VMs to directly access the Ethernet hardware (thin VM) [16]. Both the plain VM and thin VM run atop KVM. This testbed is designed to test the isolated impact of virtualization, impossible to measure on public clouds due to the lack of direct access to the public cloud's hardware.

TABLE 2
Virtualization Testbed

Configuration   | Processors in a Node/VM         | Memory | Network        | OS
Phy., Container | 12 × Intel Xeon X5650 @2.67 GHz | 120 GB | 1GigE          | Ubuntu 11.04
Thin VM         | 12 × QEMU Virtual CPU @2.67 GHz | 100 GB | 1GigE          | Ubuntu 11.04
Plain VM        | 12 × QEMU Virtual CPU @2.67 GHz | 100 GB | Emulated 1GigE | Ubuntu 11.04
Fig. 1. Time in seconds (y-axis) versus core count (x-axis) for different applications (strong scaling except Sweep3D). All applications scale well on supercomputers and most scale moderately well on Open Cirrus. On clouds, some applications scale well (e.g., EP), some scale till a point (e.g., ChaNGa), whereas some do not scale (e.g., IS).

2.2 Benchmarks and Applications
To gain thorough insights into the performance of the selected platforms, we chose benchmarks and applications from different scientific domains and ones which differ in the nature, amount, and pattern of inter-processor communication. Similarly to previous work [2], [3], [4], [5], we used the NAS parallel benchmarks (NPB) class B [17] (the MPI version, NPB3.3-MPI), which exhibit a good variety of computation and communication requirements. Moreover, we chose additional benchmarks and real world applications, written in two different parallel programming environments – MPI [18] and CHARM++ [19]:

- Jacobi2D—A five-point stencil kernel which averages values in a 2D grid, and is common in scientific simulations, numerical linear algebra, solutions of partial differential equations, and image processing.
- NAMD [20]—A highly scalable molecular dynamics application representative of a complex real world application used ubiquitously on supercomputers. We used the ApoA1 input (92k atoms).
- ChaNGa [21]—A cosmological simulation application which performs collisionless N-body interactions using a Barnes-Hut tree for calculating forces. We used a 300,000 particle system.
- Sweep3D [22]—A particle transport code widely used for evaluating HPC architectures. Sweep3D exploits parallelism via a wavefront process. We ran the MPI-Fortran77 code in weak scaling mode, maintaining 5 × 5 × 400 cells per processor.
- NQueens—A backtracking state space search implemented as a tree-structured computation. The goal is to place N queens on an N × N chessboard (N = 18 in our runs) so that no two queens attack each other. Communication happens only for load balancing.

On Ranger and Taub, we used MVAPICH2 for MPI and the ibverbs layer of CHARM++. On the remaining platforms we installed Open MPI and the net layer of CHARM++.

3 BENCHMARKING HPC PERFORMANCE
Fig. 1 shows the scaling behavior of our testbeds for the selected applications. These results are averaged across multiple runs (five executions) performed at different times. We show strong scaling results for all applications except Sweep3D, for which we chose to perform weak scaling runs. For NPB, we present results for only the embarrassingly parallel (EP), LU solver (LU), and integer sort (IS) benchmarks due to space constraints. The first observation is the difference in sequential performance: Ranger takes almost twice as long as the other platforms, primarily because of its older and slower processors. The slope of each curve shows how the applications scale on different platforms. Despite the poor sequential speed, Ranger's performance crosses Open Cirrus, private cloud, and public cloud for some applications at around 32 cores, yielding much more linearly scalable parallel performance. We investigated the reasons for the better scalability of these applications on Ranger using application profiling, performance tools, and microbenchmarking, and found that network performance is a dominant factor (see Section 4).
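A note on reading these curves (standard scaling definitions, not specific to our setup): under strong scaling the total problem size stays fixed as the core count P grows, so ideal behavior is

T(P) ≈ T(P_ref) · P_ref / P,

i.e., the time halves whenever the core count doubles, whereas under weak scaling (used here for Sweep3D, which keeps 5 × 5 × 400 cells per processor) the work per core stays fixed and ideal behavior is a flat T(P). Deviations from these ideals in Fig. 1 are what we attribute to communication and virtualization overheads in the following sections.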
We observed three different patterns for applications on these platforms. First, some applications such as EP, Jacobi2D, and NQueens scale well on all the platforms up to 128–256 cores. The second pattern is that some applications, such as LU, NAMD, and ChaNGa, scale on private cloud till 32 cores and stop scaling afterwards. These do well on other platforms including Open Cirrus. The likely reason for this trend is the impact of virtualization on network performance (which we confirm below). On public cloud, we used VM instances with four virtual cores, hence inter-VM communication starts after four cores, resulting in a sudden performance penalty above four cores. A similar performance dip can be observed for EC2-CC cloud at 16 cores, where each VM had 16 cores. However, in contrast to private and public cloud, EC2-CC cloud provides good scalability to NAMD. Finally, some applications, especially the IS benchmark, perform very poorly on the clouds and Open Cirrus. Sweep3D also exhibits poor weak scaling after four to eight cores on cloud.

In the case of cloud, we observed variability in the execution time across runs, which we quantified by calculating the coefficient of variation (standard deviation/mean) of the runtime across five executions. Fig. 2a shows that there is significant performance variability on cloud compared to supercomputer and that the variability increases as we scale up, partially due to the decrease in computational granularity. At 256 cores on public cloud, the standard deviation is equal to half the mean, resulting in low run-to-run predictability. In contrast, EC2-CC cloud shows less variability. Also, performance variability is different for different applications (see Fig. 2b). Correlating Figs. 1 and 2b, we can observe that the applications which scale poorly, e.g. ChaNGa and LU, are the ones which exhibit more performance variability.

Fig. 2. Performance variation.

4 PERFORMANCE BOTTLENECKS IN CLOUD
To investigate the reasons for the different performance trends of different applications, we obtained the applications' communication characteristics. We used tracing and visualization – the Projections tool for CHARM++ applications and MPE and Jumpshot [23] for MPI applications. Table 3 shows the results obtained by running on 64 cores of Taub. These numbers are cumulative across all processes. For MPI applications, we have listed the data for point-to-point and collective operations (such as MPI_Barrier and MPI_AlltoAll) separately; the numbers in parentheses correspond to collectives.

TABLE 3
Application Communication Characteristics: MPI Collectives in Parentheses, Bottlenecks in Bold

Application | Message Count (K/s) | Message Volume (MB/s)
Jacobi2D    | 220.9               | 309.62
NAMD        | 293.3               | 1705.63
NQueens     | 37.4                | 3.99
ChaNGa      | 2075.8              | 308.53
NPB-EP      | 0 (0.2)             | 0 (0.01)
NPB-LU      | 907.3 (0.2)         | 415.31 (0.01)
NPB-IS      | 0.3 (9.0)           | 0.00 (5632)
Sweep3D     | 2312.2 (7.2)        | 2646.17 (0.02)

It is clear from Table 3 that Jacobi2D, NQueens, and EP perform a relatively small amount of communication. Moreover, we can categorize applications as latency-sensitive, i.e. large message counts with relatively small message volume, e.g. ChaNGa, or bandwidth-sensitive, i.e. large message volume with relatively small message count, e.g. NAMD, or both, e.g. Sweep3D. The point-to-point communication in IS is negligible. However, it is the only application in the set which performs heavy communication using collectives. This was validated using the Jumpshot visualization tool for MPE logs [23]. Fig. 4 shows the timeline of execution of IS during this benchmarking, with red (dark) color representing MPI_AlltoAllv collective communication with contributions of 2 MB by each of the 64 processors. It is evident that this operation is the dominant component of execution time for this benchmark.

Fig. 4. Timeline of execution of IS on 64 cores on Taub. Red/dark (MPI_AlltoAllv) is the dominating factor.

Juxtaposing Table 3 and Fig. 1, we can observe the correlation between the applications' communication characteristics and the performance attained, especially on cloud. To validate that communication performance is the primary bottleneck in cloud, we used the Projections performance analysis tool. Fig. 3a shows the CPU utilization for a 64-core Jacobi2D experiment on private cloud, the x-axis being the (virtual) core number. It is clear that the CPU is under-utilized for almost half the time, as shown by the idle time (white portion) in the figure. A detailed timeline view revealed that this time was spent waiting to receive data from other processes. Similarly, for the other applications which performed poorly on cloud, communication time was a considerable portion of the parallel execution time.

Fig. 3. (a) CPU utilization for Jacobi2D on 32 two-core VMs of private cloud. White portion: idle time, colored portions: application functions. (b) Network performance on private and public clouds is off by almost two orders of magnitude compared to supercomputers. EC2-CC provides high bandwidth but poor latency.

Since many HPC applications are highly sensitive to communication, we focused on network performance. Fig. 3b shows the results of a simple ping-pong benchmark written in Converse, the underlying substrate of CHARM++. Unsurprisingly, we found that the latencies and bandwidth on private and public clouds are a couple of orders of magnitude worse compared to Ranger and Taub, making it challenging for communication-intensive applications, such as IS, LU, and NAMD, to scale. EC2-CC cloud provides high bandwidth, enabling bandwidth-sensitive applications, such as NAMD, to scale. However, its large latency results in poor performance of latency-sensitive applications (e.g. ChaNGa).
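The benchmark used for Fig. 3b is written in Converse; as a concrete illustration of the measurement itself, the sketch below is an equivalent MPI ping-pong (our reconstruction, not the code used in the paper). Latency is estimated from the round-trip time of small messages and bandwidth from large ones; the message sizes, repetition count, and output format are choices made for this sketch.

// pingpong.cpp: minimal MPI ping-pong between ranks 0 and 1 to estimate
// one-way latency and bandwidth (illustrative sketch, not the Converse
// benchmark used for Fig. 3b).
// Build and run: mpicxx -O2 pingpong.cpp -o pingpong && mpirun -np 2 ./pingpong
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  const int reps = 1000;
  for (int size = 1; size <= (1 << 22); size *= 4) {            // 1 B to 4 MB
    std::vector<char> buf(size);
    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
      if (rank == 0) {
        MPI_Send(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(buf.data(), size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      } else if (rank == 1) {
        MPI_Recv(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(buf.data(), size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
      }
    }
    double elapsed = MPI_Wtime() - start;
    if (rank == 0) {
      double one_way_us = elapsed / (2.0 * reps) * 1e6;              // ~latency for small sizes
      double bw_MBps = (2.0 * reps * (double)size) / elapsed / 1e6;  // ~bandwidth for large sizes
      printf("%8d bytes: %10.2f us one-way, %10.2f MB/s\n", size, one_way_us, bw_MBps);
    }
  }
  MPI_Finalize();
  return 0;
}

On an RDMA-capable interconnect such as the Infiniband networks of Ranger and Taub, the small-message one-way time is typically a few microseconds, whereas emulated Gigabit Ethernet in the private and public clouds typically sits in the hundreds of microseconds—the two-orders-of-magnitude gap visible in Fig. 3b.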
While the inferior network performance explains the large idle time in Fig. 3a, a surprising observation is the notable difference in idle time for alternating cores (0 and 1) of each VM. We traced this effect to network virtualization. The light (green) colored portion at the very bottom of Fig. 3a represents the application function which initiates inter-processor communication through socket operations and interacts with the virtual network. The application process on core 0 of the VM shares the CPU with the network emulator. This interference increases as the application communicates more data. Hence, a virtualized network degrades HPC performance in multiple ways: it increases network latency, reduces bandwidth, and interferes with the application process.

We also observed that even when we used only core 0 of each VM, for iterative applications containing a barrier after each iteration, there was significant idle time on some processes at random times. Communication time could not explain such random idle times. These random idle times can be attributed to interference (noise or jitter) from other system or application processes on the same node. Quantification of this noise using a micro-benchmark showed that a fixed tiny unit of sequential work on a commodity server can have up to 100 percent variation in runtime across multiple runs (details in Section 6.2). Noise can severely degrade performance, especially for bulk-synchronous HPC applications, since the slowest thread dictates the speed. Unlike supercomputers, where the operating system (OS) is specifically tuned to minimize noise, e.g., Scientific Linux on Taub, clouds typically use a non-tuned OS. In addition, clouds have a further intrinsic disadvantage due to the presence of the hypervisor.

Causes of performance variability: As seen in Fig. 2, there is also significant run-to-run performance variability in clouds. This can be attributed to two features intrinsic to clouds—hardware heterogeneity and multi-tenancy, that is, multiple users sharing the cloud—which cause variability in the following ways. (1) Heterogeneity in physical hardware coupled with hardware-agnostic VM placement results in non-uniformity across different allocations (the same total number of VMs placed on different nodes). (2) Multi-tenancy at the cluster level results in a shared cluster network, which may result in network contention and dynamic communication heterogeneity [1]. (3) Multi-tenancy at the node level (sometimes even core level) results in dynamic compute heterogeneity and also degrades performance due to sharing of resources (such as cache, memory, disk and network bandwidth, and CPU) in a multi-core node with other users' VMs placed on the same node. Like most cloud environments, in our private and public clouds, physical nodes (not cores) were shared by VMs of external users, hence providing a multi-tenant environment.

5 OPTIMIZING HPC FOR CLOUD
In Section 4, we found that the poor cloud network performance is a major bottleneck for HPC. Hence, to achieve good performance in cloud, it is imperative to either adapt HPC runtimes and applications to slow cloud networks (cloud-aware HPC), or improve networking performance in cloud (HPC-aware clouds). Next, we explore the former approach, that is, making HPC cloud-aware. The latter approach is discussed in Section 6.

5.1 Computational Granularity/Grain Size
One way to minimize the sensitivity to network performance is to hide network latencies by overlapping computation and communication. A promising direction is asynchronous object/thread-centric execution rather than the MPI-style processor-centric approach.

When there are multiple medium-grained work/data units (objects/tasks) per processor (referred to as overdecomposition), and an object needs to wait for a message, control can be asynchronously transferred to another object which has a message to process. Such scheduling keeps the processor utilized and results in automatic overlap between computation and communication. Our hypothesis is that overdecomposition and proper grainsize control can be crucial in clouds with slow networks.

To validate our hypothesis, we analyze the effect of the CHARM++ object grain size (or decomposition block size) on the execution time of Jacobi2D on 32 cores of different platforms (Fig. 5a). Fig. 5a shows that the variation in execution time with grain size is significantly larger for private cloud as compared to the other platforms. As we decrease the grain size, hence increasing the number of objects per processor, execution time decreases due to increased overlap of communication and computation. However, after a threshold, execution time increases. This trend results from the tradeoff between the speedup due to the overlap and the slowdown due to the parallel runtime's overhead of managing a large number of objects.
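The tradeoff can be made explicit with a rough cost model; the symbols and the functional form below are ours and only sketch the effect—they are not a formula from the paper. With W the computation assigned to a core per iteration, grain size g (so m = W/g objects per core), τ the runtime's per-object scheduling and messaging overhead, and C_exposed(m) the communication time that cannot be hidden behind the work of the other m − 1 objects:

T_iter(g) ≈ W + (W/g)·τ + C_exposed(W/g).

The overhead term (W/g)·τ grows as the grain becomes finer, while C_exposed shrinks, because more ready objects give the scheduler something to execute while messages are in flight. The minimum of this curve is the sweet spot visible in Fig. 5a, and its location depends on the ratio of network latency to scheduling overhead, which is why grain size matters much more on the private cloud than on Taub.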
We used the Projections tool to visualize the achieved benefit. Fig. 5b shows the timelines of 12 processes of a Jacobi2D execution with and without over-decomposition. Blue represents application functions whereas white represents idle time. There is a lot less idling in Fig. 5b, resulting in reduced overall execution time.

Fig. 5. Optimizing HPC for cloud: Effect of grain size and problem size.

5.2 Problem Sizes and Runtime Retuning
Fig. 6 shows the effect of problem size on the performance (speedup) of different applications on private cloud and supercomputer (Taub). With increasing problem sizes (A→B→C), applications scale better, and the gap between cloud and supercomputer reduces. Fig. 5c reaffirms the positive impact of problem size. For Jacobi, we denote class A as a 1k × 1k, class B as a 4k × 4k, and class C as a 16k × 16k grid. As the problem size increases (say by a factor of X) with a fixed number of processors, for most scalable HPC applications the increase in communication (e.g. Θ(X) for Jacobi2D) is less than the increase in computation (Θ(X²) for Jacobi2D). Hence, the communication-to-computation ratio decreases with increasing problem size, which results in a reduced performance penalty of execution on a platform with a poor interconnect. Thus, adequately large problem sizes, such that the communication-to-computation ratio is adequately small, can be run more effectively in cloud. Furthermore, applying our cost analysis methodology (Section 7), Fig. 5c can be used to estimate the cross-over points of the problem size where it would be cost-effective to run on supercomputer versus cloud.
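To see where the Θ(X) and Θ(X²) terms come from, consider Jacobi2D on P processors with an N × N grid split into square blocks (a standard back-of-the-envelope analysis; the per-cell compute time f and per-byte transfer time w are our own symbols). Per process and per iteration,

computation ∝ (N²/P)·f,    halo exchange ∝ 4·(N/√P)·w,

so

T_comm / T_comp ∝ (4√P/N)·(w/f).

Scaling the grid dimension N by X (class A → B → C corresponds to N = 1k → 4k → 16k) multiplies computation by X² but communication only by X, shrinking the ratio by 1/X. A slow interconnect (large w) can therefore be partially compensated by a larger problem size, which is the trend seen in Figs. 5c and 6.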
While performing these experiments, we learned that parallel runtime systems have been tuned to exploit fast HPC networks. For best performance on cloud, some of the network parameters need to be re-tuned for the commodity cloud network. For example, in the case of CHARM++, increasing the maximum datagram size from 1,400 to 9,000, reducing the window size from 32 to 8, and increasing the acknowledgement delay from 5 to 18 ms resulted in 10-50 percent performance improvements for our applications.

Fig. 6. Effect of problem size class on attained speedup on supercomputer (Taub) versus private cloud.

6 OPTIMIZING CLOUD FOR HPC
Cloud-aware HPC execution reduces the penalty caused by the underlying slow physical network in clouds, but it does not address the overhead of network virtualization. Next, we explore optimizations to mitigate the virtualization overhead, hence making clouds HPC-aware.

6.1 Lightweight Virtualization
We consider two lightweight virtualization techniques: thin VMs configured with PCI pass-through for I/O, and containers, that is, OS-level virtualization. Lightweight virtualization reduces the overhead of network virtualization by granting VMs native access to physical network interfaces. Using a thin VM with IOMMU, a physical network interface is allocated exclusively to a VM, preventing the interface from being shared by the sibling VMs and the hypervisor. This may lead to under-utilization when the thin VM generates insufficient network load. Containers such as LXC [15] share the physical network interface with their sibling containers and their host. However, containers must run the same OS as their underlying host. Thus, there is a trade-off between resource multiplexing and the flexibility offered by VMs.

The first five columns of Table 4 validate that network virtualization is the primary bottleneck of cloud. These experiments were conducted on the virtualization testbed described earlier (Table 2). Plain VM attains poor scaling, but on thin VM, NAMD execution times closely track those on the physical machine even as multiple nodes are used (i.e., 16 cores onwards). The performance trend of containers also resembles that of the physical machine. This demonstrates that thin VM and containers significantly lower the communication overhead. This low overhead was further validated by the ping-pong test.

6.2 Impact of CPU Affinity
CPU affinity instructs the OS to bind a process (or thread) to a specific CPU core. This prevents the OS from inadvertently migrating the process. If all important processes have non-overlapping affinity, it practically prevents multiple processes from sharing a core. In addition, cache locality can be improved by processes remaining on the same core throughout their execution. In the cloud, CPU affinity can be enforced at the application level, which refers to binding processes to the virtual CPUs of a VM, and at the hypervisor level, which refers to binding virtual CPUs to physical CPUs.

Fig. 7. Impact of CPU affinity on CPU performance.

Fig. 7 presents the results of our micro-benchmarks with various CPU affinity settings on different types of virtual environments. In this experiment, we executed 12 processes on a single 12-core virtual or physical machine. Each process runs 500 iterations, where each iteration executes 200 million y = y + rand()/c operations, with c being a constant. Without CPU affinity (Fig. 7a), we observe wide fluctuation in the process execution times, up to twice the minimum execution time (i.e., 2.7 s). This clearly demonstrates that frequently two or more of our benchmark processes are scheduled to the same core. The impact of CPU affinity is even more profound on virtual machines. Fig. 7b shows the minimum and maximum execution times of the 12 processes with CPU affinity enabled on the physical machine, while only application-level affinity (appAFF) is enabled on the thin VM. We observe that the gap between minimum and maximum execution times is narrowed in this case. However, on the thin VM, we still notice frequent spikes, which we attribute to the absence of hypervisor-level affinity (hyperAFF). Even though each process is pinned to a specific virtual core, multiple virtual cores may still be mapped to the same physical core. With hypervisor-level affinity enabled, execution times across virtual cores stabilize close to those of the physical machine (Fig. 7c).
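The micro-benchmark above is straightforward to reproduce. The sketch below is our reconstruction of one benchmark process with optional application-level pinning via sched_setaffinity on Linux; the constant c, the command-line interface, and the timing output are our choices, not the original code.

// affinity_bench.cpp: one process of the CPU-affinity micro-benchmark of
// Section 6.2 (illustrative reconstruction). Launch 12 copies, passing a
// distinct core index to each for appAFF, or -1 to run without pinning.
// Build: g++ -O2 affinity_bench.cpp -o affinity_bench
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <unistd.h>
#include <ctime>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
  int core = (argc > 1) ? atoi(argv[1]) : -1;
  if (core >= 0) {                                  // application-level affinity (appAFF)
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
      perror("sched_setaffinity");
      return 1;
    }
  }
  const double c = 1000.0;                          // arbitrary constant of the benchmark
  double y = 0.0;
  for (int iter = 0; iter < 500; iter++) {          // 500 iterations per process
    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < 200000000L; i++)           // 200 million updates per iteration
      y = y + rand() / c;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("pid %d iter %d: %.3f s\n", (int)getpid(), iter, secs);
  }
  fprintf(stderr, "%f\n", y);                       // keep the loop from being optimized away
  return 0;
}

Hypervisor-level affinity (hyperAFF) is enforced outside the guest, for example by pinning each virtual CPU of a KVM guest to a distinct physical core (with libvirt-managed KVM this can be done with virsh vcpupin), so that the virtual cores of different processes cannot silently land on the same physical core.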
In conducting these experiments, we also learned some lessons. First, virtualization introduces a small amount of computation overhead – execution times on containers, thin VM, and plain VM are higher by 1–5 percent (Fig. 7c). Second, for best performance, it is crucial to minimize I/O operations unrelated to the application. Even on the physical machine, execution time increased by 3–5 percent due to disk I/O generated by the launcher shell script and its stdout/stderr redirection. The spikes on the physical machine in Fig. 7c are caused by short ssh sessions which simulate the scenarios where users log in to check the job progress. Thus, minimizing unrelated I/O is an important issue for HPC cloud providers.

Table 4 also shows the positive impact of enabling hypervisor-level affinity, application-level affinity, or both (dualAFF). Significant benefits are obtained for the thin-VM dualAFF case compared to the case with no affinity.

TABLE 4
Impact of Virtualization and CPU Affinity Settings on NAMD's Performance (Execution Time per Step in Seconds)

Cores | bare  | container | plainVM | thinVM | bare-appAFF | container-appAFF | plainVM-hyperAFF | plainVM-dualAFF | plainVM-appAFF | thinVM-hyperAFF | thinVM-dualAFF | thinVM-appAFF
1     | 1.479 | 1.473     | 1.590   | 1.586  | 1.460       | 1.477            | 1.584            | 1.486           | 1.500          | 1.630           | 1.490          | 1.586
2     | 0.744 | 0.756     | 0.823   | 0.823  | 0.755       | 0.752            | 0.823            | 0.785           | 0.789          | 0.859           | 0.854          | 0.823
4     | 0.385 | 0.388     | 0.428   | 0.469  | 0.388       | 0.385            | 0.469            | 0.422           | 0.429          | 0.450           | 0.449          | 0.469
8     | 0.230 | 0.208     | 0.231   | 0.355  | 0.202       | 0.203            | 0.355            | 0.226           | 0.228          | 0.354           | 0.244          | 0.355
16    | 0.259 | 0.267     | 0.206   | 0.160  | 0.168       | 0.197            | 0.227            | 0.166           | 0.189          | 0.186           | 0.122          | 0.160
32    | 0.115 | 0.140     | 0.174   | 0.108  | 0.079       | 0.082            | 0.164            | 0.141           | 0.154          | 0.106           | 0.079          | 0.108
64    | 0.088 | 0.116     | 0.166   | 0.090  | 0.079       | 0.071            | 0.150            | 0.184           | 0.195          | 0.089           | 0.066          | 0.090
128   | 0.067 | 0.088     | 0.145   | 0.077  | 0.062       | 0.056            | 0.128            | 0.154           | 0.166          | 0.074           | 0.051          | 0.077

6.3 Network Link Aggregation
Even though network virtualization cannot improve network performance, an approach to reduce network latency using commodity Ethernet hardware is to implement link aggregation and a better network topology. Experiments from [24] show that using four to six aggregated Ethernet links in a torus topology can provide up to 650 percent improvement in overall HPC performance. This would allow cloud infrastructure using commodity hardware to improve raw network performance. Software defined networking (SDN) based on open standards such as OpenFlow, or similar concepts embedded in the cloud software stack, can be used to orchestrate the link aggregation and VLAN isolation necessary to achieve such complex network topologies on an on-demand basis. The use of SDN for controlling link aggregation is applicable to both bare-metal and virtualized compute instances. However, in a virtualized environment, SDN can be integrated into network virtualization to provide link aggregation to VMs transparently.

7 HPC ECONOMICS IN THE CLOUD
There are several reasons why many commercial and web applications are migrating to public clouds from fully owned resources or private clouds: variable usage in time resulting in lower utilization, trading capital expenditure (CAPEX) for operating expenditure (OPEX), and the shift towards a delivery model of Software as a Service. These arguments apply both to cloud providers and cloud users. Cloud users benefit from running in the cloud when their applications fit the profile we described, e.g., variable utilization. Cloud providers benefit if the aggregated resource utilization of all their tenants can sustain a profitable pricing model when compared to the substantial upfront investments required to offer computing and storage resources through a cloud interface.

Why not cloud for HPC: HPC is, however, quite different from typical web and service-based applications. (1) Utilization of the computing resources is typically quite high on HPC systems. This conflicts with the desirable property of low average utilization that makes the cloud business model viable. (2) Clouds achieve improved utilization through consolidation enabled by virtualization—a foundational technology for the cloud. However, as evident from our analysis, the overhead and noise caused by virtualization and multi-tenancy can significantly affect HPC applications' performance and scalability. For a cloud provider, that means that the multi-tenancy opportunities are limited and the pricing has to be increased to be able to profitably rent a dedicated computing resource to a single tenant. (3) Many HPC applications rely on optimized interconnect hardware to attain their best performance, as shown by our experimental evaluation. This is in contrast with the commodity Ethernet network (1 Gbps today, moving to 10 Gbps) typically deployed in most cloud infrastructures to keep costs small. When networking performance is important, we quickly reach diminishing returns when scaling out a cloud deployment to meet a certain performance target. If too many VMs are required to meet performance, the cloud deployment quickly becomes uneconomical. (4) The CAPEX/OPEX argument is less clear for HPC users. Publicly funded supercomputing centers typically have CAPEX in the form of grants, and OPEX budgets may actually be tighter and almost fully consumed by the support and administration of the supercomputer, with little headroom for cloud bursting. (5) Software-as-a-Service offerings are also rare in HPC to date.

Why cloud for HPC: So, what are the conditions that can make HPC in the cloud a viable business model for both HPC users and cloud providers? Unlike large supercomputing centers, HPC users in small-medium enterprises are much more sensitive to the CAPEX/OPEX argument. These include startups with nascent HPC requirements (e.g., simulation or modeling) and small-medium enterprises with a growing business and an existing HPC infrastructure. Both of them may prefer the pay-as-you-go approach in clouds versus establishing/growing on-premise resources in volatile markets. Moreover, the ability to leverage a large variety of heterogeneous architectures in clouds can result in better utilization at global scale, compared to the limited choices available in any individual organization. Running applications on the most economical architecture while meeting the performance needs can result in savings for consumers.

7.1 Quantifiable Analysis
To illustrate a few possible HPC-in-the-cloud scenarios, we collected and compared cost and price data of supercomputer installations and typical cloud offerings. Based on our survey of cloud prices, known financial situations of cloud operators, published supercomputing costs, and a variety of internal and external data sources [25], we estimate that a cost ratio between 2x and 3x is a reasonable approximate range capturing the differences between a cloud deployment and on-premise supercomputing resources today. In our terminology, 2x indicates the case where one supercomputer core-hour is twice as expensive as one cloud core-hour. Since these values can fluctuate, we expand the range to [1x–5x] to capture different future, possibly unforeseen, scenarios.

Using the performance evaluations for different applications (Fig. 1), we calculated the cost differences of running an application in the public cloud versus running it on a dedicated supercomputer (Ranger), assuming different per core-hour cost ratios from 1x to 5x. Fig. 8 shows the cost differences for three applications, where values > 1 indicate savings of running in the cloud and values < 1 an advantage of running on a dedicated supercomputer. We can see that for each application there is a scale, in terms of the number of cores, up to which it is more cost-effective to execute in the cloud versus on a supercomputer. For example, for Sweep3D, NAMD, and ChaNGa, this scale is higher than 4, 8, and 16 cores respectively. This break-even point is a function of the application scalability and the cost ratio. However, our observation is that there is little sensitivity to the cost ratio and it is relatively straightforward to determine the break-even point. This is true even for a cost ratio of 1, which might be an artifact of the slower processors of Ranger versus the newer and faster processors in the cloud.

Fig. 8. Cost ratio of running in cloud and a dedicated supercomputer for different scale (cores) and cost ratios (1x–5x). Ratio > 1 implies savings of running in the cloud, < 1 favors supercomputer execution.
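The quantity plotted in Fig. 8 can be written out explicitly; the normalization below is our reading of the figure, stated here for concreteness. With P cores, per core-hour prices c_sc and c_cloud, cost ratio r = c_sc / c_cloud (the 1x–5x parameter), and measured execution times T_sc(P) and T_cloud(P), the cost of a run is price × P × time on either platform, so the supercomputer-to-cloud cost ratio is

cost_sc / cost_cloud = (c_sc · P · T_sc(P)) / (c_cloud · P · T_cloud(P)) = r · T_sc(P) / T_cloud(P).

The cloud is the cheaper choice as long as the application's cloud slowdown T_cloud(P)/T_sc(P) stays below r; the break-even core count is where the slowdown curve crosses r. For applications whose slowdown grows steeply with P, that crossing point moves only slightly as r varies, which matches the observation that the break-even point is largely insensitive to the cost ratio.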
7.2 Qualitative Discussion
Next, we address a few economic topics on HPC in cloud. We have primarily compared supercomputers versus HPC in clouds. However, there are other alternatives that we discuss in Section 8, such as bursting out to cloud. Yet another alternative is outsourcing the supercomputer. In many ways, we consider the latter similar to HPC in the cloud, with the exception of the way of use. Supercomputers are batch oriented while clouds offer dedicated use, at least at the virtualization level. There is also no reason why someone would not put a whole supercomputer behind cloud interfaces and make it available on demand. Hence, these are variations of the key cases discussed in the paper. At the same time, current supercomputers are almost fully utilized, so there is little incentive to benefit from on-demand use of supercomputers, as compared to departmental-level servers, e.g. in computer aided design (CAD) or computer aided engineering (CAE), which can largely benefit from improved utilization.

In addition, cloud providers can offer the most recent equipment. Because they will share it among many customers, they can amortize the high cost more easily than any single customer. This equipment can be used for exploration or in a production manner for early adopters. Movie rendering is a classic case of a cloud HPC (compute-intensive) application; the most recent case is the post-production of the film "Life of Pi". Movie companies can always use the most recent equipment in the cloud and eventually acquire the equipment that benefits them.

In this paper, we have not discussed accelerators, such as GPUs, which are becoming important for HPC and compute-intensive applications. Because we have not conducted any experiments with GPUs, we cannot elaborate with any substance on the implications of the use of GPUs in the cloud. However, there is no reason not to treat them the same way as the regular compute instances. For example, Amazon prices them in the similar dedicated-instances class, with a price of $0.715 per hour for a GPU (g2.2xlarge) instance versus $0.462 per hour for a similarly sized (c3.2xlarge) compute instance. The price difference is attributed to the hardware cost, the number of instances offered (many more compute than GPU), and the power consumed.

One additional complication arising from the use of accelerators is that they do not virtualize well. While there is ongoing work making good progress in that direction, like NVIDIA GRID [26], it is still a young area with several unresolved issues. For example, the current sharing model of virtual GPUs is appropriate for concurrent execution of multiple jobs in a dedicated supercomputer, but does not provide the encapsulation, protection, and security support that would make it appropriate in the cloud. Any resource that does not virtualize at fine granularity poses a serious challenge to the cloud adoption model because it forces the cloud provider to adopt a very rigid pricing scheme if the resource cannot be sliced for multiple concurrent users. We believe this is an interesting area for future research. Finally, we would like to conclude that it was non-trivial to do a fair comparison of HPC in the cloud and supercomputers. For clouds, the prices are well documented and public information; however, the costs are undocumented, proprietary, and really a differentiator for each cloud provider. On the contrary, the costs for supercomputers are well documented by their owners while the prices are typically hidden and not publicized due to subsidies and price reductions. That was one of the primary reasons why we used a range of cost ratios in Section 7.1 (Fig. 8). Hence, the economic comparison needs to be taken conservatively.

8 SCHEDULING HPC JOBS IN THE CLOUD
In the previous sections, we provided empirical evidence that applications behave quite differently on different platforms. This observation opens up several opportunities to optimize the mapping and scheduling of HPC jobs to platforms. In our terminology, mapping refers to selecting a platform for a job, and scheduling includes mapping and deciding when to execute the job on the chosen platform. Here, we research techniques to perform intelligent scheduling and evaluate the benefits.

8.1 Problem Definition
Consider the case when the dedicated HPC infrastructure cannot meet peak demands and the provider is considering offloading jobs to an additional available cluster or cloud. The problem can be defined as follows. Given a set of owned platforms with resources having different processor configurations, interconnection networks, and degrees of virtualization, how can we effectively schedule an incoming stream of HPC jobs to these platforms, based on intrinsic application characteristics, job requirements, platform capabilities, and dynamic demand and load fluctuations, to achieve the goals of improved job completion time, makespan, and hence throughput?
8.2 Methodology
To address this problem, we adopt a two-step methodology, shown in Fig. 9: 1) perform a one-time offline benchmarking or analytical modeling of application-to-platform performance sensitivity, and 2) use heuristics to schedule the application stream to the available platforms based on the output of step 1 and current resource availability. In this paper, our focus is on step 2, which translates to an online job scheduling problem with the additional complexity of having to decide which platform a job should be run on, besides the decision regarding which job to execute next. The problem is even more challenging since different applications react differently to different platforms. We will refer to this as Multi-Platform Application-Aware Online Job Scheduling (MPA2OJS).

Fig. 9. Application-aware scheduling on platforms with different processors/VMs and interconnections.

For step 1, we rely on one-time benchmarking of application performance on different platforms to drive the scheduling decisions. In our earlier work, we have shown that in the absence of benchmark data, it is possible to perform application characterization followed by relative performance prediction when considering multiple platforms [27]. Also, other known techniques for performance prediction can be used. These include analytical modeling, simulations, application profiling through sample execution (e.g. the first few iterations) on the actual platform, and interpolation. It is not our intention in this paper to research accurate techniques for parallel performance prediction of complex applications. Our goal is to quantify the benefits of MPA2OJS to develop an understanding and foundation for HPC in cloud, which can promote further research towards additional characterization and scheduling techniques.

Traditional HPC job scheduling algorithms do not consider the presence of multiple platforms. Hence, they are agnostic of the application-to-platform performance sensitivity. In MPA2OJS, the mapping decision can be static or dynamic. Static decisions are independent of the current platform load and made a priori to job scheduling, whereas dynamic decisions are aware of the current resource availability and load and are made when the job is scheduled. With dynamic mapping, the same job can be scheduled to run on different platforms across its multiple executions, depending upon the state of the system when it was scheduled. Hence, MPA2OJS algorithms can be classified as static versus dynamic, or job-characteristics aware versus unaware. Next, we present some heuristics for MPA2OJS.

8.2.1 Static Mapping Heuristics
The analysis in Section 4 showed that the slowdown in cloud versus supercomputer depends on the application under consideration. Also, for the same application, the sensitivity to a platform varies with scale, i.e., with core counts. To visualize the behavior of HPC jobs along these two dimensions, i.e., application type and scale, we used our performance data to generate the map of a job's slowdown when running on the commodity cluster, i.e. Open Cirrus (Fig. 10a), and the private cloud (Fig. 10b) with respect to its execution on the supercomputer (Ranger).

Fig. 10. Effect of application and scale on slowdown. Light: no slowdown, dark (red): more slowdown.

In Fig. 10, each grid cell represents the execution of a particular application (x-axis) at a particular scale (number of cores on the y-axis). Here, light colors (blue and white) represent that the job suffered no slowdown. In some cases, a job attained speedup due to the worse sequential performance of Ranger compared to the other platforms. Dark (reddish) shades represent slowdown. Based on Fig. 10, two possible heuristics for static mapping are:

- ScalePartition: Assign large-scale jobs (say 64-256 cores) to supercomputer, medium-scale (16-32 cores) to cluster, and small-scale (1-8 cores) to cloud, by partitioning the slowdown map along the y-dimension.
- ApplicationPartition: Assign specific applications to specific platforms by partitioning the map along the x-dimension, e.g. IS, NAMD, and LU to supercomputer, ChaNGa and Sweep3D to cluster, and EP, Jacobi, and NQueens to cloud. A variation of ApplicationPartition can be to use finer application characteristics such as message count and volume for partitioning (Table 3, Section 4).

Other examples of static policies include scheduling all jobs to a supercomputer (SCOnly), to a cluster (ClusterOnly), or to a cloud (CloudOnly).

8.2.2 Dynamic Mapping Heuristics
The motivation for dynamic selection of a platform for a job is to perform resource-availability-driven scheduling. Some such heuristics that we explored are:

- MostFreeFirst: Assign the job to the least loaded platform.
- RoundRobin: Assign jobs to platforms round-robin.
- BestFirst: Assign the current job to the platform with the best available resources, e.g. in the order supercomputer, cluster, and cloud.
- Adaptive: Assign the job to the platform with the largest Effective Job-specific Supply (EJS); a sketch follows below.

EJS is defined to capture both the current resource availability and a job's suitability to a particular platform. We define a platform's EJS for a particular job as the product of the free cores on that platform and the job's normalized performance obtained on that platform. The intuition is that the core-hours taken for a job to complete on a platform are directly proportional to the slowdown it suffers on that platform compared to the supercomputer. Hence, the Adaptive heuristic optimizes along two dimensions: it balances load across multiple platforms and it matches application characteristics to platforms. In contrast, the first three dynamic heuristics are application-agnostic. Table 5 classifies our heuristics into static versus dynamic, and application-aware versus application-agnostic.

TABLE 5
Classification of Heuristics for MPA2OJS

Heuristics                           | Dynamic | App-aware
SCOnly, ClusterOnly, CloudOnly       | –       | –
ScalePartition, ApplicationPartition | –       | Yes
RoundRobin, BestFirst, MostFreeFirst | Yes     | –
Adaptive                             | Yes     | Yes
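To make the Adaptive rule concrete, the sketch below computes EJS for each platform and maps the job to the maximum. It is an illustrative standalone C++ rendering of the rule described above, not our CloudSim broker: the normalized-performance table stands in for the offline benchmarking data of step 1, the example numbers are hypothetical, and skipping platforms that cannot start the job immediately is a simplification we chose for the sketch.

// adaptive_ejs.cpp: pick the platform with the largest Effective Job-specific
// Supply (EJS = free cores x normalized performance of this job on the platform).
// Illustrative sketch of the Adaptive heuristic, not the CloudSim implementation.
#include <cstdio>
#include <string>
#include <vector>

struct Platform {
  std::string name;
  int freeCores;   // current availability
};

struct Job {
  std::string app;
  int cores;       // requested cores P
  // normalizedPerf[i]: performance of (app, P) on platform i relative to the
  // supercomputer (1.0 = no slowdown), taken from offline benchmarking.
  std::vector<double> normalizedPerf;
};

// Returns the index of the chosen platform, or -1 if none can host the job yet.
int adaptiveMap(const Job& job, const std::vector<Platform>& platforms) {
  int best = -1;
  double bestEJS = 0.0;
  for (size_t i = 0; i < platforms.size(); i++) {
    if (platforms[i].freeCores < job.cores) continue;   // cannot start now (our simplification)
    double ejs = platforms[i].freeCores * job.normalizedPerf[i];
    if (ejs > bestEJS) { bestEJS = ejs; best = (int)i; }
  }
  return best;   // caller queues the job (FCFS) if -1
}

int main() {
  std::vector<Platform> platforms = { {"supercomputer", 64}, {"cluster", 192}, {"cloud", 256} };
  // Hypothetical numbers: IS at 32 cores runs far slower off the supercomputer,
  // while EP is largely insensitive to the platform.
  Job is32 = {"IS", 32, {1.0, 0.05, 0.0025}};
  Job ep32 = {"EP", 32, {1.0, 1.1, 1.05}};
  int p1 = adaptiveMap(is32, platforms);
  int p2 = adaptiveMap(ep32, platforms);
  printf("IS/32 -> %s\n", p1 >= 0 ? platforms[p1].name.c_str() : "wait");
  printf("EP/32 -> %s\n", p2 >= 0 ? platforms[p2].name.c_str() : "wait");
  return 0;
}

With these example numbers, IS at 32 cores maps to the supercomputer while EP maps to the cloud, illustrating how the heuristic couples availability with application sensitivity. In the actual simulation, this decision is taken by the datacenter broker each time a job arrives or a running job completes.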
GUPTA ET AL.: EVALUATING AND IMPROVING THE PERFORMANCE AND SCHEDULING OF HPC APPLICATIONS IN CLOUD 317

TABLE 5 8.3.2 Results: Makespan and Throughput


Classification of Heuristics for MPA2 OJS
Using simulation we found that most application-agnostic
Heuristics Dynamic App-aware strategies of Table 5, specifically ClusterOnly, CloudOnly,
RoundRobin, and MostFreeFirst, performed very poorly
SCOnly, ClusterOnly, CloudOnly
ScalePartition, ApplicationPartition Yes
compared to other heuristics. This is attributed to the tre-
RoundRobin, BestFirst, MostFreeFirst Yes mendous slowdown that some application suffer when run-
Adaptive Yes Yes ning on cloud versus their execution on supercomputer
(e.g. up to 400X for IS). Hence, in this paper, we present and
analyze results of the remaining five heuristics, which yield
purpose, a one-to-one mapping of cloudlets to VMs is suffi-
more reasonable solutions.
cient but we needed to provide dynamic VM creation and
Fig. 11 compares the makespan (total completion time)
termination. Also, we extended existing VM allocation pol-
and throughput for different heuristics under varying sys-
icy in CloudSim to enable first come first serve (FCFS)
tem load. It is evident from Fig. 11 that Adaptive heuristic
scheduling of HPC jobs to resources.
outperforms the rest when the system is reasonably loaded
The scheduling of cloudlets is performed by the datacen-
(medium and high load). Moreover, the benefits increase as
ter broker. Hence, we created a new datacenter broker to
the system load increases. Adaptive heuristic results in
perform MPA2 OJS. We introduced a periodic event in
around 1.05X, 1.6X, and 1.8X improvement in makespan at
CloudSim, which checks for new job arrivals. The scheduler
low, medium, and high load respectively compared to
is triggered when a new job arrives or when a running job
SCOnly, that is running all applications on supercomputer.
completes. Based on the current state of available datacen-
Similarly, there is significant improvement in throughput
ters and the scheduling heuristic, new jobs are assigned to a
(number of completed jobs per second). For instance, after 1
specific datacenter queue. Internally within a datacenter,
hour of execution (3,600 s), Adaptive strategy attains 1.25X,
FCFS policy is honored.
4X, and 6X better throughput compared to SCOnly under
different system loads. The benefits are even higher com-
TABLE 5
Classification of Heuristics for MPA2 OJS

Heuristics                               Dynamic    App-aware
SCOnly, ClusterOnly, CloudOnly           -          -
ScalePartition, ApplicationPartition     -          Yes
RoundRobin, BestFirst, MostFreeFirst     Yes        -
Adaptive                                 Yes        Yes

8.3.2 Results: Makespan and Throughput
Using simulation, we found that most application-agnostic strategies of Table 5, specifically ClusterOnly, CloudOnly, RoundRobin, and MostFreeFirst, performed very poorly compared to the other heuristics. This is attributed to the tremendous slowdown that some applications suffer when running on cloud versus their execution on the supercomputer (e.g., up to 400X for IS). Hence, in this paper, we present and analyze results for the remaining five heuristics, which yield more reasonable solutions.
Fig. 11 compares the makespan (total completion time) and throughput of the different heuristics under varying system load. It is evident from Fig. 11 that the Adaptive heuristic outperforms the rest when the system is reasonably loaded (medium and high load). Moreover, the benefits increase as the system load increases. The Adaptive heuristic results in around 1.05X, 1.6X, and 1.8X improvement in makespan at low, medium, and high load, respectively, compared to SCOnly, i.e., running all applications on the supercomputer. Similarly, there is significant improvement in throughput (number of completed jobs per second). For instance, after one hour of execution (3,600 s), the Adaptive strategy attains 1.25X, 4X, and 6X better throughput than SCOnly under the different system loads. The benefits are even higher compared to other application-agnostic strategies, such as RoundRobin, as mentioned earlier. AppPartition performs well under low and medium load but yields poor results under high load.
For a better understanding of the reasons for the benefits and the sensitivity to load, we measured various other metrics. A potential cause of the benefits is the improvement in average response time (job's start time minus arrival time). Adaptive, ScalePartition, and AppPartition achieved the most benefit in terms of response time. In some cases, AppPartition (at low and medium load) or ScalePartition (at low and high load) achieves even better response time than Adaptive. However, from Fig. 11, we saw that overall Adaptive performed significantly better at medium and high load. This is because Adaptive performs the best in terms of average runtime in all three cases (loads), since ScalePartition and AppPartition are static mapping schemes.
Static heuristics cannot dynamically change the mapping of a job even if a better platform is available. Hence, high-end resources may be left unused waiting for a matching application to arrive. Also, on further investigation, we learned that 1D characterization may not be sufficient, since it can still result in some suboptimal mappings; e.g., ScalePartition maps IS at 32 cores to the cluster even if the supercomputer is free.
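For concreteness, the metrics discussed above (makespan, throughput within a given horizon, and average response time) can be computed from completed-job records as in the following sketch; the CompletedJob type and its fields are assumptions for illustration.

import java.util.List;

// Sketch of how the metrics discussed above can be computed from completed-job
// records. The CompletedJob type and its fields are illustrative.
public class ScheduleMetrics {

    public static class CompletedJob {
        double arrivalTime, startTime, finishTime; // all in seconds
    }

    // Makespan: completion time of the last job.
    public static double makespan(List<CompletedJob> jobs) {
        return jobs.stream().mapToDouble(j -> j.finishTime).max().orElse(0.0);
    }

    // Throughput: jobs completed per second within the first 'horizon' seconds.
    public static double throughput(List<CompletedJob> jobs, double horizon) {
        long done = jobs.stream().filter(j -> j.finishTime <= horizon).count();
        return done / horizon;
    }

    // Average response time: start time minus arrival time, averaged over jobs.
    public static double avgResponseTime(List<CompletedJob> jobs) {
        return jobs.stream().mapToDouble(j -> j.startTime - j.arrivalTime).average().orElse(0.0);
    }
}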

Fig. 11. The Adaptive heuristic significantly improves makespan and throughput when the system is reasonably loaded.

TABLE 6
Findings and Our Approach to Address the Research Questions on HPC in Cloud

Question   Answers
Who        (1) Small and medium scale organizations, startups, or growing businesses, which can benefit from the pay-as-you-go model.
           (2) Users with applications that achieve the best performance/cost ratio in cloud versus other platforms.
What       (1) Applications with less-intensive communication patterns and less sensitivity to interference.
           (2) Applications whose performance needs can be met at small to medium scale execution (in terms of number of cores).
Why        (1) Small-medium enterprises benefit from the pay-as-you-go model since they are highly sensitive to the CAPEX/OPEX argument.
           (2) Clouds enable multiple organizations to access a large variety of shared architectures, leading to improved utilization.
How        (1) Technical approaches: (a) making HPC cloud-aware, e.g., tuning computational granularity and problem sizes, and (b) making clouds HPC-aware, e.g., providing lightweight virtualization and enabling CPU affinity.
           (2) Business models: hybrid supercomputer-cloud approach with application-aware scheduling and cloud bursting.

To benefit from multiple platforms, we need to (a) consider both the application characteristics and the scale at which the application will be run, and (b) dynamically adapt to the platform loads. The Adaptive heuristic meets both goals.
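A minimal sketch of this idea is shown below: for each arriving job, estimate its completion time on every platform as the platform's pending work plus the job's runtime scaled by the normalized performance of (AppName, P) on that platform, and pick the minimum. The cost model and all names are illustrative assumptions rather than the exact Adaptive heuristic implementation.

import java.util.*;

// Minimal sketch of an application- and scale-aware adaptive mapping. The cost
// model (estimated wait plus slowdown-scaled runtime) and all names are
// illustrative assumptions, not the exact heuristic used in this paper.
public class AdaptiveMapper {

    // normalizedPerf.get(platform).get(appName@cores) = speed relative to the supercomputer
    private final Map<String, Map<String, Double>> normalizedPerf;
    private final Map<String, Double> queuedWork = new HashMap<>(); // pending seconds per platform

    public AdaptiveMapper(Map<String, Map<String, Double>> normalizedPerf) {
        this.normalizedPerf = normalizedPerf;
        for (String platform : normalizedPerf.keySet()) queuedWork.put(platform, 0.0);
    }

    // Choose the platform with the lowest estimated completion time for this job.
    // Assumes at least one platform is configured.
    public String map(String appName, int cores, double baseRuntime) {
        String best = null;
        double bestEta = Double.MAX_VALUE;
        for (String platform : normalizedPerf.keySet()) {
            double perf = normalizedPerf.get(platform).getOrDefault(appName + "@" + cores, 1.0);
            double eta = queuedWork.get(platform) + baseRuntime / perf; // wait + scaled runtime
            if (eta < bestEta) { bestEta = eta; best = platform; }
        }
        queuedWork.put(best, bestEta); // account for the newly placed job
        return best;
    }
}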
9 RELATED WORK
In this section, we summarize the related research on HPC in cloud, including performance evaluation studies.

9.1 Performance and Cost Studies of HPC on Cloud
Walker [5], followed by several others [1], [2], [3], [4], [6], [8], [9], [27], [30], conducted studies on HPC in cloud using benchmarks such as NPB and real applications. Their conclusions can be summarized as follows:
- Primary challenges for HPC in cloud are insufficient network and I/O performance in cloud, resource heterogeneity, and unpredictable interference arising from other VMs [1], [2], [3], [4].
- Bringing cost into the equation results in interesting trade-offs; execution on clouds may be more economical for some HPC applications compared to supercomputers [4], [27], [31], [32].
- For large-scale HPC or for centers with a large user base, cloud cannot compete with supercomputers based on the metric $/GFLOPS [1], [9].
In this paper, we explored similar questions from the perspective of smaller-scale HPC users, such as small companies and research groups who have limited access to supercomputer resources and varying demand over time. We also considered the perspective of cloud providers who want to expand their offerings to cover the aggregate of these smaller-scale HPC users.
Furthermore, our work explored additional dimensions: (1) with a holistic viewpoint, we considered all the different aspects of running in cloud, namely performance, cost, and business models, and (2) we explored techniques for bridging the gap between HPC and clouds. We improved HPC performance in cloud by (a) improving the execution time of HPC in cloud and (b) improving the turnaround time with intelligent scheduling in cloud.

9.2 Bridging the Gap between HPC and Cloud
The approaches towards reducing the gap between traditional cloud offerings and HPC demands can be classified into two categories: (1) those which make clouds HPC-aware, and (2) those which make HPC cloud-aware. In this paper, we presented techniques for both. For (1), we explored techniques in low-overhead virtualization and quantified how close we can get to a physical machine's performance for HPC workloads. There are other recent efforts on HPC-optimized hypervisors [33], [34]. Other examples of (1) include HPC-optimized clouds such as Amazon Cluster Compute [7] and DoE's Magellan [1], as well as hardware- and HPC-aware cloud schedulers (VM placement algorithms) [35], [36].
The latter approach (2) has been relatively less explored but has shown tremendous promise. Cloud-aware load balancers for HPC applications [37] and topology-aware deployment of scientific applications in cloud [38] have shown encouraging results. In this paper, we demonstrated how we can tune the HPC runtime and applications to clouds to achieve improved performance.

9.3 HPC Characterization, Mapping, and Scheduling
There are several tools for scheduling HPC jobs on clusters, such as ALPS, OpenPBS, SLURM, TORQUE, and Condor. They are all job schedulers or resource management systems which aim to utilize system resources in an efficient manner. They differ from our work since we perform application-aware scheduling and provide a solution for the multi-platform case. The GrADS project [39] addressed the problem of scheduling, monitoring, and adapting applications to heterogeneous and dynamic grid environments. Our focus is on clouds, and hence we address additional challenges such as virtualization, cost, and pricing models for HPC in cloud.
Kim et al. [40] presented three usage models for hybrid HPC grid and cloud computing: acceleration, conservation, and resilience. However, they use cloud for sequential tasks and do not consider execution of parallel applications. Inspired by that work, we evaluate models for HPC-clouds: substitute, complement, and burst.

10 CONCLUSIONS, LESSONS, FUTURE WORK

Through a performance, economic, and scheduling analysis of HPC applications on a range of platforms, we have shown that different applications exhibit different characteristics that determine their suitability towards a cloud environment. Table 6 presents our conclusions. Next, we summarize the lessons learned from this research and the emerging future research directions.
Clouds can successfully complement supercomputers, but using clouds to substitute supercomputers is infeasible. Bursting to cloud is also promising. We have shown that by performing multi-platform dynamic application-aware scheduling, a hybrid cloud-supercomputer platform environment can actually outperform its individual constituents. By using an underutilized resource which is "good enough" to get the job done sooner, it is possible to get better turnaround time for a job (user perspective) and improved throughput (provider perspective). Another potential model for HPC in cloud is to use cloud only when there is high demand (cloud burst). Our evaluation showed that application-agnostic cloud bursting (e.g., the BestFirst heuristic) is unrewarding, but application-aware bursting is a promising research direction. More work is needed to consider other factors in multi-platform scheduling: job quality of service (QoS) contracts, deadlines, priorities, and security. Also, future research is required on cloud pricing in multi-platform environments. Market mechanisms and equilibrium factors in game theory can help automate such decisions.
For efficient HPC in cloud, HPC needs to be cloud-aware and clouds need to be HPC-aware. HPC applications and runtimes must adapt to minimize the impact of slow networks, heterogeneity, and multi-tenancy in clouds. Simultaneously, clouds should minimize overheads for HPC using techniques such as lightweight virtualization and link aggregation with HPC-optimized network topologies. With low-overhead virtualization, web-oriented cloud infrastructure can be reused for HPC. We envisage hybrid clouds that support both HPC and commercial workloads through tuning or VM re-provisioning.
Application characterization for analysis of the performance-cost tradeoffs of complex HPC applications is a non-trivial task, but the economic benefits are substantial. More research is necessary to quickly identify important traits for complex applications with dynamic and irregular communication patterns. A future direction is to evaluate and characterize applications with irregular parallelism [41] and dynamic datasets. For example, challenging data sets arise from 4D CT imaging, 3D moving meshes, and computational fluid dynamics (CFD). The dynamic and irregular nature of such applications makes their characterization even more challenging compared to the regular iterative scientific applications considered in this paper. However, their asynchronous nature, i.e., lack of fine-grained barrier synchronizations, makes them promising candidates for heterogeneous and multi-tenant clouds.
REFERENCES
[1] K. Yelick, S. Coghlan, B. Draney, R. S. Canon, L. Ramakrishnan, A. Scovel, I. Sakrejda, A. Liu, S. Campbell, P. T. Zbiegiel, T. Declerck, and P. Rich, "The Magellan report on cloud computing for science," U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research (ASCR), Dec. 2011.
[2] P. Mehrotra, J. Djomehri, S. Heistand, R. Hood, H. Jin, A. Lazanoff, S. Saini, and R. Biswas, "Performance evaluation of Amazon EC2 for NASA HPC applications," in Proc. 3rd Workshop Scientific Cloud Comput., 2012, pp. 41-50.
[3] C. Evangelinos and C. N. Hill, "Cloud computing for parallel scientific HPC applications: Feasibility of running coupled atmosphere-ocean climate models on Amazon's EC2," in Proc. IEEE Cloud Comput. Appl., Oct. 2008, pp. 2-34.
[4] A. Gupta and D. Milojicic, "Evaluation of HPC applications on cloud," in Proc. Open Cirrus Summit (Best Student Paper), Atlanta, GA, USA, Oct. 2011, pp. 22-26.
[5] E. Walker, "Benchmarking Amazon EC2 for high-performance scientific computing," LOGIN, vol. 33, pp. 18-23, 2008.
[6] A. Gupta, L. V. Kale, D. S. Milojicic, P. Faraboschi, R. Kaufmann, V. March, F. Gioachin, C. H. Suen, and B.-S. Lee, "The who, what, why, and how of HPC applications in the cloud," in Proc. 5th IEEE Int. Conf. Cloud Comput. Technol. Sci. (Best Paper), 2013, pp. 306-314.
[7] High Performance Computing (HPC) on AWS. [Online]. Available: http://aws.amazon.com/hpc-applications
[8] A. Iosup, S. Ostermann, M. N. Yigitbasi, R. Prodan, T. Fahringer, and D. H. J. Epema, "Performance analysis of cloud computing services for many-tasks scientific computing," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 6, pp. 931-945, Jun. 2011.
[9] J. Napper and P. Bientinesi, "Can cloud computing reach the Top500?" in Proc. Combined Workshops on UnConventional High Performance Computing Workshop Plus Memory Access Workshop, New York, NY, USA: ACM, 2009.
[10] Ranger User Guide. [Online]. Available: http://services.tacc.utexas.edu/index.php/ranger-user-guide
[11] A. I. Avetisyan, R. Campbell, I. Gupta, M. T. Heath, S. Y. Ko, G. R. Ganger, M. A. Kozuch, D. O'Hallaron, M. Kunze, T. T. Kwan, et al., "Open Cirrus: A global cloud computing testbed," Computer, vol. 43, no. 4, pp. 35-43, Apr. 2010.
[12] D. Nurmi, R. Wolski, C. Grzegorczyk, G. Obertelli, S. Soman, L. Youseff, and D. Zagorodnov, "The Eucalyptus open-source cloud-computing system," in Proc. 9th IEEE/ACM Int. Symp. Cluster Comput. Grid (CCGRID '09), 2009, pp. 124-131.
[13] A. Kivity, Y. Kamay, D. Laor, U. Lublin, and A. Liguori, "kvm: The Linux virtual machine monitor," in Proc. Linux Symp., vol. 1, 2007, pp. 225-230.
[14] A. J. Younge, R. Henschel, J. T. Brown, G. von Laszewski, J. Qiu, and G. C. Fox, "Analysis of virtualization technologies for high performance computing environments," in Proc. IEEE Int. Conf. Cloud Comput., 2011, pp. 9-16.
[15] D. Schauer et al., Linux Containers, version 0.7.0, Jun. 2010. [Online]. Available: http://lxc.sourceforge.net/
[16] Intel Corporation, "Intel(R) Virtualization Technology for Directed I/O," Tech. Rep. D51397-006, Feb. 2011. [Online]. Available: http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf
[17] NPB. [Online]. Available: http://nas.nasa.gov/publications/npb.html
[18] "MPI: A message passing interface standard," MPI Forum, 1994.
[19] L. Kale and S. Krishnan, "CHARM++: A portable concurrent object oriented system based on C++," in Proc. 8th Annu. Conf. Object-Oriented Program. Syst., Languages, Appl., 1993, pp. 91-108.
[20] A. Bhatele, S. Kumar, C. Mei, J. C. Phillips, G. Zheng, and L. V. Kale, "Overcoming scaling challenges in biomolecular simulations across multiple platforms," in Proc. IEEE Int. Symp. Parallel Distrib. Process., 2008, pp. 1-12.
[21] P. Jetley, F. Gioachin, C. Mendes, L. V. Kale, and T. R. Quinn, "Massively parallel cosmological simulations with ChaNGa," in Proc. IEEE Int. Symp. Parallel Distrib. Process., 2008, pp. 1-12.
[22] The ASCI Sweep3D code. [Online]. Available: http://wwwc3.lanl.gov/pal/software/sweep3d
[23] O. Zaki, E. Lusk, W. Gropp, and D. Swider, "Toward scalable performance visualization with Jumpshot," Int. J. High Perform. Comput. Appl., vol. 13, no. 3, pp. 277-288, Fall 1999.
[24] T. Watanabe, M. Nakao, T. Hiroyasu, T. Otsuka, and M. Koibuchi, "Impact of topology and link aggregation on a PC cluster with Ethernet," in Proc. IEEE CLUSTER, Sep./Oct. 2008, pp. 280-285.
[25] C. Bischof, D. an Mey, and C. Iwainsky, "Brainware for green HPC," Computer Science - Research and Development, pp. 1-7, 2011. [Online]. Available: http://dx.doi.org/10.1007/s00450-011-0198-5
[26] NVIDIA GRID. [Online]. Available: http://nvidia.com/object/virtual-gpus.html


[27] A. Gupta, L. V. Kale, D. S. Milojicic, P. Faraboschi, R. Kaufmann, V. March, F. Gioachin, C. H. Suen, and B.-S. Lee, "Exploring the performance and mapping of HPC applications to platforms in the cloud," in Proc. 21st Int. Symp. High-Perform. Parallel Distrib. Comput., 2012, pp. 121-122.
[28] R. N. Calheiros, R. Ranjan, A. Beloglazov, C. A. F. De Rose, and R. Buyya, "CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms," Softw. Pract. Exper., vol. 41, no. 1, pp. 23-50, Jan. 2011.
[29] Parallel Workloads Archive. [Online]. Available: http://www.cs.huji.ac.il/labs/parallel/workload/
[30] J. Ekanayake, X. Qiu, T. Gunarathne, S. Beason, and G. C. Fox, "High performance parallel computing with clouds and cloud technologies," in Cloud Computing. New York, NY, USA: Springer, 2010.
[31] E. Roloff, M. Diener, A. Carissimi, and P. Navaux, "High performance computing in the cloud: Deployment, performance and cost efficiency," in Proc. CloudCom, 2012, pp. 371-378.
[32] A. Marathe, R. Harris, D. K. Lowenthal, B. R. de Supinski, B. Rountree, M. Schulz, and X. Yuan, "A comparative study of high-performance computing on the cloud," in Proc. 21st Int. Symp. High-Perform. Parallel Distrib. Comput., 2013, pp. 239-250.
[33] J. Lange, K. Pedretti, T. Hudson, P. Dinda, Z. Cui, L. Xia, P. Bridges, A. Gocke, S. Jaconette, M. Levenhagen, and R. Brightwell, "Palacios and Kitten: New high performance operating systems for scalable virtualized and native supercomputing," in Proc. IEEE Int. Symp. Parallel Distrib. Process., 2010, pp. 1-12.
[34] B. Kocoloski, J. Ouyang, and J. Lange, "A case for dual stack virtualization: Consolidating HPC and commodity applications in the cloud," in Proc. 3rd ACM Symp. Cloud Comput., New York, NY, USA, 2012, pp. 23:1-23:7.
[35] Heterogeneous Architecture Scheduler. [Online]. Available: http://wiki.openstack.org/HeterogeneousArchitectureScheduler
[36] A. Gupta, L. Kale, D. Milojicic, P. Faraboschi, and S. Balle, "HPC-aware VM placement in infrastructure clouds," in Proc. IEEE Int. Conf. Cloud Eng., Mar. 2013, pp. 11-20.
[37] A. Gupta, O. Sarood, L. Kale, and D. Milojicic, "Improving HPC application performance in cloud through dynamic load balancing," in Proc. 13th IEEE/ACM Int. Symp. Cluster, Cloud, Grid Comput., 2013, pp. 402-409.
[38] P. Fan, Z. Chen, J. Wang, Z. Zheng, and M. R. Lyu, "Topology-aware deployment of scientific applications in cloud computing," in Proc. IEEE 5th Int. Conf. Cloud Comput., 2012, pp. 319-326.
[39] F. Berman, A. Chien, K. Cooper, J. Dongarra, I. Foster, D. Gannon, L. Johnsson, K. Kennedy, C. Kesselman, J. Mellor-Crummey, et al., "The GrADS project: Software support for high-level grid application development," Int. J. High Perform. Comput. Appl., vol. 15, pp. 327-344, 2001.
[40] H. Kim, Y. el Khamra, I. Rodero, S. Jha, and M. Parashar, "Autonomic management of application workflows on hybrid computing infrastructure," Sci. Program., vol. 19, no. 2, pp. 75-89, Jan. 2011.
[41] M. Kulkarni, M. Burtscher, R. Inkulu, K. Pingali, and C. Casçaval, "How much parallelism is there in irregular applications?" SIGPLAN Not., vol. 44, no. 4, pp. 3-14, Feb. 2009.

Abhishek Gupta received the BTech degree in computer science and engineering from the Indian Institute of Technology (IIT), Roorkee, India, in 2008, and the MS and PhD degrees in computer science from the University of Illinois at Urbana-Champaign (UIUC) in 2011 and 2014, respectively. He is currently a cloud security architect at Intel Corp. His current research interests include parallel programming, HPC, scheduling, and cloud computing.

Paolo Faraboschi received the PhD degree in electrical engineering and computer science from the University of Genoa, Italy. He is currently a distinguished technologist at HP Labs. His current research interests include the intersection of architecture and software. From 2004 to 2009, he led the HPL research activity on system-level simulation. From 1995 to 2003, he was the principal architect of the Lx/ST200 family of VLIW cores. He is a fellow of the IEEE and an active member of the computer architecture community.

Filippo Gioachin received the Laurea degree in computer science and engineering from the University of Padova, and the PhD degree in computer science from the University of Illinois at Urbana-Champaign. He is currently a research manager and senior researcher at HP Labs Singapore, where he is contributing to the innovation in cloud computing. His main focus is on integrated online software development.

Laxmikant V. Kale received the BTech degree in electronics engineering from Benares Hindu University, India, in 1977, the ME degree in computer science from the Indian Institute of Science, Bangalore, India, in 1979, and the PhD degree in computer science from the State University of New York, Stony Brook, in 1985. He is currently a full professor at the University of Illinois at Urbana-Champaign. His current research interests include parallel computing. He is a fellow of the IEEE.

Richard Kaufmann received the BA degree from the University of California at San Diego, in 1978, where he was a member of the UCSD Pascal Project. He is currently the VP of the Cloud Lab at Samsung Data Systems. Previously, he was chief technologist of HP's Cloud Services Group and HP's cloud and high-performance computing server groups.

Bu Sung Lee received the BSc (Hons.) and PhD degrees from the Department of Electrical and Electronics, Loughborough University of Technology, United Kingdom, in 1982 and 1987, respectively. He is currently an associate professor with the School of Computer Engineering, Nanyang Technological University. He held a joint position as the director (Research) of HP Labs Singapore from 2010 to 2012. His current research interests include mobile and pervasive networks, distributed systems, and cloud computing.

Verdi March received the BSc degree from the Faculty of Computer Science, University of Indonesia, in 2000, and the PhD degree from the Department of Computer Science, National University of Singapore, in 2007. He is currently a lead research scientist with Visa Labs. Prior to working with Visa Labs, he held various research or engineering positions with HP Labs, Sun Microsystems Inc., and the National University of Singapore.


Dejan Milojicic received the BSc and MSc degrees from Belgrade University, Belgrade, Serbia, in 1983 and 1986, respectively, and the PhD degree from the University of Kaiserslautern, Kaiserslautern, Germany, in 1993. He is currently a senior researcher at HP Labs, Palo Alto, CA. He is the IEEE Computer Society 2014 president and the founding editor-in-chief of IEEE Computing Now. He was with the OSF Research Institute, Cambridge, MA, from 1994 to 1998, and the Institute Mihajlo Pupin, Belgrade, Serbia, from 1983 to 1991. He is a fellow of the IEEE.

ChunHui Suen received the BEng (First Class Hons.) degree from the National University of Singapore, Singapore, and the MSc and PhD degrees from Technische Universitaet Munchen, Munich, Germany. He is currently an engineer and researcher with a keen interest in the fields of IT security, virtualization, and cloud, with research experience in TPM-related technologies, various hypervisors (Xen, KVM, VMware), Linux kernel development, and various cloud stacks (OpenStack, AWS).

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
