
PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON
GRID, CLOUD, & CLUSTER COMPUTING

WORLDCOMP'19
GCC'19

Grid, Cloud, and Cluster Computing

Editors
Hamid R. Arabnia
Leonidas Deligiannidis, Fernando G. Tinetti

U.S. $49.95
ISBN 9781601324993

Publication of the 2019 World Congress in Computer Science,
Computer Engineering, & Applied Computing (CSCE'19)
July 29 - August 01, 2019 | Las Vegas, Nevada, USA
https://americancse.org/events/csce2019

Copyright © 2019 CSREA Press


This volume contains papers presented at the 2019 International Conference on Grid, Cloud, & Cluster
Computing. Their inclusion in this publication does not necessarily constitute an endorsement by the editors
or the publisher.

Copyright and Reprint Permission

Copying without a fee is permitted provided that the copies are not made or distributed for direct
commercial advantage, and credit to source is given. Abstracting is permitted with credit to the source.
Please contact the publisher for other copying, reprint, or republication permission.

American Council on Science and Education (ACSE)

Copyright © 2019 CSREA Press


ISBN: 1-60132-499-5
Printed in the United States of America
https://americancse.org/events/csce2019/proceedings
Foreword

It gives us great pleasure to introduce this collection of papers to be presented at the 2019 International
Conference on Grid, Cloud, and Cluster Computing (GCC’19), July 29 – August 1, 2019, at Luxor Hotel (a
property of MGM Resorts International), Las Vegas, USA. The preliminary edition of this book (available
in July 2019 for distribution on site at the conference) includes only a small subset of the accepted research
articles. The final edition (available in August 2019) will include all accepted research articles. This is due
to deadline extension requests received from most authors who wished to continue enhancing the write-up
of their papers (by incorporating the referees’ suggestions). The final edition of the proceedings will be
made available at https://americancse.org/events/csce2019/proceedings .

An important mission of the World Congress in Computer Science, Computer Engineering, and Applied
Computing, CSCE (a federated congress with which this conference is affiliated) includes "Providing a
unique platform for a diverse community of constituents composed of scholars, researchers, developers,
educators, and practitioners. The Congress makes concerted effort to reach out to participants affiliated
with diverse entities (such as: universities, institutions, corporations, government agencies, and research
centers/labs) from all over the world. The congress also attempts to connect participants from institutions
that have teaching as their main mission with those who are affiliated with institutions that have research
as their main mission. The congress uses a quota system to achieve its institution and geography diversity
objectives." By any definition of diversity, this congress is among the most diverse scientific meetings in
the USA. We are proud to report that this federated congress had authors and participants from 67 different
nations, representing a variety of personal and scientific experiences that arise from differences in culture
and values. As can be seen below, the program committee of this conference, as well as the program
committees of all other tracks of the federated congress, are as diverse as its authors and participants.

The program committee would like to thank all those who submitted papers for consideration. About 70%
of the submissions were from outside the United States. Each submitted paper was peer-reviewed by two
experts in the field for originality, significance, clarity, impact, and soundness. In cases of contradictory
recommendations, a member of the conference program committee was charged with making the final decision;
often, this involved seeking help from additional referees. In addition, papers whose authors included a
member of the conference program committee were evaluated using a double-blind review process.
One exception to the above evaluation process was for papers that were submitted directly to
chairs/organizers of pre-approved sessions/workshops; in these cases, the chairs/organizers were
responsible for the evaluation of such submissions. The overall acceptance rate for regular papers
was 18%; 20% of the remaining papers were accepted as poster papers (at the time of this writing, we had
not yet received the acceptance rates for a couple of individual tracks).

We are very grateful to the many colleagues who offered their services in organizing the conference. In
particular, we would like to thank the members of the Program Committee of GCC'19, members of the
congress Steering Committee, and members of the committees of federated congress tracks that have topics
within the scope of GCC. Many individuals listed below will be asked after the conference to provide
their expertise and services in selecting papers for publication (extended versions) in journal special
issues as well as for publication in a set of research books (to be prepared for publishers including
Springer, Elsevier, BMC journals, and others).

• Prof. Emeritus Nizar Al-Holou (Congress Steering Committee); Professor and Chair, Electrical and Computer Engineering Department; Vice Chair, IEEE/SEM-Computer Chapter; University of Detroit Mercy, Detroit, Michigan, USA
• Prof. Hamid R. Arabnia (Congress Steering Committee); Graduate Program Director (PhD, MS, MAMS); The University of Georgia, USA; Editor-in-Chief, Journal of Supercomputing (Springer); Editor-in-Chief, Transactions of Computational Science & Computational Intelligence (Springer); Fellow, Center of Excellence in Terrorism, Resilience, Intelligence & Organized Crime Research (CENTRIC)
• Prof. Dr. Juan-Vicente Capella-Hernandez; Universitat Politecnica de Valencia (UPV), Department of Computer Engineering (DISCA), Valencia, Spain
• Prof. Emeritus Kevin Daimi (Congress Steering Committee); Director, Computer Science and Software Engineering Programs, Department of Mathematics, Computer Science and Software Engineering, University of Detroit Mercy, Detroit, Michigan, USA
• Prof. Leonidas Deligiannidis (Congress Steering Committee); Department of Computer Information Systems, Wentworth Institute of Technology, Boston, Massachusetts, USA; Visiting Professor, MIT, USA
• Prof. Mary Mehrnoosh Eshaghian-Wilner (Congress Steering Committee); Professor of Engineering Practice, University of Southern California, California, USA; Adjunct Professor, Electrical Engineering, University of California, Los Angeles (UCLA), California, USA
• Prof. Louie Lolong Lacatan; Chairperson, Computer Engineering Department, College of Engineering, Adamson University, Manila, Philippines; Senior Member, International Association of Computer Science and Information Technology (IACSIT), Singapore; Member, International Association of Online Engineering (IAOE), Austria
• Prof. Hyo Jong Lee; Director, Center for Advanced Image and Information Technology, Division of Computer Science and Engineering, Chonbuk National University, South Korea
• Dr. Ali Mostafaeipour; Industrial Engineering Department, Yazd University, Yazd, Iran
• Dr. Houssem Eddine Nouri; Informatics Applied in Management, Institut Superieur de Gestion de Tunis, University of Tunis, Tunisia
• Prof. Dr., Eng. Robert Ehimen Okonigene (Congress Steering Committee); Department of Electrical & Electronics Engineering, Faculty of Engineering and Technology, Ambrose Alli University, Edo State, Nigeria
• Ashu M. G. Solo (Publicity); Fellow of British Computer Society; Principal/R&D Engineer, Maverick Technologies America Inc.
• Prof. Fernando G. Tinetti (Congress Steering Committee); School of Computer Science, Universidad Nacional de La Plata, La Plata, Argentina; also at Comision Investigaciones Cientificas de la Prov. de Bs. As., Argentina
• Prof. Layne T. Watson (Congress Steering Committee); Fellow of IEEE; Fellow of The National Institute of Aerospace; Professor of Computer Science, Mathematics, and Aerospace and Ocean Engineering, Virginia Polytechnic Institute & State University, Blacksburg, Virginia, USA
• Prof. Jane You (Congress Steering Committee); Associate Head, Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong
• Dr. Farhana H. Zulkernine; Coordinator of the Cognitive Science Program, School of Computing, Queen's University, Kingston, ON, Canada

We would like to extend our appreciation to the referees, the members of the program committees of
individual sessions, tracks, and workshops; their names do not appear in this document; they are listed on
the web sites of individual tracks.

As sponsors-at-large, partners, and/or organizers, each of the following (separated by semicolons)
provided help for at least one track of the Congress: Computer Science Research, Education, and
Applications Press (CSREA); US Chapter of World Academy of Science; American Council on Science &
Education & Federated Research Council (http://www.americancse.org/). In addition, a number of
university faculty members and their staff (names appear on the cover of the set of proceedings), several
publishers of computer science and computer engineering books and journals, chapters and/or task forces of
computer science associations/organizations from 3 regions, and developers of high-performance machines
and systems provided significant help in organizing the conference as well as providing some resources.
We are grateful to them all.

We express our gratitude to the keynote, invited, individual conference/track, and tutorial speakers; the
list of speakers appears on the conference web site. We would also like to thank the following: UCMSS
(Universal Conference Management Systems & Support, California, USA) for managing all aspects of the
conference; Dr. Tim Field of APC for coordinating and managing the printing of the proceedings; and the
staff of the Luxor Hotel (Convention department) at Las Vegas for the professional service they provided. Last
but not least, we would like to thank the Co-Editors of GCC'19: Prof. Hamid R. Arabnia, Prof. Leonidas
Deligiannidis, and Prof. Fernando G. Tinetti.

We present the proceedings of GCC'19.

Steering Committee, 2019


http://americancse.org/
Contents
SESSION: HIGH-PERFORMANCE COMPUTING - CLOUD COMPUTING

The Design and Implementation of Astronomical Data Analysis System on HPC Cloud 3
Jaegyoon Hahm, Ju-Won Park, Hyeyoung Cho, Min-Su Shin, Chang Hee Ree

SESSION: HIGH-PERFORMANCE COMPUTING - HADOOP FRAMEWORK


A Speculation and Prefetching Model for Efficient Computation of MapReduce Tasks on Hadoop HDFS System 9
Lan Yang

SESSION: LATE BREAKING PAPER: CLOUD MIGRATION


Critical Risk Management Practices to Mitigate Cloud Migration Misconfigurations 15
Michael Atadika, Karen Burke, Neil Rowe

SESSION
HIGH-PERFORMANCE COMPUTING - CLOUD COMPUTING

Chair(s)
TBA




The Design and Implementation of Astronomical Data Analysis System on HPC Cloud

Jaegyoon Hahm1, Ju-Won Park1, Hyeyoung Cho1, Min-Su Shin2, and Chang Hee Ree2
1Supercomputing Infrastructure Center, Korea Institute of Science and Technology Information, Daejeon, Republic of Korea
2Galaxy Evolution Research Group, Korea Astronomy and Space Science Institute, Daejeon, Republic of Korea

Abstract - Astronomy is a representative data-intensive science that can take advantage of cloud computing because it requires flexible infrastructure services for variable workloads and various data analysis tools. The purpose of this study is to show the usefulness of cloud computing as a research environment for analyzing large-scale data in sciences such as astronomy. We implemented an OpenStack cloud and a Kubernetes-based orchestration service for scientific data analysis. On the cloud, we have successfully constructed data analysis systems with a task scheduler and an in-memory database tool to support the task processing and data I/O environments required in astronomical research. Furthermore, we aim to construct a high-performance cloud service for various data-intensive research in more scientific fields.

Keywords: cloud computing, astronomical data analysis, data analysis platform, openstack, kubernetes

1 Introduction

Recently, in the field of science and technology, more and more data is generated through advanced data-capturing sources [1]. Naturally, researchers are increasingly using cutting-edge data analysis techniques, such as big data analysis and machine learning. Astronomy is a typical field that collects and analyzes large amounts of data through various observation tools, such as astronomical telescopes, and its data growth rate will increase rapidly in the near future. As a notable example, the Large Synoptic Survey Telescope (LSST) will start to produce large volumes of data, up to 20TB per day, from observing a large area of the sky in full operations from 2023. The total database for ten years is expected to be 60 PB for the raw data and 15 PB for the catalog database [2]. As another big data project, the Square Kilometer Array (SKA), to be constructed as the world's largest radio telescope by 2024, is also projected to generate and archive 130-300PB per year [3].

In this era of data deluge, there is a growing demand for utilizing cloud computing for data-intensive sciences. In particular, astronomical research demands cloud computing for its ability to acquire resources for simulation-driven numerical experiments or mass data analysis in an immediate and dynamic way. Therefore, the types of cloud service expected by astronomical researchers will be Infrastructure as a Service (IaaS), providing flexible resources for running existing software and research methodologies, and Platform as a Service (PaaS), for applying new data analytic tools.

In this paper, we propose a methodology for, and demonstrate the feasibility of, cloud computing that focuses on flexible use of resources and on the problems astronomical researchers face when using cloud services. Section 2 introduces related research, and Section 3 describes the features and requirements of the target application. In Section 4 we describe the implementation of the data analysis system for the target application. Finally, in Section 5 we provide conclusions and future plans.

2 Related Works

There have been several examples of cloud applications for astronomical research. The Gemini Observatory has been building a new archive using EC2, EBS, S3 and GLACIER from the Amazon Web Services (AWS) cloud to replace the existing Gemini Science Archive (GSA) [4]. In addition, Williams et al. (2018) have conducted studies to reduce the Panchromatic Hubble Andromeda Treasury (PHAT) photometric data set using Amazon EC2 [5].

Unlike these cases of using public clouds, there are also studies that build a private cloud environment to perform astronomical research. AstroCloud [6] is a distributed cloud platform which integrates many data management and processing tasks for the Chinese Virtual Observatory (China-VO). In addition, Hahm et al. (2012) developed a platform for constructing virtual machine-based Condor clusters for analyzing astronomical time-series data in a private cloud [7]. The purpose of that study was to confirm the possibility of constructing a cluster-type analysis platform to perform mass astronomical data analysis in a cloud environment.


3 Application Requirements and Design


The application used in this study is the MAGPHYS SED fitting code, which reads and analyzes the brightness and color data of galaxies to estimate their physical properties. The data used is the large-scale survey data of Galaxy And Mass Assembly (GAMA), a project to exploit the latest generation of ground-based and space-borne survey facilities to study cosmology and galaxy formation and evolution [8]. On a single processor, MAGPHYS typically takes 10 minutes to run for a single galaxy. As shown in Figure 1, the data analysis in the application starts with the data obtained by preprocessing the original image data collected from the telescope. The preprocessed data is a text-file DB, which is the input data for the analysis. The application extracts the data one line at a time from the input file, submits it to the task queue together with the spectral analysis code, and creates a new DB by storing the analyzed result in the output file. In a traditional research environment, the analysis would be done by building a dedicated cluster for data analysis or through a job scheduler in a shared cluster.

Fig. 1. Data Analysis Workflow

The main technical challenge of the application is to achieve faster data I/O and to use its own task queue for convenience. The GAMA dataset has information on approximately 197,000 galaxies. However, file-based I/O is too slow and hard to manage for a dataset of this size. Therefore, a fast data I/O tool and a job scheduler for high-throughput batch processing are required.

To satisfy these requirements, we designed two types of lightweight data analysis systems. First, data is read through file I/O as usual, and the data processing environment is configured using an asynchronous task scheduler for analysis (see Figure 2). In this case, we need a shared file system that can be accessed by the large number of workers performing analysis tasks. Second, as shown in Figure 3, data input is performed through an in-memory DB instead of file reading for faster I/O, and the output of the analysis is also stored in the in-memory DB.

Fig. 2. Data Analysis with Task Scheduler and File I/O

Fig. 3. Data Analysis with In-memory DB
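To make the Figure 1 workflow concrete, the following is a minimal local sketch under stated assumptions: the file names are hypothetical, fit_sed merely stands in for the MAGPHYS spectral-analysis code, and a local process pool substitutes for the DASK/CELERY schedulers described in Section 4.

    from concurrent.futures import ProcessPoolExecutor

    def fit_sed(line):
        # Placeholder for running the MAGPHYS code on one galaxy record
        # (roughly 10 minutes per galaxy on a single processor).
        return line.strip()

    if __name__ == "__main__":
        # One task per input line; results are appended to the output DB file.
        with open("gama_input.txt") as fin, open("gama_output.txt", "w") as fout:
            with ProcessPoolExecutor() as pool:
                for result in pool.map(fit_sed, fin):
                    fout.write(result + "\n")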

4 Cloud Data Analysis System Implementation

4.1 KISTI HPC Infrastructure as a Service

Korea Institute of Science and Technology Information (KISTI) is building a high-performance cloud service in order to support data-intensive research in various science and technology fields, because emerging data-centric sciences require a more flexible and dynamic computing environment than traditional HPC services provide. Big data and deep learning research especially needs customized HPC resources in a flexible manner, so the KISTI cloud will be a service providing customizable high-performance computing and storage resources, such as a supercomputer, GPU cluster, etc.

Fig. 4. OpenStack Cloud Testbed

In the first stage, the cloud service will be implemented on a supercomputer. KISTI's newly introduced supercomputer NURION is a national strategic research infrastructure to support R&D in various fields. In particular, there is a plan to utilize it for data-intensive computing and artificial intelligence. In order to build such a service environment, we will leverage cloud computing technologies.


In order to design the service and verify the required skills, we have constructed an OPENSTACK-based IaaS cloud testbed system using a computational cluster (see Figure 4). OPENSTACK [9] is a cloud deployment platform that is used as a de facto standard in industry and research, and is well suited to cloud deployments for high-performance computing too. The cluster used for the testbed has thirteen Intel Xeon-based servers: one deployment node, one controller node, three storage nodes, and compute nodes for the rest. The OPENSTACK services implemented here are NOVA (Compute), GLANCE (Image), NEUTRON (Network), CINDER (Block Storage), SWIFT (Object Storage), KEYSTONE (Identity), HEAT (Orchestration), HORIZON (Dashboard), MANILA (Network Filesystem) and MAGNUM (Container). In the case of storage, CEPH [10] storage was configured using three servers and used as a backend for the GLANCE, CINDER, SWIFT, and MANILA services. Apart from this, we have configured a Docker-based KUBERNETES orchestration environment using MAGNUM. KUBERNETES is an open source platform for automating Linux container operations [11]. In this study, it is composed of one KUBERNETES master and four workers.

4.2 Implementation of Data Analysis Platform in the Cloud

The data analysis system constructed in this study focuses on how to configure the task scheduler and the data I/O environment for task processing. We describe the architecture of the analysis system in Figure 5. First, the task scheduler should efficiently distribute and process individual tasks asynchronously. We adopted a lightweight task scheduler so that it can be dynamically configured and used independently, unlike the shared job schedulers such as PBS and SLURM in conventional HPC systems. In particular, tasks for astronomical data analysis, which require long research time, often call for asynchronous rather than synchronous processing. In the experiments, we used DASK [12] and CELERY [13] as task schedulers, which are readily available to scientific and technological researchers and are likely to be used in a common data analysis environment. The structure of the scheduler is very simple, consisting of a scheduler and workers. We write Python code to submit tasks to the scheduler and manage data. The difference between DASK and CELERY is that DASK allocates and monitors tasks in its own scheduler module, whereas CELERY workers' tasks are assigned from a separate message queue, such as RABBITMQ.

In the data I/O environment configuration, the initial experiment was conducted by configuring a shared file system using OPENSTACK MANILA for the file-based I/O processing used in the existing analysis environment. However, in data processing, file I/O is significantly slower than computation, which causes severe performance degradation when analyzing the entire dataset. In order to solve this bottleneck and improve overall performance, we used an in-memory DB tool called REDIS [14]. REDIS is a memory-based key-value store that is known to handle more than 100,000 transactions per second.

Fig. 5. Data Analytics System Architecture

    heat_template_version: queens
    …
    parameters:
      worker_num:
        default: 16
    …
    resources:
      scheduler:
        type: OS::Nova::Server
        properties:
          name: dask-scheduler
          image: Ubuntu 18.04 Redis with Dask
          …
          template: |
            #!/bin/bash
            pip3 install dask distributed --upgrade
            …
            dask-scheduler &
      workers:
        type: OS::Heat::ResourceGroup
        properties:
          count: { get_param: worker_num }
          resource_def:
            type: OS::Nova::Server
            properties:
              name: dask-worker%index%
              image: Ubuntu 18.04 Redis with Dask
              template: |
                #!/bin/bash
                apt-get install redis-server -y
                pip3 install dask distributed --upgrade
                dask-worker dask-scheduler:8786 &
    outputs:
      instance_name:
      instance_ip:

Fig. 6. HEAT Template for Analysis Platform with REDIS & DASK
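To illustrate how analysis tasks might be submitted to the platform this template deploys, here is a minimal sketch; it is not the authors' code, and the Redis host name, key layout, and the fit_sed stand-in for MAGPHYS are assumptions.

    from dask.distributed import Client
    import redis

    def fit_sed(record):
        # Stand-in for invoking the MAGPHYS SED-fitting code on one record.
        return record

    def analyze(galaxy_id):
        # Each worker reads one preprocessed galaxy record from the in-memory
        # DB, runs the fit, and stores the result back (the Figure 3 design).
        db = redis.Redis(host="dask-worker0", port=6379)  # hypothetical host
        record = db.get("gama:input:%d" % galaxy_id)      # hypothetical keys
        if record is None:
            return None
        db.set("gama:output:%d" % galaxy_id, fit_sed(record))
        return galaxy_id

    client = Client("dask-scheduler:8786")      # scheduler from the template
    futures = client.map(analyze, range(5300))  # ~5,300 galaxies (Section 5)
    client.gather(futures)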


A combination of task scheduler and data I/O environment can be created and configured automatically in an orchestration environment through OPENSTACK HEAT or KUBERNETES in our self-contained cloud. Figure 6 shows the structure of one of the HEAT templates used in this experiment. The template is structured with parameters and resources. The resource is composed of a scheduler and workers, and the required software is installed and configured for the scheduler and each worker after boot-up.

5 Conclusion and Future Work

Through experiments, we have successfully analyzed brightness and color data for about 5,300 galaxies in a parallel distributed processing environment consisting of DASK or CELERY with REDIS. Figure 7 shows one example galaxy from the GAMA data with a result of the MAGPHYS analysis in the cloud. With the OPENSTACK-based cloud, we confirmed that the research environment, especially a data analysis system with tools like a task scheduler and an in-memory DB, can be automatically configured and well utilized. In addition, we confirmed the availability of an elastic service environment through the cloud to meet volatile demand for large-scale data analysis.

Fig. 7. An example result of the MAGPHYS analysis on the cloud

In this study, we have identified some useful aspects of the cloud for data-driven research. First, we confirmed that it is easy to build an independent execution environment that provides the necessary software stack for research through the cloud. Also, in a cloud environment, researchers can easily reuse the same research environment and share research experience by reusing virtual machines or container images deployed by the research community.

In the next step, we will configure an environment for real-time processing of in-memory cache data. For practical real-time data processing, it is necessary to construct an optimal environment for data I/O as well as memory-based stream data processing, and various experiments need to be performed through the cloud. Based on the experience of building an astronomical big data processing environment in this study, we will provide a more flexible and higher-performance cloud service and let researchers utilize it in various fields of data-centric research.

6 References

[1] T. Hey, S. Tansley and K. Tolle, The Fourth Paradigm: Data-intensive Scientific Discovery, Microsoft Research, 2009.
[2] LSST Corporation. About LSST: Data Management. [Online]. Available from: https://www.lsst.org/about/dm/ 2019.03.10
[3] P. Diamond, SKA Community Briefing. [Online]. Available from: https://www.skatelescope.org/ska-community-briefing-18jan2017/ 2019.03.10
[4] P. Hirest and R. Cardenes, "The new Gemini Observatory archive: a fast and low cost observatory data archive running in the cloud", Proc. SPIE 9913, Software and Cyberinfrastructure for Astronomy IV, 99131E (8 August 2016); doi: 10.1117/12.2231833
[5] B. F. Williams, K. Olsen, R. Khan, D. Pirone and K. Rosema, "Reducing and analyzing the PHAT survey with the cloud", The Astrophysical Journal Supplement Series, Volume 236, Number 1
[6] C. Cui et al., "AstroCloud: a distributed cloud computing and application platform for astronomy", Proc. WCSN2016
[7] J. Hahm et al., "Astronomical time series data analysis leveraging science cloud", Proc. Embedded and Multimedia Computing Technology and Service, pp. 493-500, 2012
[8] S. P. Driver et al., "Galaxy And Mass Assembly (GAMA): Panchromatic Data Release (far-UV-far-IR) and the low-z energy budget", MNRAS 455, 3911-3942, 2016.
[9] OpenStack Foundation. OpenStack Overview. [Online]. Available from: https://www.openstack.org/software/ 2019.03.10
[10] Red Hat Inc. Ceph Introduction. [Online]. Available from: https://ceph.com/ceph-storage/ 2019.03.10
[11] The Kubernetes Authors. What is Kubernetes?. [Online]. Available from: https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/ 2019.03.10
[12] Dask Core Developers, Why Dask?. [Online]. Available from: https://docs.dask.org/en/latest/why.html 2019.03.10
[13] A. Solem, Celery - Distributed Task Queue. [Online]. Available from: http://docs.celeryproject.org/en/latest/index.html 2019.03.10
[14] S. Sanfilippo, Introduction to Redis. [Online]. Available from: https://redis.io/topics/introduction 2019.03.10.


SESSION
HIGH-PERFORMANCE COMPUTING - HADOOP FRAMEWORK

Chair(s)
TBA


A Speculation and Prefetching Model for Efficient Computation of MapReduce Tasks on Hadoop HDFS System

Lan Yang
Computer Science Department
California State Polytechnic University, Pomona
Pomona, CA 91768, USA

Abstract - The MapReduce programming model and the Hadoop software framework are keys to big data processing on high performance computing (HPC) clusters. The Hadoop Distributed File System (HDFS) is designed to stream large data sets at high bandwidth. However, Hadoop suffers from a set of drawbacks, particularly issues with small files as well as dynamic datasets. In this research we target big data applications working with many on-demand datasets of varying sizes. We propose a speculation model that prefetches anticipated datasets for upcoming tasks in support of efficient big data processing on HPC clusters.

Keywords: Prefetching, Speculation, Hadoop, MapReduce, High performance computing cluster.

1 Introduction

Along with the emerging technology of cloud computing, Google proposed the MapReduce programming model [1], which allows for massive scalability over unstructured data across hundreds or thousands of high performance computing nodes. Hadoop is an open source software framework that performs distributed processing of huge data sets across a cluster of commodity servers simultaneously [2]. Now distributed as Apache Hadoop [3], it is employed by many cloud services, such as AWS, Cloudera, HortonWorks, and IBM InfoSphere Insights, to offer big data solutions. The Hadoop Distributed File System (HDFS) [2], inspired by the Google File System (GFS) [4], is a reliable filesystem of Hadoop designed for storing very large files on a cluster of commodity hardware. To process big data in Apache Hadoop, the client submits data and a program to Hadoop; HDFS stores the data while MapReduce processes it.

While Hadoop is a powerful tool for processing massive data, it suffers from a set of drawbacks, including issues with small files and the lack of real-time processing (it supports batch processing only) [5]. Apache Spark [6] partially solved Hadoop's real-time and batch processing problems by introducing in-memory processing [7]. As a member of the Hadoop ecosystem, Spark doesn't have its own distributed filesystem, though it can use HDFS. Hadoop does not suit small data because HDFS, with its high-capacity design, lacks the ability to efficiently support random reading of small files. Small files are the major problem in HDFS.

In this research, we study a special type of iterative MapReduce task working on HDFS with input datasets coming from many small files dynamically, i.e., on demand. We propose a data prefetching speculation model aiming at improving the performance and flexibility of big data processing on Hadoop HDFS for that special type of MapReduce task.

2 Background

2.1 Description of a special type of MapReduce tasks

In today's big data world, the MapReduce programming model and the Hadoop software framework remain popular tools for big data processing. Based on a number of big data applications performed on Hadoop, we observed the following:

(1) An HDFS file splits into chunks, typically 64-128MB in size. To benefit from Hadoop's parallel processing ability, an HDFS file must be large enough to be divided into multiple chunks. Therefore, a file is considered small if it is significantly smaller than the HDFS chunk size.

(2) While many big data applications use large data files that can be pushed to the HDFS input directory prior to task execution, some applications use many small datasets distributed across a wide range.


(3) With the increasing demand for big data processing, more and more applications now require multiple rounds (or iterations) of processing, with each round requiring new datasets determined by the outcome of previous computation. For example, in a data processing application for a legal system, the first round of MapReduce computation uses prestored case documents, while the second round might require access to certain assets or utilities datasets based on the case outcomes resulting from the first-round analysis. The assets or utilities datasets could consist of hundreds to thousands of files ranging from 1KB to 10MB, with only dozens of files relevant depending on the outcome of the first round. It would be very inefficient or inflexible if we had to divide these two rounds into separate client requests. Also, if we could overlap computation and data access time by speculating and prefetching data, we could reduce the overall processing time significantly. Here we refer to big data applications with one or more of the above characteristics (i.e., requiring iterative or multiple passes of MapReduce computation, using many small files to form an HDFS chunk, or using dynamic datasets that are dependent on the outcome of previous rounds of computation) as a special type of MapReduce tasks.

2.2 Observation: execution time and HDFS chunks

We conducted several dozens of big data applications using Hadoop on a high-performance computing cluster. Table 1 summarizes the MapReduce performance of three relatively large big data analytics tasks.

Table 1: Performance data for some big data applications (*requires multi-phase analysis)

2.3 Computation time vs. data fetch time

In this research, we first tested and analyzed data access times ranging from 1K to 16MB on an HPC cluster which consists of 2 DL360 management nodes, 20 DL160 compute nodes, 3.3 TB RAM, 40GBit InfiniBand, and a 10GBit external Ethernet connection, with overall system throughput of 36.6 Tflp in double precision mode and 149.6 Tflp. The Slurm job scheduler [8] is the primary software we use for our testing. The performance data shown in Figure 1 serve as our basis for deriving the performance of our speculation algorithms.

Figure 1: Data Access Performance Base

3 Speculation and Prefetching Models

3.1 Speculation model

We establish a connection graph (CG) to represent relations of commonly used tasks, with tasks as nodes and edges as links between tasks. For example, a birthday party planning task links to restaurant reservation tasks as well as entertainment or recreation tasks; an address change task is linked with moving or furniture shopping tasks. The links in the CG are prioritized; for example, for the birthday task, the restaurant task is initially set with higher priority than the movie ticketing task. The priorities are in the 0.0 to 1.0 range and are dynamically updated based on the outcome of our prediction. For example, based on the connections in the CG and the priorities of the links, we predict that the top two tasks following the birthday task are, in order, the restaurant task and the movie task. If for that particular application it turns out the movie task is the correct choice, we increase its priority by a small fraction, say 0.1, capped at a 1.0 maximum.
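The following is a minimal Python sketch of this connection graph and its priority update, consistent with the dictionary representation described in Section 4; the task names and initial priorities are illustrative only.

    import random

    # Connection graph: task -> links sorted by descending priority.
    CG = {
        "birthday": [["restaurant", 0.8], ["movie", 0.6]],
        "address_change": [["moving", 0.7], ["furniture", 0.5]],
    }

    def update_priorities(task, predicted, actual, step=0.1):
        # Demote a missed prediction, reward the observed next task, and add
        # the observed task with a random 0.1-0.5 priority if it is new.
        links = CG.setdefault(task, [])
        names = [link[0] for link in links]
        if predicted != actual and predicted in names:
            links[names.index(predicted)][1] = max(0.0, links[names.index(predicted)][1] - step)
        if actual in names:
            links[names.index(actual)][1] = min(1.0, links[names.index(actual)][1] + step)
        else:
            links.append([actual, random.uniform(0.1, 0.5)])
        links.sort(key=lambda link: link[1], reverse=True)

    predicted = CG["birthday"][0][0]                  # top candidate: restaurant
    update_priorities("birthday", predicted, "movie")
    print(CG["birthday"])                             # both links now at 0.7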


3.2 Prefetching algorithm

The prefetching concept is inspired by compiler-directed instruction/data prefetching techniques that speculate and prefetch instructions for multiprocessing [9] [10]. Our basic fetching strategy is: overlapping with the computation of the current task, we prefetch the associated datasets for the next round of computation based on the task speculation.

The association between tasks and data files can be represented as a many-to-many relation. Each task is pre-associated with a list of files in the order of established ranks. For example, the restaurant task could be associated with pizza delivery files, restaurant location files, etc. The ranks are initialized based on the popularity of the service, with values in the 0.0 to 1.0 range where higher values mark the most popular or most recommended services. The ranks are then adjusted based on the network distance of the file locations, with priority given to local or nearby files. Again, after task execution, if a prefetched file turned out to be irrelevant (i.e., the whole file was filtered out at an early MapReduce stage), the rank of that file with regard to that task is reduced.

Based on the system configuration we also preset two constant values K and S, with K as the optimized/recommended number of containers and S the size of each container (we suggest S to be the HDFS chunk size and K to be the desired number of chunks with regard to the requested compute nodes). When prefetching datasets for a set of speculated tasks, the prefetching process repeatedly reads files until it fills up all the containers.

4 Simulation Design

We used a Python dictionary to implement the connection graph CG, with each distinct task name as a key. The value for a key is a list of task links sorted in descending order of priorities. The task-data relations are also represented as a Python dictionary, with task names as keys and lists of data file names sorted in descending order of ranks as values. Currently we simulate the effectiveness of prefetching by using parallel processes created by Slurm directly. Once the approaches are validated we will test them on Hadoop/HDFS.

4.1 Speculation Model

For any current task t, the simulated speculation model always fetches the top candidate task from the CG directory, i.e., CG[t][0], as p and starts the prefetching process. When t completes it will choose the next task t'. If t' is the same as p, let t be p and the process continues. If t' is different from p, we restart the prefetching process, reduce the priority of p by one level (currently 0.1) but not below 0.0, and increase the priority of t' by 0.1 (capped at 1.0) if it is already in t's connection link, or add it to t's connection link with a randomly assigned priority (between 0.1 and 0.5) if it is not in t's connection link yet.

4.2 Prefetching Model

(1) Configuration: one file node N (i.e., a process that only reads data in and writes to a certain shared location), with four shared storages (arrays or dictionaries) representing the containers, C1 to C4. Initially all Ci's are empty, and each container has a current capacity and a maximum capacity (all containers may have the same maximum capacity). This is easily extendable to multiple file nodes and a larger number of containers.

(2) Assume the task p selected by the speculation scheme is associated with n small files, say F1, ..., Fn. Read in the files in the order F1, ..., Fn. For each file read in, record its size as sj, then search for a container with current capacity + sj < maximum capacity, lock it once found, and push the content in. If no available container is found, the file content is set aside and we increase our failure count by 1 (the failure count is initially set to 0). Continue to fetch the next file until reaching the condition spelled out in (3).

(3) The prefetching process ends when all containers reach a certain percentage full (e.g., at least 80% full) or when the failure count reaches a certain number (say 3). Note: one failure doesn't mean the containers are full; it could be the scenario that we fetched a very large dataset that couldn't fit into any of the current containers. However, in this case we may further fetch the files next in the list, as these might be smaller files.

5 Conclusions

In this research work, we studied the possibility of congregating small datasets dynamically to form large data chunks suitable for MapReduce tasks on Hadoop HDFS. We proposed task speculation and file prefetching models to speed up overall processing. We have set up a primitive simulation test suite to assess the feasibility of the speculation and prefetching models. Since we are currently designing the schemes in Slurm multiprocess environments without using HDFS, no performance gain could be measured yet. Our future (and on-going) work is to implement the design schemes from HPC Slurm processes onto the Hadoop HDFS system and measure their effectiveness using real-world big data applications.

6 References
[1] Jeffrey Dean and Sanjay Ghemawat, MapReduce:
Simplified Data Processing on Large Clusters, Google
Research,
https://research.google.com/archive/mapreduce-
osdi04.pdf
[2] Konstantin Shvachko, Hairong Kuang, Sanjay
Radia, Robert Chansler, The Hadoop Distributed File
System, 2010 IEEE 26th Symposium on Mass Storage
Systems and Technologies (MSST)
[3] Apache Hadoop https://hadoop.apache.org/
[4] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak
Leung, The Google File System,
https://static.googleusercontent.com/media/research.goo
gle.com/en//archive/gfs-sosp2003.pdf
[5] DATAFLAIR Team, 13 Big Limitations of Hadoop
& Solution To Hadoop Drawbacks, https://data-
flair.training/blogs/13-limitations-of-hadoop/, March 7,
2019.
[6] Apache Spark https://spark.apache.org/
[7] Matei Zaharia, Mosharaf Chowdhury, Michael
Franklin, Scott Shenker, Ion Stoica, Spark: Cluster
Computing with Working Sets, Proceedings of the 2nd
USENIX conference on Hot topics in cloud computing,
2010.
[8] Slurm job scheduler, https://slurm.schedmd.com/
[9] Seung Woo Son, Mahmut Kandemir, Mustafa
Karakoy, Dhruva Chakrabarti, A compiler-directed data
prefetching scheme for chip multiprocessors,
Proceedings of the 14th ACM SIGPLAN symposium on
Principles and practice of parallel programming (PPoPP
'09)
[10] Ricardo Bianchini, Beng-Hong Lim, Evaluating
the Performance of Multithreading and Prefetching in
Multiprocessors, https://doi.org/10.1006/jpdc.1996.0109


SESSION
LATE BREAKING PAPER: CLOUD MIGRATION

Chair(s)
TBA




Critical Risk Management Practices to Mitigate Cloud Migration Misconfigurations

M. Atadika, K. Burke, and N. Rowe
Computer Science Department, Naval Postgraduate School, Monterey, CA, US

Regular Research Paper

Abstract - We identified that as private enterprises continue to gravitate toward the cloud to benefit from cost savings, they may be unprepared to confront four major issues inherent to cloud architecture. Mitigating risks will require that migrating organizations properly recognize and understand: the critical misalignment between service model selection and consumer expectations within the cloud architecture, the cloud-borne vulnerabilities and cloud-specific threats that together create technological challenges, the causal relationship between customer misconfigurations and cloud spills, and the complexity of implementing security controls. Collectively, the four substantive issues cause risk management to manifest itself in more complicated permutations in the cloud. To address these vexing cybersecurity risks, this paper introduces the unifying concept of transformational migration and recommends decoding the cloud service model selection, employing cryptographic erase for applicable use cases, consulting the broadest cloud security control catalogs in addressing cloud-negative controls, managing supply-chain risk through cloud service providers, and adopting a reconfigured Development Security Operations (DevSecOps) workforce.

Keywords: Cloud, Misconfigurations, Risk, Migration, Service model

1 Introduction

During the five-year period from 2014-18, the largest cloud service provider, Amazon Web Services (AWS), a proxy of the accelerating technological migration, experienced revenue growth at a compound annual growth rate of 47.9% [1]. See Figure 1.

Figure 1. Quarterly revenue of AWS from Q1 to Q4 (in USD millions). Source: [1].

This growth in revenue directly corresponds to a growing trend of data departing on-premises architectures for cloud destinations. Cost may be a primary causal factor for this uptick in cloud migration. Cloud service providers charge fixed unitized fees for the work/cycle performed by each instance of utilization. The tradeoff for these cost savings, however, is potentially magnified insecurity.

For example, on June 1, 2017, the Washington Post reported that a large federal contractor for the Department of Defense (DoD) accidentally leaked government passwords on an AWS server related to a work assignment for the National Geospatial-Intelligence Agency [2]. Regrettably, this is not an isolated episode but the third recently documented instance of data mishandling by the well-established government contracting firm. The report went on to describe a prevalence of government agencies pivoting to the cloud, with industry leaders substantiating that this is, in fact, indicative of a more universal shift toward cloud-centric computing [2].

As private enterprises rush to the cloud to reap the benefits of financial savings and increased services, they will confront four major issues inherent to cloud architecture. This paper posits that the velocity of cloud adoption—multiplied by the immaturity of the available cloud workforce pool—warrants a rigorous investigation into the sufficiency of risk management capabilities and preparedness. Managing or mitigating risks will require that migrating organizations properly recognize and understand

• the critical misalignment between service model selection and consumer expectations within the cloud architecture,


• the cloud-borne vulnerabilities and cloud-specific threats that together create technological challenges,
• the causal relationship between customer misconfigurations and cloud spills, and
• the complexity of implementing security controls.

Collectively, the four substantive issues cause risk management to manifest itself in more complicated machinations in the cloud. To address these issues and their related vexing cybersecurity risks, this paper introduces the unifying concept of transformational migration and recommends decoding the cloud service model selection, employing cryptographic erase for applicable use cases, consulting the broadest cloud security control catalogs in addressing cloud-negative controls, managing supply-chain risk through cloud service providers, and adopting a reconfigured Development Security Operations (DevSecOps) workforce [3]. The following sections of this paper justify these recommendations.

2 Background

Prior to the cloud, on-premises system applications were highly customized and expected to operate within a standalone data center. Accordingly, application data was structured for minimal to no interaction with other applications [4]. In contrast, cloud applications are highly agile and are expected to operate in multiple data centers; permissioned cloud data is available on-demand for maximal interaction with other applications. For transference to the cloud, which uses multiple servers, legacy applications developed prior to 2005 likely need to have their source code refactored, since it is doubtful those applications were written to accommodate running on multiple servers [5]. Consumers often execute a cloud migration incorrectly, assuming that they can simply port their entire traditional IT architecture to the cloud without any modification (often referred to as lift and shift [6]). The lift and shift assumption can contribute to a redundancy of consumer misconfigurations. Lift and shift is problematic because the logic assumes that an on-premises application and its security controls are technically compatible with cloud architectures without any modifications [7]. Contrary to popular perception, it is not the responsibility of the cloud service provider to make a lift and shift migration work, because this cloud transition "strategy" is orthogonal to cloud architectures. This underscores why on-premises applications require changes at multiple logical layers to properly function in a cloud service model, as depicted in the cloud stack in Figure 2.

Figure 2. Cloud Security Responsibility Matrix (On-premises Application). Source: [8].

The high customization of on-premises system applications created two deficiencies relative to scalability when compared to cloud systems: it inhibited applications from leveraging data from other applications, and it limited administrator knowledge to a small subset of applications, creating pockets of specialization. Whereas there is heterogeneity in on-premises applications, cloud applications have greater homogeneity, requiring fewer application development specialists and more generalists.

When an organization transitions to the cloud, it loses governance because it no longer owns resources but rather rents them. The computing resources are also remote to the organization and under the control of the cloud service provider. The cloud interjects the cloud service provider relationship into the customer's workflow. The SANS Institute's 2016 white paper, titled Implementing the Critical Security Controls in the Cloud, underscores that roles have to be clearly defined to accommodate the interjection of the cloud service provider [9]. The consumer and the cloud service provider both share responsibilities in the cloud relationship. From the cloud service provider's perspective, the client has responsibilities that span from running applications down to the guest operating system, while the provider is responsible for the host operating system down to the physical data center [10]. This type of cooperative security is commonly referred to as a "shared responsibility model" in the cloud, and this very division of responsibilities can create confusion or uncertainty that contributes to customer misconfiguration.
referred to as lift and shift; [6]). The lift and shift assumption division of responsibilities can create confusion or uncertainty
can contribute to a redundancy of consumer that contributes to customer misconfiguration.
misconfigurations. Lift and shift is problematic because the
logic assumes that an on-premises application and its Fortunately (or unfortunately), there is no one right way
security controls are technically compatible with cloud to configure all of the available settings: a one-size-fits-all
architectures without any modifications [7]. Contrary to cloud does not exist. To this point, the Central Intelligence
popular perception, it is not the responsibility of the cloud Agency (CIA) and National Security Agency (NSA) pursued
service provider to make a lift and shift migration work, two different paths in achieving similar cloud computing
because this cloud transition “strategy” is orthogonal to capabilities. Even though they are two seemingly similar,
cloud architectures. This underscores why on-premises technologically sophisticated U.S. intelligence agencies, each
applications require changes at multiple logical layers to organization had to make requirement-dependent choices with
properly function in a cloud service model, as depicted in the respect to cloud deployment and service models. In a 2014
cloud stack in Figure 2. interview, the CIA’s chief information officer (CIO), Doug
Wolfe, confirmed that the two clandestine agencies chose to
The high customization of on-premises system applications build out their respective cloud architectures differently [11].
created two deficiencies relative to scalability when compared Wolfe explained that the CIA cloud was built using
to cloud systems: it inhibited applications from leveraging data commercial cloud products with participation from a
from other applications and it limited administrator knowledge commercial cloud service provider, while the NSA cloud was
to a small subset of applications, creating pockets of designed in-house, also using commercially available products
specialization. but without participation from a commercial cloud service
provider [11]. The service model an organization selects


determines the level of involvement it must have in application this break in the event of a cyber-incident is the inefficiency of
development within the cloud service provider’s environment. locating stored media, which includes artifacts, log files, and
The service model selection in itself also determines a other evidentiary traces [12]. In on-premises systems, the
particular set of consumer challenges balanced against greater operating systems dependably and centrally manage the
autonomy in managing cloud-specific settings, configurations, consistent generation and storage of valuable evidence traces
and controls. Thankfully, the selection of the cloud service and the information is well documented. The NCC FSWG also
model can highlight the cloud layers for which the consumer is observed that in the cloud, “user based login and controls are
responsible and, therefore, which security controls to typically in the application rather than in the operating system”
implement. The security controls may essentially be the same. [12]. Cloud technologies decouple user identification
However, the implementer of the controls may (and probably credentials from a corresponding physical workstation [12].
will) differ by cloud service model. This suggests that any These idiosyncrasies of cloud architecture also create
meaningful discussion about cloud security will not refer to the inefficiencies in data retrieval.
ubiquitous cloud but will instead reference a specific selected
architecture instantiation, reflecting committed organizational Not only do organizations need to consider the primary
choices. and secondary consequences of diverging from traditional
operating security models, but they also must recognize that
Once organizations have selected the appropriate service model, they must then address the technological challenges inherent in the cloud with respect to their service model selection. The confounding aspect of interoperability is that the cloud integrates multiple sophisticated technologies, cloud service providers, servicing counterparties, logical layers, hardware, and endpoint devices. A cloud service provider’s trustworthiness is compromised if any of the multiple parties or technological interchanges is compromised. At the National Institute of Standards and Technology (NIST), the NIST Cloud Computing Forensic Science Working Group (NCC FSWG) shares an example of the enmeshed relationships a forensic investigator may have to unwind: “A cloud Provider that provides an email application (SaaS [software as a service]) may depend on a third-party provider to host log files (i.e., PaaS [platform as a service]), which in turn may rely on a partner who provides the infrastructure to store log files (IaaS [infrastructure as a service])” [12]. Therefore, technological capabilities and limitations dictate the realities that cloud service providers must integrate. The remote delivery of cloud services and the cloud service provider’s capacity as an intermediary give rise to organizational boundary challenges. The multi-geographical operations of cloud service providers create additional legal challenges, as consumers might fall under regulations in multiple jurisdictions if they do not limit the location of servers to only organizationally acceptable jurisdictions.
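To make these enmeshed relationships concrete, the sketch below models the NCC FSWG example as a chain of provider dependencies so that every party a forensic investigator must query can be enumerated. The provider names and fields are invented for illustration and are not drawn from [12].

    # Hypothetical model of nested provider dependencies (names invented).
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Provider:
        name: str                              # e.g., the consumer-facing vendor
        role: str                              # "SaaS", "PaaS", or "IaaS"
        upstream: Optional["Provider"] = None  # provider this one depends on

    def forensic_chain(provider: Optional[Provider]) -> list[str]:
        """Enumerate every party in the dependency chain, top to bottom."""
        chain = []
        while provider is not None:
            chain.append(f"{provider.role}: {provider.name}")
            provider = provider.upstream
        return chain

    # The NCC FSWG example [12]: email SaaS -> log-hosting PaaS -> storage IaaS.
    iaas = Provider("StoreCo", "IaaS")
    paas = Provider("LogHostCo", "PaaS", upstream=iaas)
    saas = Provider("MailCo", "SaaS", upstream=paas)
    print(forensic_chain(saas))

Real dependency graphs may branch rather than form a simple chain, which only compounds the unwinding problem described above.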
The cloud’s remote delivery presents obstacles to data retrieval that are foreign to on-premises systems. Unlike in on-premises systems, cloud storage is neither local nor persistent; physically attached data storage is only temporary after the abstraction that enables pooling and dynamic customer provisioning. Abstraction decouples the physical resources through a process called virtualization, which enables resource pooling. Furthermore, storage is designated as a cloud service provider responsibility, as depicted in Figure 2. The NCC FSWG characterized the separation of a virtual machine from local persistent storage: “Thus, the operational security model of the application, which assumes a secure local log file store, is now broken when moved into a cloud environment” [12]. The consequence of this separation extends beyond storage [12]. Cloud technologies decouple user identification credentials from a corresponding physical workstation [12]. These idiosyncrasies of cloud architecture also create inefficiencies in data retrieval.

Not only do organizations need to consider the primary and secondary consequences of diverging from traditional operating security models, but they must also recognize that the cloud exposes them to new vulnerabilities and threats. Several cloud vulnerabilities are distinct and completely cloud-specific. Before designating a vulnerability as cloud-native, it needs to meet a set of criteria: a litmus test to decide whether a vulnerability should be classed as cloud-specific. Determining whether a vulnerability is cloud-native is helpful in discussions with reluctant managers about the relative risk of the cloud. Published by the Institute of Electrical and Electronics Engineers (IEEE), “Understanding Cloud Computing Vulnerabilities” provides a rubric that helps determine if vulnerabilities are cloud-specific [13]. According to the rubric, a vulnerability is cloud-specific if it:

• is intrinsic to or prevalent in a core cloud computing technology,
• has its root cause in one of NIST’s essential cloud characteristics,
• is caused when cloud innovations make tried and tested security controls difficult or impossible to implement, or
• is prevalent in established state-of-the-art cloud offerings [13].

The first bullet refers to web applications, virtualization, and cryptography as the core cloud technologies [13]. The second bullet alludes to the five essential characteristics defined by NIST: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service [13]. The third bullet identifies instances when on-premises system security practices do not transfer to the cloud; for example, the “cloud-negative controls” identified by [9] and elaborated upon in section 3.1, which covers the implementation of security controls. The fourth bullet describes the cloud as pushing present technological boundaries: if a vulnerability is identified in an advanced cloud offering, one that has not been previously identified, then it must be a cloud-specific vulnerability. Although there is some merit to the argument, the IEEE paper erroneously includes weak authentication implementations, which are not technically exclusive to the cloud [13]. Due to the flaw in this interpretation, this fourth indicator can only be seen as partially attributed, or a hybrid cloud-specific vulnerability.
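Read as a checklist, the rubric might be encoded as in the following illustrative sketch. The indicator keys, weights, and threshold are this paper’s interpretation of [13] (with the contested fourth indicator down-weighted), not an artifact published by the IEEE paper itself.

    # Illustrative encoding of the IEEE rubric [13]; weights are this paper's
    # reading (the contested fourth indicator counts only one half).
    RUBRIC_WEIGHTS = {
        "core_technology": 1.0,          # web apps, virtualization, cryptography
        "essential_characteristic": 1.0, # NIST's five essential characteristics
        "control_breakage": 1.0,         # tried-and-tested controls stop working
        "state_of_the_art": 0.5,         # only partially cloud-attributable
    }

    def is_cloud_specific(indicators: set[str], threshold: float = 1.0) -> bool:
        """Treat a vulnerability as cloud-specific if its indicators reach
        the threshold; a single full-weight indicator is sufficient."""
        return sum(RUBRIC_WEIGHTS.get(i, 0.0) for i in indicators) >= threshold

    # Weak authentication trips only the contested indicator, so it fails.
    print(is_cloud_specific({"state_of_the_art"}))           # False
    print(is_cloud_specific({"essential_characteristic"}))   # True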
Considering the vulnerabilities borne of cloud architectures, it is important to determine which cloud-specific threats could exploit those vulnerabilities. All organizations must update their threat model to include cloud-generated threats. For example, bad actors are presently exploiting cloud services in their attacks by remaining anonymous inexpensively, decentralizing their operations by using multiple cloud service providers, and provisioning superior computing power for a fraction of the cost with pay-as-you-go pricing. In 2016, the Cloud Security Alliance released the “Treacherous 12: Cloud Computing Top Threats in 2016,” which it compiled by surveying cloud industry experts. The Treacherous 12 ranks a dozen security concerns in order of severity (Table 1).
Table 1. Treacherous 12 Threats Summary. Adapted from [14].

Rank  Threat in Conventional Architectures     Threat in the Cloud
1     Data breaches                            Data breaches
2     Weak access management                   Weak access management
3     (cloud-specific)                         Insecure APIs
4     System and application vulnerabilities   System and application vulnerabilities
5     Account hijacking                        Account hijacking
6     Malicious insiders                       Malicious insiders
7     Advanced Persistent Threats              Advanced Persistent Threats
8     Data loss                                Data loss
9     Insufficient due diligence               Insufficient due diligence
10    (cloud-specific)                         Nefarious use of cloud services
11    Denial of service                        Denial of service
12    (cloud-specific)                         Shared technology vulnerabilities

Rows marked (cloud-specific) have no counterpart in conventional architectures.
Note that from this analysis, of the 12 greatest estimated threats that experts say emanate from the cloud, only three point to truly cloud-specific vulnerabilities. Insecure application programming interfaces (APIs) (no. 3), nefarious use of cloud services (no. 10), and shared technology vulnerabilities (no. 12) are the cloud-specific threats that merit additional defense-in-depth security measures. While not cloud-specific, weak access management, account hijacking, malicious insiders, and insufficient due diligence are the next tier of cloud threats to address.
3 Risk Management Considerations

To securely operate in the cloud, risk management considerations must protect an organization’s assets against a range of undesirable events and associated consequences. Cloud spills are a compelling example of such events. A data spill is any event involving the unauthorized transfer of confidential data from an accredited information system to one that is not accredited [8]. A cloud spill is a type of data spill, specifically originating from a cloud environment. As early as 2013, the government had investigated data spillage specific to the cloud, documented in a Department of Homeland Security (DHS) presentation on February 14, 2013, “Spillage and Cloud Computing.” Clearly, all migrating organizations, but especially agencies involved in national security matters, must effectively reduce cloud spills; however, they still have not found a solution to this problem.
Instead of reacting to the aftereffects of cloud spills, migrating organizations need to determine how to anticipate and mitigate. An informed service model selection can facilitate better prioritization of the pertinent cloud services, logical layers, and underlying data structures. The initial benefit of focusing on service model selection is that doing so raises awareness of additional cloud security challenges, enabling the consumer to abate these issues through a combination of policy changes or contracts with additional security services. Data security considerations will directly address the cloud’s information structures, which comprise the data either to be stored or to be processed by computing processes. Application security considerations will directly address the cloud’s application structures, which comprise the application services used in building applications or the resultant cloud-deployed application itself [6]. Infrastructure security considerations will directly address the cloud’s scalable and elastic infrastructure, which comprises the enormity of the cloud service provider’s pooled core computing, networking, and storage resources. Configuration, management, and administrative security considerations will directly impact the cloud’s metastructures, which enable the cohesive functioning of communication interoperability protocols between the various layer interfaces; critical configuration and management settings are embedded in metastructure signals [6].
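The paragraph above pairs each security consideration area with one cloud structure; the hedged sketch below records that pairing as a lookup table. The phrasing of the values is condensed from the discussion, not quoted from [6].

    # Condensed lookup of consideration areas to cloud structures (after [6]).
    SECURITY_FOCUS = {
        "data security": "information structures: data to be stored or processed",
        "application security": "application structures: services and deployed apps",
        "infrastructure security": "pooled compute, networking, and storage",
        "configuration/management security": "metastructures: interface settings",
    }

    # A practitioner can walk the map to confirm every structure is covered.
    for area, structure in SECURITY_FOCUS.items():
        print(f"{area:36s} -> {structure}")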
The merit of this lower-level understanding is a firmer comprehension of how standard cloud communication functions at different layers within the cloud’s shared-responsibilities model. Accordingly, security practitioners map their organizational responsibilities to their service model selections. This approach maximizes the information security signal-to-noise ratio by isolating only the actionable logical layers. Migrating organizations can begin by replacing applications with software as a service to abandon legacy code, followed by rebuilding cloud-native or refactoring backward-compatible application code with platform as a service, and finally by re-hosting (lift and shift) onto infrastructure as a service those applications that will not benefit from either current or future cloud capabilities [15]. Software-as-a-service solutions lack customer customization and can lead to vendor lock-in by making the porting of data more challenging. A vendor lock-in mitigant is service-oriented architecture development, which produces applications that are treated like “services,” as in “anything as a service.” Once the application can be treated as a service, it should be able to port or “plug” into any cloud service provider seamlessly and temper the fears of having to make large-scale changes to existing code bases for interoperability with the proprietary requirements of a new cloud service provider. Service-oriented architecture is easily reconfigurable.
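The migration ordering described above (replace with SaaS, rebuild or refactor on PaaS, re-host on IaaS) can be expressed as a simple decision rule, sketched below. The strategy labels follow [15]; the boolean inputs are hypothetical judgments an organization would supply from its own application portfolio.

    # Illustrative decision rule for the migration ordering above (labels
    # follow [15]; the boolean inputs are hypothetical portfolio judgments).
    def migration_strategy(is_commodity_app: bool,
                           worth_rewriting: bool,
                           gains_from_cloud: bool) -> str:
        if is_commodity_app:
            return "replace with SaaS (abandon legacy code)"
        if worth_rewriting:
            return "rebuild cloud-native or refactor onto PaaS"
        if not gains_from_cloud:
            return "re-host (lift and shift) onto IaaS"
        return "revise: reassess the application before migrating"

    print(migration_strategy(is_commodity_app=False,
                             worth_rewriting=True,
                             gains_from_cloud=True))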
The prevailing methods for either mitigating or responding to cloud data spills are insufficient in terms of consumer autonomy and cloud confidentiality. In regard to autonomy, cloud service providers have invented the concept of bring your own key (BYOK), which bolsters a false sense of security regarding consumer encrypted data. BYOK solutions imply that the consumer’s key is the sole key involved in encrypting and decrypting the customer’s data, which is not the case [16]. In fact, the consumer key is an unnecessary input for the cloud service provider to access the consumer’s data (e.g., when responding to subpoena requests). In practice, the cloud service provider first uses their own key to encrypt the data and the customer key second to encrypt the cloud service provider key. The DoD has recognized this deficiency of the BYOK construct and has secured an alternate remediation: cryptographic erase.
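Before turning to cryptographic erase, a minimal sketch of the BYOK key hierarchy described above follows, using the third-party Python cryptography package. Symmetric Fernet keys stand in for whatever key types a real provider uses; this illustrates why the customer key is not required to reach the plaintext, and is not any provider’s actual implementation.

    # Simplified BYOK sketch: the provider's key encrypts the data; the
    # customer's key merely wraps the provider's key (after [16]).
    from cryptography.fernet import Fernet

    provider_key = Fernet.generate_key()   # provider-held data-encryption key
    customer_key = Fernet.generate_key()   # customer-supplied "BYOK" key

    ciphertext = Fernet(provider_key).encrypt(b"consumer data at rest")
    wrapped_key = Fernet(customer_key).encrypt(provider_key)  # key wrapping

    # The provider can reach the plaintext with its own key alone; the
    # customer key never enters this path (e.g., a subpoena response).
    assert Fernet(provider_key).decrypt(ciphertext) == b"consumer data at rest"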
Cryptographic erase is credited by the DoD as “high-assurance data destruction … [in which] media sanitization is performed by sanitizing the cryptographic keys used to encrypt the data, as opposed to sanitizing the storage locations on media containing the encrypted data itself” [8]. Sanitization is the process of making data unrecoverable. Cryptographic erase achieves the goal of data destruction indirectly, by way of key erasure. Cryptographic erase also accommodates “partial sanitization,” in which a subset of the data is sanitized, but this requires the use of unique keys for each subset [8]. Cryptographic erase paired with deleting files is more expedient than physically sanitizing a cloud service provider environment. However, cryptographic erase is only effective for encrypted data. Therefore, the DoD explicitly tasks its components and agencies with ensuring that all DoD data at rest is encrypted. This acknowledges that any data in an unencrypted state is data at risk. Furthermore, the DoD must have exclusive control of both the encryption keys and key management; this facilitates the DoD’s ability to perform unilateral, high-assurance data destruction without any cloud service provider cooperation [8].
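The per-subset key requirement for partial sanitization can be sketched in the same vein (again an illustration, not the DoD’s tooling): destroying one subset’s key renders that subset unrecoverable while leaving the rest readable.

    # Sketch of cryptographic erase with unique per-subset keys [8]:
    # sanitizing a subset means destroying its key, not scrubbing media.
    from cryptography.fernet import Fernet

    keys = {"subset_a": Fernet.generate_key(), "subset_b": Fernet.generate_key()}
    store = {name: Fernet(key).encrypt(b"data for " + name.encode())
             for name, key in keys.items()}

    del keys["subset_a"]  # cryptographic erase of subset_a

    # subset_a's ciphertext persists but is unrecoverable without its key;
    # subset_b is untouched and still decrypts normally.
    print(Fernet(keys["subset_b"]).decrypt(store["subset_b"]))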
However, cryptographic erase is not a panacea. This technology is an effective tool to resolve data spills due to human error, but it would likely prove ineffective against data spills initiated by malicious code. Cryptographic erase would be unable to contain a running process while data is still in use. Additionally, cryptographic erase is only effective in infrastructure as a service (and some platform as a service) cloud deployments when the consumer determines exactly how the data is stored. Although the DoD has been able to resort to cryptographic erase as a reactionary measure, private enterprise consumers now aware of the BYOK misnomer should focus their attention on prevention. Customer misconfiguration prevention begins when the consumer directly maps security controls to the logical layers for which they are explicitly responsible as a result of their service model selection.

3.1 Implementation of Security Controls
baseline [8]. Non-DoD entities also seeking security controls
When consumers transition from on-premises systems, that surpass federal government agency standards may refer to
they will find gaps within their existing security policies and CNSSI 1253 for more granular control options. Additionally,
how they interplay with the contracted terms and conditions of the Cloud Controls Matrix, published by the CSA, is a rational
an executed service-level agreement. A wide variability exists catalog to begin with because it maps its controls side-by-side
among cloud service providers with respect to defined terms with many other control catalogs for easy comparison.
and related metrics [17]. Consumers should focus on the
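As an illustration of the SANS triage [9], the sketch below tags controls with their tier and surfaces the cloud-negative ones first. Only logging, boundary defense, and incident response management come from the paper; the remaining assignments are placeholders.

    # Illustrative SANS-style triage [9]; only the three cloud-negative
    # controls are from the paper, the other assignments are placeholders.
    CONTROL_TIERS = {
        "logging": "cloud-negative",
        "boundary defense": "cloud-negative",
        "incident response management": "cloud-negative",
        "inventory of authorized assets": "cloud-neutral",        # placeholder
        "continuous vulnerability assessment": "cloud-positive",  # placeholder
    }

    # Direct the security architect's attention to the hardest tier first.
    review_first = [name for name, tier in CONTROL_TIERS.items()
                    if tier == "cloud-negative"]
    print("cloud-negative controls to address:", review_first)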
NIST 800-53 is heralded as an exhaustive set of security controls. However, the first revision of NIST 800-53, published in December 2006, predates widespread cloud adoption and was better suited to on-premises environments. In response, FedRAMP (the Federal Risk and Authorization Management Program), a 2011 federal policy, details the minimally required security authorization procedures with which an agency must comply when engaging with a cloud service provider for contracted cloud services. FedRAMP was specifically drafted to direct federal cloud computing acquisitions, and its goal was to accelerate adoption of cloud services and enforce standardized cybersecurity requirements government-wide. Cloud requirements for the DoD exceed requirements for other federal government agencies; for that reason, the DoD issued the Cloud Computing Security Requirements Guide [8], which describes FedRAMP+. FedRAMP+ adds DoD-specific security controls to fulfill the DoD’s mission requirements. FedRAMP+ is the cloud-computing customized approach to NIST 800-53 security controls. These controls “were selected primarily because they address issues such as the Advanced Persistent Threat (APT) and/or Insider Threat, and because the DoD … must categorize its systems in accordance with CNSSI 1253, beginning with its baselines, and then tailoring as needed” [8]. CNSSI 1253 is the Committee on National Security Systems Instruction No. 1253, Security Categorization and Control Selection for National Security Systems [18]. A comparison of security controls indicates that 32 CNSSI 1253 controls were added to the NIST SP 800-53 moderate baseline and 88 NIST 800-53 moderate controls were subtracted from the CNSSI 1253 moderate baseline [8]. Non-DoD entities also seeking security controls that surpass federal government agency standards may refer to CNSSI 1253 for more granular control options. Additionally, the Cloud Controls Matrix, published by the CSA, is a rational catalog to begin with because it maps its controls side-by-side with many other control catalogs for easy comparison.
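The baseline comparison is ordinary set arithmetic, as the toy sketch below shows. The control identifiers are invented for illustration; the actual CNSSI 1253 versus NIST 800-53 moderate comparison yields 32 additions and 88 subtractions [8].

    # Toy illustration of comparing control baselines with set operations.
    # Control IDs here are invented, not the real baseline memberships.
    nist_moderate = {"AC-2", "AU-6", "IR-4", "SC-7", "SI-4"}
    cnssi_moderate = {"AC-2", "AU-6", "IR-4", "SC-7", "AC-23"}

    added = cnssi_moderate - nist_moderate       # in CNSSI 1253, not in NIST
    subtracted = nist_moderate - cnssi_moderate  # in NIST, dropped by CNSSI
    print("added:", added, "subtracted:", subtracted)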
4 Transformational Migration

Ultimately, there is a viable solution for the challenges that migrating organizations face when transitioning to a robust, secure cloud environment. However, the solution will require those organizations to reorganize people and processes to minimize the existing gaps between how traditional applications operate and how cloud computing applications are configured. It will also require organizations to incorporate broad uses of encryption, digital forensic incident-response processes tailored to cloud architectures, practicable workarounds that address cloud-negative security controls, and continuous mandatory cloud training. Transformational migration accounts for these requirements by better aligning processes with how the cloud actually functions.
Transformational migration mandates the collocation of relevant data sets through secure application programming interface calls. Additionally, it supports extending the perimeter from the network boundary to include the boundary of specific chunks of data. Extending the perimeter enables the migrating organization to leverage metadata tagging to administer stricter enforcement of file authorizations and legal compliance. Transformational migration mandates security through the complete data security lifecycle: creating, storing, processing, sharing, archiving, and destroying [6].
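A hedged sketch of metadata tagging for this data-boundary enforcement appears below. The tag fields and roles are hypothetical; the lifecycle list follows the sequence cited from [6].

    # Hypothetical data-chunk tagging for boundary-of-data enforcement;
    # lifecycle phases follow the sequence cited from [6].
    LIFECYCLE = ("create", "store", "process", "share", "archive", "destroy")

    chunk = {
        "payload": b"relevant data set",
        "tags": {"classification": "sensitive", "allowed_roles": {"analyst"}},
    }

    def authorized(chunk: dict, role: str) -> bool:
        """Enforce file-level authorization from the chunk's own metadata."""
        return role in chunk["tags"]["allowed_roles"]

    print(authorized(chunk, "analyst"))  # True
    print(authorized(chunk, "intern"))   # False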
“Deliver Uncompromised” is a new strategy to address cybersecurity lapses that extend to DoD contractors [19]. Deliver Uncompromised encourages adding security assessment attainment levels to the awarding of contracts along with traditional cost and performance considerations. The new supply-chain risk management strategy holds that the cloud can contribute to protecting the DoD supply chain, specifically by encouraging contractors “to shift information systems and applications to qualified, secure cloud service providers” [19]. This strategy can also be applied to non-DoD supply chains.
5 Conclusion

Transformational migration is a strategy to overcome the well-worn pattern of human misunderstanding that largely drives cloud misconfigurations, which eventually become cloud data spills requiring a digital forensic incident response. A better understanding of how the service model relates to the intent of the application can reduce the risk of customer misconfigurations, which produces a more robust cybersecurity risk posture. Migrating organizations will also need to transition application professionals to a new dynamic: a transformational workforce with the dexterity to remediate issues at multiple cloud logical layers. The DevSecOps model, composed of both newly hired and retrained staff, is an integrated team of problem solvers with diverse experiences from the application development, engineering, and security disciplines. DevSecOps teams are tasked with developing and continuously tuning applications by addressing security at multiple layers and across the complete data life cycle. The DevSecOps model is endorsed by the Defense Innovation Board for its comprehensive resolution of existing misalignments between information security professionals and cloud technologies [3]. Using the recommendations of transformational migration as a guide, DevSecOps teams will be able to more effectively implement critical risk management controls while avoiding detrimental misconfigurations when migrating to the cloud.

The research presented in this paper is part of Michael Atadika’s thesis, conducted at and published for public release by the Naval Postgraduate School [20].

6 References

[1] Statista [Internet]. [date unknown]. Amazon Web Services: quarterly revenue 2014-2018. Hamburg (Germany): Statista; [cited 2019 Feb 22]. Available from: https://www.statista.com/statistics/250520/forecast-of-amazon-web-services-revenue

[2] Gregg, A [Internet]. 2017, Jun 1. Booz Allen Hamilton employee left sensitive passwords unprotected online. Washington (DC): Washington Post; [cited 2018 Mar 2]. Available from: https://www.washingtonpost.com/business/capitalbusiness/government-contractor-left-sensitive-passwords-unprotected-online/2017/06/01/916777c6-46f8-11e7-bcde-624ad94170ab_story.html?utm_term=.6cad14ff8b95

[3] [DoD] Department of Defense [Internet]. [updated 2018 Oct 2; cited 2019 Mar 14]. Defense innovation board do’s and don’ts for software. Washington (DC): Department of Defense. Available from: https://media.defense.gov/2018/Oct/09/2002049593/-1/-1/0/DIB_DOS_DONTS_SOFTWARE_2018.10.05.PD

[4] Bommadevara N., Del Miglio A., Jansen S [Internet]. 2018. Cloud adoption to accelerate IT modernization. New York (NY): McKinsey Digital; [cited 2018 May 18]. Available from: https://www.mckinsey.com/business-functions/digital-mckinsey/our-insights/cloud-adoption-to-accelerate-it-modernization

[5] Odell, L., Wagner, R., & Weir, T [Internet]. 2015. Department of Defense use of commercial cloud computing capabilities and services. Alexandria (VA): Institute for Defense Analyses; [cited 2018 Aug 23]. Available from: http://www.dtic.mil/dtic/tr/fulltext/u2/1002758.pdf

[6] [CSA] Cloud Security Alliance [Internet]. 2017. Security guidance: For critical areas of focus in cloud computing v4.0. Seattle (WA): Cloud Security Alliance; [cited 2018 Apr 10]. Available from: https://cloudsecurityalliance.org/guidance/#_overview

[7] van Eijk, P H J [Internet]. 2018. Cloud migration strategies and their impact on security and governance. Seattle (WA): Cloud Security Alliance; [cited 2019 Mar 14]. Available from: https://blog.cloudsecurityalliance.org/2018/06/29/cloud-migration-strategies-impact-on-security-governance/
[8] [DISA] Defense Information Systems Agency [Internet]. 2017, Mar 6. Department of Defense Cloud Computing Security Requirements Guide, version 1, release 3. Washington (DC): Department of Defense; [cited 2018 Apr 10]. Available from: https://www.complianceweek.com/sites/default/files/department_of_defense_cloud_computing_security_requirements_guide.pdf

[9] SANS Institute [Internet]. 2016. Implementing the critical security controls in the cloud. North Bethesda (MD): SANS Institute; [cited 2017 Oct 20]. Available from: https://www.sans.org/reading-room/whitepapers/critical/implementing-critical-security-controls-cloud-36725

[10] Clarke, G [Internet]. 2015, Apr 13. Self preservation is AWS security’s biggest worry, says gros fromage. London (UK): The Register; [cited 2017 Oct 9]. Available from: https://www.theregister.co.uk/2015/04/13/aws_security_sleepless_nights/

[11] [CIA] Central Intelligence Agency [Internet]. 2014, Dec 17. CIA creates a cloud: An interview with CIA’s chief information officer, Doug Wolfe, on cloud computing at the agency. Washington (DC): Central Intelligence Agency; [cited 2018 Mar 8]. Available from: https://www.cia.gov/news-information/featured-story-archive/2014-featured-story-archive/cia-creates-a-cloud.html

[12] [NCC FSWG] NIST Cloud Computing Forensic Science Working Group [Internet]. 2014. NIST cloud computing forensic science challenges, Draft NISTIR 8006. Gaithersburg (MD): NIST; [cited 2018 May 7]. Available from: https://csrc.nist.gov/publications/detail/nistir/8006/draft

[13] Grobauer, B., Walloschek, T., & Stöcker, E. 2011. Understanding cloud computing vulnerabilities. IEEE Security & Privacy [Internet]. [cited 2017 Oct 15]. 9(2), 50-57. Available from: https://doi.org/10.1109/MSP.2010.115

[14] [CSA] Cloud Security Alliance [Internet]. 2016. The treacherous 12: Cloud computing top threats in 2016. Seattle (WA): Cloud Security Alliance; [cited 2017 Nov 1]. Available from: https://downloads.cloudsecurityalliance.org/assets/research/top-threats/Treacherous-12_Cloud-Computing_Top-Threats.pdf

[15] Woods, J [Internet]. 2011. Five options for migrating applications to the cloud: Rehost, refactor, revise, rebuild or replace. Stamford (CT): Gartner; [cited 2018 Aug 22]. Available from: https://gartnerinfo.com/futureofit2011/MEX38L_A2%20mex38l_a2.pdf

[16] Rich, P [Internet]. 2017. SaaS encryption: lies, damned lies, and hard truths. Redmond (WA): Microsoft; [cited 2019 Mar 11]. Available from: https://channel9.msdn.com/Events/Ignite/Microsoft-Ignite-Orlando-2017/BRK2392

[17] [CIO & CAO] Chief Information Officer Council & Chief Acquisition Officers Council [Internet]. 2012. Creating effective cloud computing contracts for the federal government: Best practices for acquiring IT as a service. Washington (DC): Chief Information Officer Council & Chief Acquisition Officers Council; [cited 2018 Dec 07]. Available from: https://www.cio.gov/2012/02/24/cloud-computing-update-best-practices-for-acquiring-it-as-a-service/

[18] [CNSS] Committee on National Security Systems [Internet]. 2014. Security categorization and control selection for national security systems, CNSSI No. 1253. Washington (DC): Department of Defense; [cited 2018 May 21]. Available from: http://www.dss.mil/documents/CNSSI_No1253.pdf

[19] Nakashima, E., Sonne, P [Internet]. 2018, Aug 13. Pentagon is rethinking its multibillion-dollar relationship with U.S. defense contractors to boost supply chain security. Washington (DC): Washington Post; [cited 2018 Aug 13]. Available from: https://www.washingtonpost.com/world/national-security/the-pentagon-is-rethinking-its-multibillion-dollar-relationship-with-us-defense-contractors-to-stress-supply-chain-security/2018/08/12/31d63a06-9a79-11e8-b60b-1c897f17e185_story.html?utm_term=.60664aebdfb8

[20] Atadika, M. Applying U.S. military cybersecurity policies to cloud architectures [master’s thesis]. Monterey (CA): Naval Postgraduate School. 2018. 102p.
Author Index
Atadika, Michael - 15
Burke, Karen - 15
Cho, Hyeyoung - 3
Hahm, Jaegyoon - 3
Park, Ju-Won - 3
Ree, Chang Hee - 3
Rowe, Neil - 15
Shin, Min-Su - 3
Yang, Lan - 9