Computing in Science & Engineering | September/October 2016
cise.aip.org
www.computer.org/cise/
EDITOR IN CHIEF
George K. Thiruvathukal, Loyola Univ. Chicago, [email protected]

DEPARTMENT EDITORS
Books: Stephen P. Weppner, Eckerd College, [email protected]
Computing Prescriptions: Ernst Mucke, Identity Solutions, [email protected], and Francis Sullivan, IDA/Center for Computing Sciences, [email protected]
Computer Simulations: Barry I. Schneider, NIST, [email protected], and Gabriel A. Wainer, Carleton University, [email protected]
Education: Rubin H. Landau, Oregon State Univ., [email protected], and Scott Lathrop, University of Illinois, [email protected]
Leadership Computing: James J. Hack, ORNL, [email protected], and Michael E. Papka, ANL, [email protected]
Novel Architectures: Volodymyr Kindratenko, University of Illinois, [email protected], and Pedro Trancoso, Univ. of Cyprus, [email protected]
Scientific Programming: Konrad Hinsen, CNRS Orléans, [email protected], and Matthew Turk, NCSA, [email protected]
Software Engineering Track: Jeffrey Carver, University of Alabama, [email protected], and Damian Rouson, Sourcery Institute, [email protected]
The Last Word: Charles Day, [email protected]
Visualization Corner: Joao Comba, UFRGS, [email protected], and Daniel Weiskopf, Univ. Stuttgart, [email protected]
Your Homework Assignment: Nargess Memarsadeghi, NASA Goddard Space Flight Center, [email protected]

IEEE Signal Processing Society Liaison: Mrityunjoy Chakraborty, Indian Institute of Technology, [email protected]

CS MAGAZINE OPERATIONS COMMITTEE
Forrest Shull (chair), Brian Blake, Maria Ebling, Lieven Eeckhout, Miguel Encarnacao, Nathan Ensmenger, Sumi Helal, San Murugesan, Ahmad-Reza Sadeghi, Yong Rui, Diomidis Spinellis, George K. Thiruvathukal, Mazin Yousif, Daniel Zeng

CS PUBLICATIONS BOARD
David S. Ebert (VP for Publications), Alfredo Benso, Irena Bojanova, Greg Byrd, Min Chen, Robert Dupuis, Niklas Elmqvist, Davide Falessi, William Ribarsky, Forrest Shull, Melanie Tory

EDITORIAL OFFICE
Publications Coordinator: [email protected]
COMPUTING IN SCIENCE & ENGINEERING
c/o IEEE Computer Society
10662 Los Vaqueros Circle, Los Alamitos, CA 90720 USA
Phone +1 714 821 8380; Fax +1 714 821 4010
Websites: www.computer.org/cise or http://cise.aip.org/
SCIENCE AS A SERVICE
CLOUD COMPUTING
HYBRID SYSTEMS

For more information on these and other computing topics, please visit the IEEE Computer Society Digital Library at www.computer.org/csdl.
COLUMNS

4 From the Editors
Steven Gottlieb
The Future of NSF Advanced Computing Infrastructure Revisited

78 Computer Simulations
Christian D. Ott
Massive Computation for Understanding Core-Collapse Supernova Explosions

94 Leadership Computing
Laura Wolf
Multiyear Simulation Study Provides Breakthrough in Membrane Protein Research

98 Visualization Corner
Renato R.O. da Silva, Paulo E. Rauber, and Alexandru C. Telea
Beyond the Third Dimension: Visualizing High-Dimensional Data with Projections
Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author's or firm's opinion. Inclusion in Computing in Science & Engineering does not necessarily constitute endorsement by IEEE, the IEEE Computer Society, or the AIP. All submissions are subject to editing for style, clarity, and length. IEEE prohibits discrimination, harassment, and bullying. For more information, visit www.ieee.org/web/aboutus/whatis/policies/p9-26.html.

Circulation: Computing in Science & Engineering (ISSN 1521-9615) is published bimonthly by the AIP and the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York, NY 10016-5997; IEEE Computer Society Publications Office, 10662 Los Vaqueros Cir., Los Alamitos, CA 90720, phone +1 714 821 8380; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, D.C., 20036; AIP Circulation and Fulfillment Department, 1NO1, 2 Huntington Quadrangle, Melville, NY, 11747-4502. Subscribe to Computing in Science & Engineering by visiting www.computer.org/cise.

Reuse Rights and Reprint Permissions: Educational or personal use of this material is permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted version of IEEE-copyrighted material on their own web servers without permission, provided that the IEEE copyright notice and a full citation to the original work appear on the first screen of the posted copy. An accepted manuscript is a version that has been revised by the author to incorporate review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE. For more information, please go to: http://www.ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising, or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or [email protected]. Copyright © 2016 IEEE. All rights reserved.

Abstracting and Library Use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Dr., Danvers, MA 01923.

Postmaster: Send undelivered copies and address changes to Computing in Science & Engineering, 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage paid at New York, NY, and at additional mailing offices. Canadian GST #125634188. Canada Post Corporation (Canadian distribution) publications mail agreement number 40013885. Return undeliverable Canadian addresses to PO Box 122, Niagara Falls, ON L2E 6S8 Canada. Printed in the USA.
FROM THE EDITORS

The Future of NSF Advanced Computing Infrastructure Revisited

Steven Gottlieb | Indiana University

I am in Sunriver, Oregon, having just enjoyed three days at the annual Blue Waters
Symposium for Petascale Science and Beyond. It was a perfect opportunity to catch
up on all the wonderful science being done on Blue Waters, the National Science
Foundation's flagship supercomputer, located at the University of Illinois's National
Center for Supercomputing Applications (NCSA). To be honest, you can't really catch
up on all the science: most of the presentations are in parallel sessions with four
simultaneous talks. There were also very interesting tutorials to help attendees make the best
use of Blue Waters.
But what I’m most interested in discussing here isn’t the petascale science, but the
“beyond” issue. CiSE readers might recall that in the March/April 2015 issue, I used
this space for a column entitled “Whither the Future of NSF Advanced Computing
Infrastructure?” (vol. 17, no. 2, 2015, pp. 4–6). One focus of that piece was the in-
terim report of the Committee on Future Directions for NSF Advanced Computing
Infrastructure to Support US Science in 2017–2020. This committee was appointed
through the Computer Science and Telecommunications Board of the National Re-
search Council (NRC) and was expected to issue a final report in mid-2015 (in fact, it
was announced nearly a year later, in a 4 May 2016 NSF press release). I had a chance
to sit down with Bill Gropp (University of Illinois Urbana-Champaign), who cochaired
the committee with Robert Harrison (Stony Brook) and gave a very well-received after-
dinner talk at the symposium about the report.
Over the years, there has been a growing gap between requests for computer time
through NSF’s XSEDE (Extreme Science and Engineering Discovery Environment)
program and the availability of such time. Making matters worse, Blue Waters is sched-
uled to shut down in 2018. At the symposium, William Kramer announced that the
NCSA had requested a zero-cost extension to continue operations of Blue Waters until
sometime in 2019. Extension of Blue Waters operations would be a very positive devel-
opment. Unfortunately, the NSF hasn’t announced a plan to replace Blue Waters with
a more powerful computer, even in light of the NSF's role in the National Strategic
Computing Initiative announced by President Obama on 29 July 2015. There could be a
very serious shortage of computer time in the next few years that would broadly impact
science and engineering research in the US.
My previous article mentioned that the Division of Advanced Cyberinfrastructure
(ACI) is now part of the NSF’s Directorate of Computer & Information Science & Engi-
neering (CISE). Previously, the Office of Cyberinfrastructure reported directly to the NSF
director. The NSF has asked for comments on the impact of this change, but the deadline
is 30 June, well before you’ll see this column. The NSF’s request for comments was a major
topic of conversation in an open meeting at the symposium held by NCSA Director Ed
Seidel. I plan to let the NSF know that I think it’s essential to go back to the previous ar-
rangement: scientific computing isn’t part of computer science, and it’s very important
that the people at the NSF planning for supercomputing be at the same level as the science
directorates in order to get direct input on each directorate’s computing needs.
The committee report I mentioned earlier has seven recommendations, most of which
contain subpoints (see the “Committee Recommendations” sidebar for more information).
The recommendations are organized into four main issues: maintaining US leadership in
science and engineering, ensuring that resources meet community needs, helping compu-
tational scientists deal with the rapid changes in high-end computers, and sustaining the
Committee Recommendations
The full report is at http://tinyurl.com/advcomp17-20; the text here is a verbatim, unedited excerpt, reprinted with permission from
“Future Directions for NSF Advanced Computing Infrastructure to Support US Science and Engineering in 2017-2020,” Nat’l
Academy of Sciences, 2015 (doi:10.17226/21886).
A: Position US for continued leadership in science and engineering
Recommendation 1. NSF should sustain and seek to grow its investments in advanced computing—to include hardware and
services, software and algorithms, and expertise—to ensure that the nation’s researchers can continue to work at frontiers of science
and engineering.
Recommendation 1.1. NSF should ensure that adequate advanced computing resources are focused on systems and services
that support scientific research. In the future, these requirements will be captured in its road maps.
Recommendation 1.2. Within today’s limited budget envelope, this will mean, first and foremost, ensuring that a predominant
share of advanced computing investments be focused on production capabilities and that this focus not be diluted by undertaking too
many experimental or research activities as part of NSF’s advanced computing program.
Recommendation 1.3. NSF should explore partnerships, both strategic and financial, with federal agencies that also provide
advanced computing capabilities as well as federal agencies that rely on NSF facilities to provide computing support for their
grantees.
Recommendation 2. As it supports the full range of science requirements for advanced computing in the 2017-2020 timeframe,
NSF should pay particular attention to providing support for the revolution in data driven science along with simulation. It should
ensure that it can provide unique capabilities to support large-scale simulations and/or data analytics that would otherwise be unavail-
able to researchers and continue to monitor the cost-effectiveness of commercial cloud services.
Recommendation 2.1. NSF should integrate support for the revolution in data-driven science into NSF’s strategy for advanced
computing by (a) requiring most future systems and services and all those that are intended to be general purpose to be more data-
capable in both hardware and software and (b) expanding the portfolio of facilities and services optimized for data-intensive as well as
numerically-intensive computing, and (c) carefully evaluating inclusion of facilities and services optimized for data-intensive comput-
ing in its portfolio of advanced computing services.
Recommendation 2.2. NSF should (a) provide one or more systems for applications that require a single, large, tightly coupled
parallel computer and (b) broaden the accessibility and utility of these large-scale platforms by allocating high-throughput as well as
high-performance work flows to them.
Recommendation 2.3. NSF should (a) eliminate barriers to cost-effective academic use of the commercial cloud and (b) carefully
evaluate the full cost and other attributes (e.g., productivity and match to science work flows) of all services and infrastructure mod-
els to determine whether such services can supply resources that meet the science needs of segments of the community in the most
effective ways.
B. Ensure resources meet community needs
Recommendation 3. To inform decisions about capabilities planned for 2020 and beyond, NSF should collect community re-
quirements and construct and publish roadmaps to allow NSF to set priorities better and make more strategic decisions about ad-
vanced computing.
Recommendation 3.1. NSF should inform its strategy and decisions about investment trade-offs using a requirements analysis
that draws on community input, information on requirements contained in research proposals, allocation requests, and foundation-
wide information gathering.
Recommendation 3.2. NSF should construct and periodically update roadmaps for advanced computing that reflect these re-
quirements and anticipated technology trends to help NSF set priorities and make more strategic decisions about science and engi-
neering and to enable the researchers that use advanced computing to make plans and set priorities.
Recommendation 3.3. NSF should document and publish on a regular basis the amount and types of advanced computing capa-
bilities that are needed to respond to science and engineering research opportunities.
Recommendation 3.4. NSF should employ this requirements analysis and resulting roadmaps to explore whether there are more
opportunities to use shared advanced computing facilities to support individual science programs such as Major Research Equipment
and Facilities Construction projects.
Recommendation 4. NSF should adopt approaches that allow investments in advanced computing hardware acquisition, comput-
ing services, data services, expertise, algorithms, and software to be considered in an integrated manner.
Recommendation 4.1. NSF should consider requiring that all proposals contain an estimate of the advanced computing
resources required to carry out the proposed work and creating a standardized template for collection of the information as one step
of potentially many toward more efficient individual and collective use of these finite, expensive, shared resources. (This information
would also inform the requirements process.)
Recommendation 4.2. NSF should inform users and program managers of the cost of advanced computing allocation requests in
dollars to illuminate the total cost and value of proposed research activities.
C. Aid the scientific community in keeping up with the revolution in computing
Recommendation 5. NSF should support the development and maintenance of expertise, scientific software, and software tools
that are needed to make efficient use of its advanced computing resources.
Recommendation 5.1. NSF should continue to develop, sustain, and leverage expertise in all programs that supply or use
advanced computing to help researchers use today’s advanced computing more effectively and prepare for future machine
architectures.
Recommendation 5.2. NSF should explore ways to provision expertise in more effective and scalable ways to enable
researchers to make their software more efficient; for instance, by making more pervasive the XSEDE (Extreme Science and
Engineering Discovery Environment) practice that permits researchers to request an allocation of staff time along with computer
time.
Recommendation 5.3. NSF should continue to invest in and support scientific software and update the software to support
new systems and incorporate new algorithms, recognizing that this work is not primarily a research activity but rather is support of
software infrastructure.
Recommendation 6. NSF should also invest modestly to explore next-generation hardware and software technologies to explore
new ideas for delivering capabilities that can be used effectively for scientific research, tested, and transitioned into production
where successful. Not all communities will be ready to adopt radically new technologies quickly, and NSF should provision advanced
computing resources accordingly.
D. Sustain the infrastructure for advanced computing
Recommendation 7. NSF should manage advanced computing investments in a more predictable and sustainable way.
Recommendation 7.1. NSF should consider funding models for advanced computing facilities that emphasize continuity of
support.
Recommendation 7.2. NSF should explore and possibly pilot the use of a special account (such as that used for Major Research
Equipment and Facilities Construction) to support large-scale advanced computing facilities.
Recommendation 7.3. NSF should consider longer-term commitments to center-like entities that can provide advanced
computing resources and the expertise to use them effectively in the scientific community.
Recommendation 7.4. NSF should establish regular processes for rigorous review of these center-like entities and not just their
individual procurements.
infrastructure for advanced computing. When I asked Gropp about the report’s main
message, he told me that “the community needs to get involved for the NSF to imple-
ment the recommendations.” That’s because we’ll need to do a better job of describing
our needs and our scientific plans. Gropp emphasized that it’s important to distinguish
between our wants and our needs. For example, Recommendation 3 calls on the NSF
to collect information on the needs of the scientific community for advanced comput-
ing—one possibility is that all grant applications will need to supply information about
their computing needs in a standard form (see recommendation 4.1).
The report also emphasizes that data-driven science needs to be supported along
with simulation. The latter has often driven machine design, but there are many inter-
esting scientific problems for which access to large amounts of data is the bottleneck,
and there are also now many simulations that produce large volumes of data that must
be read, stored, and visualized. It will be best to purchase computers that can support
both requirements well.
“For many years, we have been blessed with rapid growth in computing power,”
Gropp stated, but in referring to stagnant clock speeds, he noted, “that period is over.”
New supercomputers are going to employ new technologies that will require new pro-
gramming techniques to deal with the massive parallelism and deep memory hierar-
chies. Gropp quoted Ken Kennedy as saying that software transformations can take
10 years to reach maturity. I note that my own community is eight years into GPU
code development and three to four years into development for Intel Xeon Phi. The ef-
fort is continuing in anticipation of the next generation of supercomputers. The report
strongly emphasizes that the NSF must help users to adapt their codes (Recommenda-
tion 5 and its subpoints).
Before my conversation with Gropp ended, I asked him about the delay from the
original mid-2015 target date for the report’s release. He mentioned the “grueling
review process” and the need to respond to every comment. However, he said there
were many thoughtful, useful comments and that responding to them made the report
much better. Finally, Gropp left me with the thought that “Writing the report is not
the end, it is the beginning.” I certainly hope that my fellow CiSE readers will take that
to heart and get involved with helping the NSF plan for our needs for advanced com-
puting. You can find the entire report at http://tinyurl.com/advcomp17-20.
Science as a Service
Ravi Madduri and Ian Foster | Argonne National Laboratory and the University of Chicago
Researchers are increasingly taking advantage of advances in cloud computing to make data analysis
available as a service. As we see from the articles in this special issue, the science-as-a-service
approach has many advantages: it accelerates the discovery process via a separation of concerns,
with computational experts creating, managing, and improving services, and researchers using
them for scientific discovery. We also see that making scientific software available as a service can lower
costs and pave the way for sustainable scientific software. In addition, science services let users share their
analyses, discover what others have done, and provide infrastructure for reproducing results, reanalyzing
data, backward tracking rare or interesting events, performing uncertainty analysis, and verifying and
validating experiments. Generally speaking, this approach lowers barriers to entry to large-scale analysis
for theorists, students, and nonexperts in high-performance computing. It permits rapid hypothesis test-
ing and exploration as well as serving as a valuable tool for teaching.
Computation and automation are vital in many scientific domains. For example, the decreased
sequencing costs in biology have transformed the field from a data-limited to a computationally limited
discipline. Increasingly, researchers must process hundreds of sequenced genomes to determine the
statistical significance of variants. When datasets were small, they could be analyzed on PCs in modest
amounts of time: a few hours or perhaps overnight. However, this approach does not scale to large, next-
generation sequencing datasets—instead, researchers require high-performance computers and parallel
algorithms if they are to analyze their data in a timely manner. By leveraging services such as the cloud-based Globus Genomics, researchers can analyze hundreds of genomes in parallel using just a browser.

In this special issue, we present three great examples of efforts in science as a service. In "A Case for Data Commons: Toward Data Science as a Service," Robert L. Grossman and his colleagues present a flexible computational infrastructure that supports various activities in the data life cycle such as discovery, storage, analysis, and long-term archiving. The authors present a vision to create a data commons and discuss challenges that result from a lack of appropriate standards.

In "MRICloud: Delivering High-Throughput MRI Neuroinformatics as Cloud-Based Software as a Service," Susumu Mori and colleagues present MRICloud, a science as a service for large-scale analysis of brain images. This article illustrates how researchers can make novel analysis capabilities available to the scientific community at large by outsourcing key capabilities such as high-performance computing.
Finally, in "WaveformECG: A Platform for Visualizing, Annotating, and Analyzing ECG Data," Raimond Winslow and colleagues present a service for analyzing electrocardiogram data that lets researchers upload time-series ECG data and provides analysis capabilities to enable discovery of the underlying aspects of heart disease. WaveformECG is accessible through a browser and provides interactive analysis, visualization, and annotation of waveforms using standard medical terminology.

Ravi Madduri's research interests include high-performance computing, workflow technologies, and distributed computing. He has an MS in computer science from the Illinois Institute of Technology. Contact him at [email protected].

Ian Foster is director of the Computation Institute, a joint institute of the University of Chicago and Argonne National Laboratory. He is also an Argonne Senior Scientist and Distinguished Fellow and the Arthur Holly Compton Distinguished Service Professor of Computer Science. His research deals with distributed, parallel, and data-intensive computing technologies, and innovative applications of those technologies to scientific problems in such domains as climate change and biomedicine. Foster received a PhD in computer science from Imperial College, United Kingdom.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.
SCIENCE AS A SERVICE

A Case for Data Commons: Toward Data Science as a Service

Robert L. Grossman, Allison Heath, Mark Murphy, and Maria Patterson | University of Chicago
Walt Wells | Center for Computational Science Research
Data commons collocate data, storage, and computing infrastructure with core services and com-
monly used tools and applications for managing, analyzing, and sharing data to create an interoper-
able resource for the research community. An architecture for data commons is described, as well
as some lessons learned from operating several large-scale data commons.
With the amount of available scientific data being far larger than the ability of the research com-
munity to analyze it, there’s a critical need for new algorithms, software applications, software
services, and cyberinfrastructure to support data throughout its life cycle in data science. In
this article, we make a case for the role of data commons in meeting this need. We describe the
design and architecture of several data commons that we’ve developed and operated for the research com-
munity in conjunction with the Open Science Data Cloud (OSDC), a multipetabyte science cloud that the
nonprofit Open Commons Consortium (OCC) has managed and operated since 2009.1 One of the distin-
guishing characteristics of the OSDC is that it interoperates with a data commons containing over 1 Pbyte
of public research data through a service-based architecture. This is an example of what is sometimes called
“data as a service,” which plays an important role in some science-as-a-service frameworks.
There are at least two definitions for science as a service. The first is analogous to the software-as-a-service2
model, in which instead of managing data and software locally using your own storage and computing resourc-
es, you use the storage, computing, and software services offered by a service provider, such as a cloud service
provider (CSP). With this approach, instead of setting up his or her own storage and computing infrastructure
and installing the required software, a scientist uploads data to a CSP and uses preinstalled software for data
analysis. Note that a trained scientist is still required to run the software and analyze the data. Science as a service can also refer more generally to a service model that relaxes the requirement of needing a trained scientist to process and analyze data. With this service model, specific software and analysis tools are available for specific types of scientific data, which is uploaded to the science-as-a-service provider, processed using the appropriate pipelines, and then made available to the researcher for further analysis if required. Obviously these two definitions are closely connected in that a scientist can set up the required science-as-a-service framework, as in the first definition, so that less-trained technicians can use the service to process their research data, as in the second definition. By and large, we focus on the first definition in this article.

There are various science-as-a-service frameworks, including variants of the types of clouds formalized by the US National Institute of Standards and Technology (infrastructure as a service, platform as a service, and software as a service),2 as well as some more specialized services that are relevant for data science (data science support services and data commons):

■ data science infrastructure and platform services, in which virtual machines (VMs), containers, or platform environments containing commonly used applications, tools, services, and datasets are made available to researchers (the OSDC is an example);
■ data science software as a service, in which data is uploaded and processed by one or more applications or pipelines and results are stored in the cloud or downloaded (general-purpose platforms offering data science as a service include Agave,3 as well as more specialized services, such as those designed to process genomics data);
■ data science support services, including data storage services, data-sharing services, data transfer services, and data collaboration services (one example is Globus4); and
■ data commons, in which data, data science computing infrastructure, data science support services, and data science applications are collocated and available to researchers.

Data Commons
When we write of a "data commons," we mean cyberinfrastructure that collocates data, storage, and computing infrastructure with commonly used tools for analyzing and sharing data to create an interoperable resource for the research community.

In the discussion below, we distinguish among several stakeholders involved in data commons: the data commons service provider (DCSP), which is the entity operating the data commons; the data contributor (DC), which is the organization or individual providing the data to the DCSP; and the data user (DU), which is the organization or individual accessing the data. (Note that there's often a fourth stakeholder: the DCSP associated with the researcher accessing the data.) In general, there will be an agreement, often called the data contributors agreement (DCA), governing the terms by which the data is managed by the DCSP and the researchers accessing the data, as well as a second agreement, often called the data access agreement (DAA), governing the terms of any researcher who accesses the data.

As we describe in more detail later, we've built several data commons since 2009. Based on this experience, we've identified six main requirements that, if followed, would enable data commons to interoperate with each other, science clouds,1 and other cyberinfrastructure supporting science as a service (the first three requirements are illustrated in the sketch after this list):

■ Requirement 1, permanent digital IDs. The data commons must have a digital ID service, and datasets in the data commons must have permanent, persistent digital IDs. Associated with digital IDs are access controls specifying who can access the data and metadata specifying additional information about the data. Part of this requirement is that data can be accessed from the data commons through an API by specifying its digital ID.
■ Requirement 2, permanent metadata. There must be a metadata service that returns the associated metadata for each digital ID. Because the metadata can be indexed, this provides a basic mechanism for the data to be discoverable.
■ Requirement 3, API-based access. Data must be accessed by an API, not just by browsing through a portal. Part of this requirement is that a metadata service can be queried to return a list of digital IDs that can then be retrieved via the API. For those data commons that contain controlled access data, another component of the requirement is that there's an authentication and authorization service so that users can first be authenticated and the data commons can check whether they are authorized to have access to the data.
■ Requirement 4, data portability. The data must be portable in the sense that a dataset in a data
commons can be transported to another data commons and be hosted there. In general, if data access is through digital IDs (versus referencing the data's physical location), then software that references data shouldn't have to be changed when data is rehosted by a second data commons.
■ Requirement 5, data peering. By "data peering," we mean an agreement between two data commons service providers to transfer data at no cost so that a researcher at data commons 1 can access data commons 2. In other words, the two data commons agree to transport research data between them with no access charges, no egress charges, and no ingress charges.
■ Requirement 6, pay for compute. Because, in practice, researchers' demand for computing resources is larger than available computing resources, computing resources must be rationed, either through allocations or by charging for their use. Notice the asymmetry in how a data commons treats storage and computing infrastructure. When data is accepted into a data commons, there's a commitment to store and make it available for a certain period of time, often indefinitely. In contrast, computing over data in a data commons is rationed in an ongoing fashion, as is the working storage and the storage required for derived data products, either by providing computing and storage allocations for this purpose or by charging for them. For simplicity, we refer to this requirement as "pay for computing," even though the model is more complicated than that.

Although very important for many applications, we view other services, such as those for providing data provenance,5 data replication,6 and data collaboration,7 as optional and not core services.
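To make the first three requirements concrete, the sketch below shows how a client might resolve a dataset's digital ID first to its metadata and then to the data itself through a REST API. It's a minimal sketch under assumptions of our own: the host commons.example.org, the /metadata and /ids endpoint paths, and the urls field are hypothetical stand-ins for whatever ID and metadata services a particular data commons actually exposes.

```python
import requests

COMMONS = "https://commons.example.org"  # hypothetical data commons API host

def fetch_dataset(digital_id: str, token: str) -> bytes:
    """Resolve a permanent digital ID to metadata and data (Requirements 1-3)."""
    headers = {"Authorization": f"Bearer {token}"}  # auth for controlled-access data

    # Requirement 2: a metadata service returns the metadata for each digital ID.
    meta = requests.get(f"{COMMONS}/metadata/{digital_id}", headers=headers)
    meta.raise_for_status()
    print("dataset metadata:", meta.json())

    # Requirement 1: the digital ID, not a hard-coded physical URL, locates the
    # data, so this client keeps working if the dataset is rehosted (Requirement 4).
    record = requests.get(f"{COMMONS}/ids/{digital_id}", headers=headers)
    record.raise_for_status()
    url = record.json()["urls"][0]  # hypothetical field naming

    # Requirement 3: the bytes themselves are retrieved through the API.
    data = requests.get(url, headers=headers)
    data.raise_for_status()
    return data.content
```

A commons that also meets Requirement 4 can migrate the dataset to another commons without breaking such a client, since only the ID service's location records change.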
OSDC and OCC Data Commons
The OSDC is a multipetabyte science cloud that serves the research community by collocating a multidisciplinary data commons containing approximately 1 Pbyte of scientific data with cloud-based computing, high-performance data transport services, and VM images and shareable snapshots containing common data analysis pipelines and tools.

The OSDC is designed to provide a long-term persistent home for scientific data, as well as a platform for data-intensive science, allowing new types of data-intensive algorithms to be developed, tested, and used over large sets of heterogeneous scientific data. Recently, OSDC researchers have logged about two million core hours each month, which translates to more than US$800,000 worth of cloud computing services (if purchased through Amazon Web Services' public cloud). This equates to more than 12,000 core hours per user, or a 16-core machine continuously used by each researcher on average.

OSDC researchers used a total of more than 18 million core hours in 2015. We currently target operating OSDC computing resources at approximately 85 percent of capacity, and storage resources at 80 percent of capacity. Given these constraints, we can determine how many researchers to support and what size allocations to provide them. Because the OSDC specializes in supporting data-intensive research projects, we've chosen to target researchers who need larger-scale resources (relative to our total capacity) for data-intensive science. In other words, rather than support more researchers with smaller allocations, we support fewer researchers with larger allocations. Table 1 shows the number of times researchers exceeded the indicated number of core hours in a single month during 2015.

Table 1. Data-intensive users supported by the Open Science Data Cloud.
No. core hours per month | No. users
20,000 | 120
50,000 | 34
100,000 | 23
200,000 | 5

The OSDC Community
The OSDC is developed and operated by the Open Commons Consortium, a nonprofit that supports the scientific community by operating data commons and cloud computing infrastructure to support scientific, environmental, medical, and healthcare-related research. OCC members and partners include universities (University of Chicago, Northwestern University, University of Michigan), companies (Yahoo, Cisco, Infoblox), US government agencies and national laboratories (NASA, NOAA), and international partners (Edinburgh University, University of Amsterdam, Japan's National Institute of Advanced Industrial Science and Technology). The OSDC is a joint project with the University of Chicago, which provides the OSDC's datacenter. Much of the support for the OSDC came from the Moore Foundation and from corporate donations.

The OSDC has a wide-reaching, multicampus, multi-institutional, interdisciplinary user base and has supported more than 760 research projects since its
inception. In 2015, 470 research groups from 54 universities in 14 countries received OSDC allocations. In a typical month (November 2015), 186 of these research groups were active. The most computationally intensive group projects in 2015 included projects around biological sciences and genomics research, analysis of Earth science satellite imagery data, analysis of text data in historical and scientific literature, and a computationally intensive project in sociology.

OCC Data Commons
The OCC operates several data commons for the research community.

OSDC data commons. We introduced our first data commons in 2009. It currently holds approximately 800 Tbytes of public open access research data, including Earth science data, biological data, social science data, and digital humanities data.

Matsu data commons. The OCC has collaborated with NASA since 2009 on Project Matsu, a data commons that contains six years of Earth Observing-1 (EO-1) data, with new data added daily, as well as selected datasets from other NASA satellites, including NASA's Moderate Resolution Imaging Spectrometer (MODIS) and the Landsat Global Land Surveys.

The OCC NOAA data commons. In April 2015, NOAA announced five data alliance partnerships (with Amazon, Google, IBM, Microsoft, and the OCC) that would have broad access to its data and help make it more accessible to the public. Currently, only a small fraction of the more than 20 Pbytes of data that NOAA has available in its archives is available to the public, but NOAA data alliance partners have broader access to it. The focus of the OCC data alliance is to work with the environmental research community to build an environmental data commons. Currently, the OCC NOAA data commons contains Nexrad data, with additional datasets expected in 2016.

National Cancer Institute's (NCI's) genomic data commons (GDC). Through a contract between the NCI and the University of Chicago and in collaboration with the OCC, we've developed a data commons for cancer data; the GDC contains genomic data and associated clinical data from NCI-funded projects. Currently, the GDC contains about 2 Pbytes of data, but this is expected to grow rapidly over the next few years.

Bionimbus protected data cloud. We also operate two private cloud computing platforms that are designed to hold human genomic and other sensitive biomedical data. These two clouds contain a variety of sensitive controlled-access biomedical data that we make available to the research community following the requirements of the relevant data access committees.

Common software stack. The core software stack for the various data commons and clouds described here is open source. Many of the components are developed by third parties, but some key services are developed and maintained by the OCC and other working groups. Although there are some differences between them, we try to minimize the differences between the software stacks used by the various data commons that we operate. In practice, as we develop new versions of the basic software stack, it usually takes a year or so until the changes can percolate throughout our entire infrastructure.

OSDC Design and Architecture
Figure 1 shows the OSDC's architecture. We are currently transitioning from version 2 of the OSDC software stack1 to version 3. Both are based on OpenStack8 for infrastructure as a service. The primary change made between version 2 and version 3 is that version 2 uses GlusterFS9 for storage, while version 3 uses Ceph10 for object storage in addition to OpenStack's ephemeral storage. This is a significant user-facing change that comes with some tradeoffs. Version 2 utilized a POSIX-compliant file system for user home directory (scratch and persistent) data storage, which provides command-line utilities familiar for most OSDC users. Version 3's object storage, however, provides the advantage of an increased level of interoperability, as Ceph's object storage has an interface compatible with a large subset of Amazon's S3 RESTful API in addition to OpenStack's API.
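Because Ceph exposes this S3-compatible interface, standard S3 tooling generally works against an OSDC-style object store. Here's a minimal sketch using the boto3 library; the endpoint URL, credentials, bucket, and object names are hypothetical placeholders, not actual OSDC values.

```python
import boto3

# Hypothetical Ceph RADOS Gateway endpoint and credentials.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objectstore.example.org",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Persist a derived result to longer-term object storage...
s3.upload_file("variants.vcf", "my-project-bucket", "results/variants.vcf")

# ...and read it back later, possibly from a different VM.
obj = s3.get_object(Bucket="my-project-bucket", Key="results/variants.vcf")
print(obj["ContentLength"], "bytes")
```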
The OSDC has a portal called the Tukey por-
Bionimbus protected data cloud. We also operate two tal, which provides a front-end Web portal inter-
private cloud computing platforms that are designed face for users to access, launch, and manage VMs
Figure 1. The Open Science Data Cloud (OSDC) architecture. The various data commons that we have developed and operate share an architecture, consisting of object-based storage, virtual machines (VMs), and containers for on-demand computing, and core services for digital IDs, metadata, data access, and access to computing resources, all of which are available through RESTful APIs. The data access and data submission portals are applications built using these APIs.
The Tukey portal interfaces with the Tukey middleware, which provides a secure authentication layer and interface between various software stacks. The OSDC uses federated login for authentication so that academic institutions with InCommon, CANARIE, or the UK Federation can use those credentials. We've worked with 145 academic universities and research institutions to release the appropriate attributes for authentication. We also support Gmail and Yahoo logins, but only for approved projects when other authentication options aren't available.

We instrument all the resources that we operate so that we can meter and collect the data required for accounting and billing each user. We use Salesforce.com, one of the components of the OSDC that isn't open source, to send out invoices. Even when computing resources are allocated and no payment is required, we've found that receipt of these invoices promotes responsible usage of OSDC community resources. We also operate an interactive support ticketing system that tracks user support requests and system team responses for technical questions. Collecting this data lets us track usage statistics and build a comprehensive assessment of how researchers use our services.

While adding to our resources, we've developed an infrastructure automation tool called Yates to simplify bringing up new computing, storage, and networking infrastructure. We also try to automate as much of the security required to operate the OSDC as is practical.

The core OSDC software stack is open source, enabling interested parties to set up their own science cloud or data commons. The core software stack consists of third-party, open source software, such as OpenStack and Ceph, as well as open source software developed by the OSDC community. The latter is licensed under the open source Apache license. The OSDC does use some proprietary software, such as Salesforce.com to do the accounting and billing, as mentioned earlier.
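The metering-to-invoice loop described above reduces to a simple aggregation: sum each user's metered core hours for the billing period and price the total, even when the charge is ultimately waived against an allocation. A toy sketch of that aggregation follows, with made-up record fields and an illustrative rate; the OSDC's actual meters and billing pipeline are more involved.

```python
from collections import defaultdict

RATE_PER_CORE_HOUR = 0.05  # illustrative rate in US dollars, not an OSDC price

# Hypothetical metering records (user, cores, hours) from instrumented resources.
usage_records = [
    ("alice", 16, 200.0),
    ("bob", 8, 50.0),
    ("alice", 4, 30.0),
]

def monthly_invoices(records):
    """Aggregate metered usage into per-user core-hour totals and dollar amounts."""
    core_hours = defaultdict(float)
    for user, cores, hours in records:
        core_hours[user] += cores * hours
    return {user: (total, total * RATE_PER_CORE_HOUR)
            for user, total in core_hours.items()}

for user, (total, cost) in monthly_invoices(usage_records).items():
    print(f"{user}: {total:,.0f} core hours -> ${cost:,.2f}")
```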
OCC Digital ID and Metadata Services
The digital ID (DID) service is accessible via an API that generates digital IDs, assigns key-value attributes to digital IDs, and returns key-value attributes associated with digital IDs. We also developed a metadata service that's accessible via an API and can assign and retrieve metadata associated with a digital ID. Users can also edit metadata associated with digital IDs if they have write access to it. Due to different release schedules, there are some differences in the digital ID and metadata services between several of the data commons that we operate, but over time, we plan to converge these services.

Persistent Identifier Strategies
Although the necessity of assigning digital IDs to data is well recognized,11,12 there isn't yet a widely accepted service for this purpose, especially for large datasets.13 This is in contrast to the generally accepted use of digital object identifiers (DOIs) or handles for referencing digital publications. An alternative to a DOI is an archival resource key (ARK), a Uniform Resource Locator (URL) that's also a multipurpose identifier for information objects of any type.14,15 In practice, DOIs and ARKs are generally used to assign IDs to datasets, with individual communities sometimes developing their own IDs. DataCite is an international consortium that manages DOIs for datasets and supports services for finding, accessing, and reusing data.16 There are also services such as EZID that support both DOIs and ARKs.17

Given the challenges the community is facing in coming to a consensus about which digital IDs to use, our approach has been to build an open source digital ID service that can support multiple digital IDs, support "suffix pass-through,"13 and that can scale to large datasets.

Digital IDs
From the researcher viewpoint, the need for digital IDs associated with datasets is well appreciated.18,19 Here, we discuss some of the reasons that digital IDs are important for a data commons from an operational viewpoint.

First, with digital IDs, data can be moved from one physical location or storage system within a data commons to another without the need to change any code that references the data. As the amount of data grows, moving data between zones within a data commons or between storage systems becomes more and more common, and digital IDs allow this to take place without impeding researchers.

Second, digital IDs are an important component of the data portability requirement. More specifically, datasets can be moved between data commons, and, again, researchers don't need to change their code. In practice, datasets can be migrated over time, with the digital IDs' references updated as the migration proceeds.

Signpost is the digital ID service for the OSDC. Instead of using a hard-coded URL, the primary way to access managed data via the OSDC is through a digital ID. Signpost is an implementation of this concept via JavaScript Object Notation (JSON) documents. The Signpost digital ID service integrates a mutable ID that's assigned to the data with an immutable hash-based ID that's computed from the data. Both IDs are accessible through a REST API interface. With this approach, data contributors can make updates to the data and retain the same ID, while the data commons service provider can use the hash-based ID to facilitate data management. To prevent unauthorized editing of digital IDs, an access control list (ACL) is kept by each digital ID specifying the read/write permissions for different users and groups.

User-defined identities are flexible, can be of any format (including ARKs and DOIs), and provide a layer of human readability. They map to hashes of the identified data objects, with the bottom layer utilizing hash-based identifiers, which guarantee data immutability, allow for identification of duplicated data via hash collisions, and allow for verification upon retrieval. These map to known locations of the identified data.
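The two-layer scheme — a mutable, human-readable ID on top of an immutable content hash that in turn maps to physical locations — can be summarized in a few lines. This is a sketch of the idea only, with class and method names of our own invention, not Signpost's actual REST API or JSON document format.

```python
import hashlib

class DigitalIDIndex:
    """Two-layer index: user-defined ID -> content hash -> physical locations."""

    def __init__(self):
        self.user_ids = {}   # mutable, human-readable ID (ARK, DOI, ...) -> sha256
        self.locations = {}  # sha256 digest -> URLs currently holding the bytes

    def register(self, user_id: str, data: bytes, url: str) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.user_ids[user_id] = digest  # updating the data later just remaps the ID
        self.locations.setdefault(digest, []).append(url)
        return digest

    def resolve(self, user_id: str) -> list:
        """Return the data's current locations; no URL is ever hard-coded."""
        return self.locations[self.user_ids[user_id]]

    def verify(self, user_id: str, data: bytes) -> bool:
        """Check retrieved bytes against the immutable hash-based ID."""
        return hashlib.sha256(data).hexdigest() == self.user_ids[user_id]
```

Because identical bytes always hash to the same digest, datasets registered under different user-defined IDs collide at the hash layer, which is how duplicated data can be identified.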
Metadata Service
The OSDC metadata service, Sightseer, lets users create, modify, and access searchable JSON documents containing metadata about digital IDs. The primary data can be accessed using Signpost and the digital ID. At its core, Sightseer provides no restrictions on the JSON documents it can store. However, it has the ability to specify metadata types and associate them with JSON schemas. This helps prevent unexpected errors in metadata with defined schemas. Sightseer has similar abilities as Signpost to provide ACLs to specify users that have write/read access to the specific JSON document.
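A sketch of the typed-metadata idea: documents are free-form JSON by default, but a document stored under a named type is checked against that type's schema on write. The validation below is deliberately minimal (required keys only), and all names are illustrative rather than Sightseer's actual interface.

```python
class MetadataStore:
    """Free-form JSON metadata keyed by digital ID, with optional typed schemas."""

    def __init__(self):
        self.schemas = {}  # type name -> set of required top-level keys
        self.docs = {}     # digital ID -> (type name or None, metadata dict)

    def define_type(self, name, required_keys):
        self.schemas[name] = set(required_keys)

    def put(self, digital_id, metadata, doc_type=None):
        if doc_type is not None:
            missing = self.schemas[doc_type] - metadata.keys()
            if missing:  # catch malformed metadata before it's stored and indexed
                raise ValueError(f"missing required keys: {sorted(missing)}")
        self.docs[digital_id] = (doc_type, metadata)

store = MetadataStore()
store.define_type("genomic-sample", ["assay", "reference_genome", "md5"])
store.put("ark:/99999/example-id",
          {"assay": "WGS", "reference_genome": "GRCh38", "md5": "9e107d9d..."},
          doc_type="genomic-sample")
```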
Case Studies
Two case studies illustrate some of the projects that can be supported with data commons.
Figure 2. A screenshot of part of the Namibia Flood Dashboard from 14 March 2014. This image shows water catchments (outlined
and colored regions) and a one-day flood potential forecast of the area from hydrological models using data from the Tropical Rainfall
Measuring Mission (TRMM), a joint space mission between NASA and the Japan Aerospace Exploration Agency.
Related Work

[…] cancer data available to the cancer community through the TCGA and ICGC consortia using several clouds, including Bionimbus.

Bionimbus also uses the data commons architecture illustrated in Figure 1. More specifically, the current architecture uses OpenStack to provide virtualized infrastructure, containers to provide a platform-as-a-service capability, and object-based storage with an AWS compatible interface. Bionimbus is a National Institutes of Health (NIH) Trusted Partner22 that interoperates with both the NIH Electronic Research Administration Commons to authenticate researchers and with the NIH Database of Genotypes and Phenotypes system to authorize users access to specific controlled access datasets, such as the TCGA dataset.

Discussion
Three projects that are supporting infrastructures similar to the OCC data commons are described in the sidebar. With the appropriate services, data commons support three different but related functions. First, data commons can serve as a data repository or digital library for data associated with published research. Second, data commons can store data along with computational environments in VMs or containers so that computations supporting scientific discoveries can be reproducible. Third, data commons can serve as a platform, enabling future discoveries as more data, algorithms, and software applications are added to the commons.

Data commons fit well with the science-as-a-service model: although data commons allow researchers to download data, host it themselves, and analyze it locally, they also allow current data to be reanalyzed with new methods, tools, and applications using collocated computing infrastructure. New data can be uploaded for an integrated analysis, and hosted data can be made available to other resources and applications using a data-as-a-service model, in which data in a data commons is accessed through an API. A data-as-a-service model is enhanced when multiple data commons and science clouds peer so that data can be moved between them at no cost.
… users requiring opt-in for continued resource usage extending into the next quarter. This provides a more formal reminder every three months to users who are finishing research projects to relinquish their quotas and has been successful in tempering unnecessary core usage. Similarly, as we moved to object storage functionality, we noted more responsible usage of storage, as scratch space is in ephemeral storage and removed by default when the computing environment is terminated. The small extra effort of moving data via an API to the object storage encourages more thoughtful curation and usage of resources.
Over the past several years, much of the research focus has been on designing and operating data commons and science clouds that are scalable, contain interesting datasets, and offer computing infrastructure as a service. We expect that as these types of science-as-a-service offerings become more common, there will be a variety of more interesting higher-order services, including discovery, correlation, and other analysis services that are offered within a commons or cloud and across two or more commons and clouds that interoperate.
Today, Web mashups are quite common, but analysis mashups, in which data is left in place but continuously analyzed as a distributed service, are relatively rare. As data commons and science clouds become more common, these types of services can be more easily built.
Finally, hybrid clouds will become the norm. At the scale of several dozen racks (a cyberpod), a highly utilized data commons in a well-run datacenter is less expensive than using today's public clouds.22 For this reason, hybrid clouds consisting of privately run cyberpods hosting data commons that interoperate with public clouds seem to have certain advantages.
… "of data" as the number of data commons begins to grow, as standards for data commons and their interoperability begin to mature, and as data commons begin to peer.

Acknowledgments
This material is based in part on work supported by the US National Science Foundation under grant numbers OISE 1129076, CISE 1127316, and CISE 1251201 and by National Institutes of Health/Leidos Biomedical Research through contracts 14X050 and 13XS021/HHSN261200800001E.

References
1. R.L. Grossman et al., "The Design of a Community Science Cloud: The Open Science Data Cloud Perspective," Proc. High Performance Computing, Networking, Storage and Analysis, 2012, pp. 1051–1057.
2. P. Mell and T. Grance, The NIST Definition of Cloud Computing (Draft): Recommendations of the National Institute of Standards and Technology, Nat'l Inst. Standards and Tech., 2011.
3. R. Dooley et al., "Software-as-a-Service: The iPlant Foundation API," Proc. 5th IEEE Workshop Many-Task Computing on Grids and Supercomputers, 2012; www.semanticscholar.org/paper/Software-as-a-service-the-Iplant-Foundation-Api-Dooley-Vaughn/ccde19b95773dbb55328f3269fa697a4a7d60e03/pdf.
4. I. Foster, "Globus Online: Accelerating and Democratizing Science through Cloud-Based Services," IEEE Internet Computing, vol. 3, 2011, pp. 70–73.
5. Y.L. Simmhan, B. Plale, and D. Gannon, "A Survey of Data Provenance in E-Science," ACM Sigmod Record, vol. 34, no. 3, 2005, pp. 31–36.
6. A. Chervenak et al., "Wide Area Data Replication for Scientific Collaborations," Int'l J. High Performance Computing and Networking, vol. 5, no. 3, 2008, pp. 124–134.
7. J. Alameda et al., "The Open Grid Computing Environments Collaboration: Portlets and Services for Science Gateways," Concurrency and Computation: Practice and Experience …
12. …ment and Recommendations," Earth Science Informatics, vol. 4, no. 3, 2011, pp. 139–160.
13. C. Lagoze et al., "CED2AR: The Comprehensive Extensible Data Documentation and Access Repository," Proc. IEEE/ACM Joint Conf. Digital Libraries, 2014, pp. 267–276.
14. J. Kunze, "Towards Electronic Persistence Using ARK Identifiers," Proc. 3rd ECDL Workshop Web Archives, 2003; https://wiki.umiacs.umd.edu/adapt/images/0/0a/Arkcdl.pdf.
15. J.R. Kunze, The ARK Identifier Scheme, US Nat'l Library Medicine, 2008.
16. T. Pollard and J. Wilkinson, "Making Datasets Visible and Accessible: DataCite's First Summer Meeting," Ariadne, vol. 64, 2010; www.ariadne.ac.uk/issue64/datacite-2010-rpt.
17. J. Starr et al., "A Collaborative Framework for Data Management Services: The Experience of the University of California," J. eScience Librarianship, vol. 1, no. 2, 2012, p. 7.
18. A. Ball and M. Duke, "How to Cite Datasets and Link to Publications," Digital Curation Centre, 2011.
19. T. Green, "We Need Publishing Standards for Datasets and Data Tables," Learned Publishing, vol. 22, no. 4, 2009, pp. 325–327.
20. D. Mandl et al., "Use of the Earth Observing One (EO-1) Satellite for the Namibia SensorWeb Flood Early Warning Pilot," IEEE J. Selected Topics in Applied Earth Observations and Remote Sensing, vol. 6, no. 2, 2013, pp. 298–308.
21. A.P. Heath et al., "Bionimbus: A Cloud for Managing, Analyzing and Sharing Large Genomics Datasets," J. Am. Medical Informatics Assoc., vol. 21, no. 6, 2014, pp. 969–975.
22. D.N. Paltoo et al., "Data Use under the NIH GWAS Data Sharing Policy and Future Directions," Nature Genetics, vol. 46, no. 9, 2014, p. 934.
23. Future Directions for NSF Advanced Computing Infrastructure to Support US Science and Engineering in 2017–2020, Nat'l Academies Press, 2016.
24. L.A. Barroso, J. Clidaras, and U. Hölzle, "The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines," Synthesis Lectures on Computer Architecture, vol. 8, no. 3, 2013, pp. 1–154.
25. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Comm. ACM, vol. 51, no. 1, 2008, pp. 107–113.
26. G. DeCandia et al., "Dynamo: Amazon's Highly Available Key-Value Store," ACM SIGOPS Operating Systems Rev., vol. 41, no. 6, 2007, pp. 205–220.

Robert L. Grossman is director of the University of Chicago's Center for Data Intensive Science, a professor in the Division of Biological Sciences at the University of Chicago, founder and chief data scientist of Open Data Group, and director of the nonprofit Open Commons Consortium. Grossman has a PhD from Princeton University from the Program in Applied and Computational Mathematics. He's a Core Faculty and Senior Fellow at the University of Chicago's Computation Institute. Contact him at [email protected].

Allison Heath is director of research for the University of Chicago's Center for Data Intensive Science. Her research interests include scalable systems and algorithms tailored for data-intensive science, specifically with applications to genomics. Heath has a PhD in computer science from Rice University. Contact her at [email protected].

Mark Murphy is a software engineer at the University of Chicago's Center for Data Intensive Science. His research interests include the development of software to support scientific pursuits. Murphy has a BS in computer science engineering and a BS in physics from the Ohio State University. Contact him at [email protected].

Maria Patterson is a research scientist at the University of Chicago's Center for Data Intensive Science. She also serves as scientific lead for the Open Science Data Cloud and works with the Open Commons Consortium on its Earth science collaborations with NASA and NOAA. Her research interests include cross-disciplinary scientific data analysis and techniques and tools for ensuring research reproducibility. Patterson has a PhD in astronomy from New Mexico State University. Contact her at [email protected].

Walt Wells is director of operations at the Open Commons Consortium. His professional interests include using open data and data commons ecosystems to accelerate the pace of innovation and discovery. Wells received a BA in ethnomusicology/folklore from Indiana University and is pursuing an MS in data science at CUNY. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.
SCIENCE AS A SERVICE
MRICloud provides a high-throughput neuroinformatics platform for automated brain MRI segmentation
and analytical tools for quantification via distributed client-server remote computation and Web-based
user interfaces. This cloud-based service approach improves the efficiency of software implementation,
upgrades, and maintenance. The client-server model is also ideal for high-performance computing,
allowing distribution of computational servers and client interactions across the world.
In our laboratories at Johns Hopkins University, we have more than 15 years of experience in developing image analysis tools for brain magnetic resonance imaging (MRI) and in sharing the tools with research communities. The effort started when we developed DtiStudio in 2000,1 as an executable program that could be downloaded from our website to perform tensor calculation of diffusion tensor imaging and 3D white matter tract reconstruction. In 2006, two more programs (RoiEditor and DiffeoMap) joined the family that we collectively called MriStudio. These two programs were designed to perform ROI (region of interest)-based
image quantification for any type of brain MRI data. The ROI could be manually defined, but DiffeoMap introduced our first capability for automated brain segmentation. We based our work on a single-subject atlas with more than 100 defined brain regions that were automatically deformed to image data, thus transferring the predefined ROIs in the atlas to achieve automated brain segmentation of the target. We call this image analysis pipeline high-throughput neuroinformatics,2 as it offers the user the opportunity to reduce MR imagery on the order of O(10^6 to 10^7) variables to O(1,000) dimensions associated with the neuro-ontology of atlas-defined structures. These 1,000 dimensions are searchable and can be used to support diagnostic workflows.
The core atlas-to-data image analysis is based on advanced diffeomorphic image registration algorithms for positioning information in human anatomical coordinate systems.3 To position dense atlas-based image ontologies, we use image-based large deformation diffeomorphic metric mapping (LDDMM),4 which is most efficiently implemented using high-performance networked systems, especially for large-volume data such as high-resolution T1-weighted images. In 2006, the term cloud wasn't yet widely used, but we employed a concept similar to that of cloud storage to solve this problem. Specifically, we used an IBM supercomputer at Johns Hopkins University's Institute for Computational Medicine to remotely and transparently process user data. Since the introduction of DiffeoMap, approximately 50,000 whole-brain MRI datasets have been processed using this approach. The platform naturally evolved into MRICloud, which we introduced in December 2014 as a beta testing platform. This Web-based software follows a cloud-based software-as-a-service (SaaS) model.
After 15 years of software development, the number of MriStudio's registered users now approaches 10,000, and in 2015, the number of datasets processed through the new cloud system reached a record of 3,500 per month. One motivation to adopt a cloud system is to exploit publicly available supercomputing systems for CPU- and memory-intensive operations. For example, although each MR image is typically 10 to 20 Mbytes, our current image-segmentation algorithm with 16 reference atlases requires approximately … the 16 atlases and completes the calculation in 15 to 20 minutes using 8 cores for each atlas registration (a total of 128 cores), which is equivalent to consuming 32 to 43 CPU-hour service units (SUs). Through the cloud system, users can transparently access this type of high-performance computation facility and run large cohorts of data in a short amount of time, which is certainly an advantage. However, in the transition from a conventional executable-distribution model to a cloud platform, it has become apparent that computational power is only one advantage that a cloud system can offer in terms of science as a service. It also changes the efficiency of new tool development and dissemination, enabling services that weren't previously possible. In this article, we share the experiences we have accumulated during our period of development.

Software as a Service
The core mapping service is a computationally demanding high-throughput image analysis algorithm that parcels brain MRIs into upward of 400 structures by positioning the labeled atlas ontologies into the coordinates of the brain targets. The approach assumes that there exists a structure-preserving correspondence, which we term a diffeomorphism: a one-to-one smooth mapping $M: I \to I_{\mathrm{atlas}}$ between the target $I(x)$, $x \in X$, and the atlas. Here, $I$ and $I_{\mathrm{atlas}}$ are the target and atlas images, respectively, $x$ and $X$ denote an image's individual coordinates and spatial domain, and $M$ denotes the diffeomorphic transformation between the two images. The correspondence between the individual and the atlas is termed the DiffeoMap. We interpret the morphisms $M(x)$, $x \in X$, as carrying the contrast MR imagery $I(x)$, $x \in X$. The morphisms provide a GPS3 for both transferring the atlas's ontological semantic labeling and providing coordinates to statistically encode the anatomy.
Personalization of atlas coordinates to the target occurs via smooth transformation of the atlas, which minimizes the distance $\inf_{\phi} d(I, I_{\mathrm{atlas}} \circ \phi_1^{-1})$ between the individual's representation $I$ and the transformed atlas $I_{\mathrm{atlas}} \circ \phi_1^{-1}$,5 with the transformation solving the flow equation $\dot{\phi}_t = v_t(\phi_t)$, $t \in [0, 1]$, and minimizing the integrated cost $\int_0^1 \lVert v_t \rVert_V \, dt$, where $\dot{\phi}_t$ denotes the first-order differentiation of $\phi_t$ at time $t$, and the integral denotes the integration of the norm of $v_t$ over the entire velocity field, with $V$ the Hilbert space of smooth vector fields.
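Gathering the quantities just defined, one common way to state the whole matching problem as a single objective is the following display; it merely restates the distance $d$, flow equation, and velocity-norm cost above, and is not an additional formula from the MRICloud pipeline:

\[
\min_{v}\; d\!\left(I,\; I_{\mathrm{atlas}} \circ \phi_1^{-1}\right) \;+\; \int_0^1 \lVert v_t \rVert_V \, dt,
\qquad \text{subject to } \dot{\phi}_t = v_t(\phi_t),\; t \in [0, 1].
\]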
Figure 1. The structural-functional model of atlas-based MRI informatics. MRI imagery from different modalities, such as T1- and T2-weighted structural MRI, diffusion tensor MRI, functional and resting-state functional MRI, and MR spectroscopy images, can be parcellated into predefined structures based on the presegmented MRI atlases. This allows for extraction of multicontrast features, from hundreds of anatomical structures to millions of voxels, in a reduced dimension.
Figure 1 shows examples of our structure-function model, including T1- and T2-weighted structural contrast imagery, orientation vector imagery (such as diffusion tensor MRI), metabolism measured via magnetic resonance spectroscopy, and functional connectivity via resting-state functional MRI (rs-fMRI).6 Each atlas carries with it the means and variances associated with each high-dimensional feature vector.

Evolution of Software Architecture
To highlight the software architecture's evolution (see Figure 2), let's first look at the functions of three key software programs. DtiStudio is software developed on the Windows platform and written in C++ for core algorithms. It also contains components of MS-Visual C, MFC, and OpenGL. User data and the executable file are both located on users' local computers (Figure 2a).1 The executable file needs to be downloaded from our website (www.mristudio.org), but all operations are performed within users' local computers, including data I/O, calculations, and the visualization interface. The input data are raw diffusion-weighted images and associated parameters from MRI scanners, from which diffusion tensor matrices are calculated. The software also offers ROI drawing and tractography tools to define white matter tracts and perform quantifications.
DiffeoMap is an example of a model in which external computation power is incorporated based on a seamless communication scheme (Figure 2b).7 The software reads two images (a reference brain atlas and a user-provided image), and one image is transformed into the shape of the other, thereby anatomically registering the voxel coordinates of the two images.
Figure 2. Schematic diagrams of the architectures of (a) DtiStudio, (b) DiffeoMap, and (c) MRICloud. DtiStudio is an example of a conventional distribution model, in which an executable file is downloaded to local computers and the entire process takes place within local computers. DiffeoMap has an internal architecture similar to that of DtiStudio, but the CPU-demanding calculations associated with DiffeoMap's large deformation diffeomorphic mapping occur on a remote Unix/Linux server. For the MRICloud system, the entire calculation occurs in the remote server, and the communication with users relies on a Web interface. The system has flexible scalability and contains a storage system for temporary storage of user data.
Basic image transformations (voxel resizing, image cropping, and linear transformation) and associated functions (file format conversion, intensity matching) are performed locally. The data I/O and visualization interfaces also remain in the local Windows platform. However, diffeomorphic image transformation, which is too CPU-intensive for local PCs, is performed by a remote server. Communication with the remote server is performed through HTTPS and FTP protocols and through notification to users via email. Once users' data are automatically sent to the remote server, the server performs the diffeomorphic transformation, and the resultant transformation matrices, which are typically about 1 Gbyte, can be retrieved by DiffeoMap through 32-bit data identifiers provided in the email.
MRICloud is the latest evolution of our software, in which computationally intensive algorithms are migrated to a remote server (Figure 2c). The cloud computing model is an attractive client-server model that we adopted because of the ease of scalability, portability, accessibility, and maintenance cost, providing a "virtual" hardware environment that decouples the computer from the physical hardware. The computer is referred to as a virtual machine and behaves like a software program that can run on another computer. Abstracting the computer from hardware facilitates movement and scaling of virtual machines on the fly.

Cloud System Architecture
The main entry point to the server infrastructure is through either the MRICloud Web application or its accompanying RESTful Web API. Data payloads can be several hundreds of Mbytes, and a special jQuery interface is used to facilitate resumable uploads because they aren't directly supported by the HTTP protocol.
Figure 3. Diagram of core MriStudio and MRICloud server components. MRICloud.org or MriStudio applications generate a zip from the users' data, which is uploaded to an anonymous FTP server (ftp.mristudio.org). Another server (io19.cis.jhu.edu) monitors the incoming queue for new data. Upon arrival, this server validates the data, identifies a computation resource, and copies the data to one of the clusters (currently, http://icm.jhu.edu, www.xsede.org, or www.marcc.jhu.edu). The data are then queued using an SSH signal. The validation and allocation server also monitors job completion and updates the job status at www.mricloud.org or sends an email to MriStudio users with a URL of the data location.
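The caption's queue-validate-dispatch loop, reduced to a Python sketch; the production monitor is BASH/PHP on a LAMP stack, and the directory, host, and remote queue_job command below are hypothetical stand-ins.

import subprocess
import time
from pathlib import Path

INCOMING = Path("/ftp/incoming")            # hypothetical FTP drop zone
CLUSTER = "[email protected]"    # hypothetical cluster login

def payload_is_complete(zip_path):
    # Placeholder for the real completeness/validation checks.
    return zip_path.suffix == ".zip"

def run_monitor(poll_seconds=60):
    while True:
        for payload in INCOMING.glob("*.zip"):
            if payload_is_complete(payload):
                # Copy the payload to the chosen cluster, then queue the job
                # via an SSH signal, mirroring the flow in Figure 3.
                subprocess.run(["scp", str(payload), f"{CLUSTER}:queue/"], check=True)
                subprocess.run(["ssh", CLUSTER, "queue_job", payload.name], check=True)
                payload.unlink()  # consumed from the incoming queue
        time.sleep(poll_seconds)

if __name__ == "__main__":
    run_monitor()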
A successful upload returns a job identifier that references the data and its pipeline throughout the system. The job ID is used to check the status of the processing and to reference the resulting processed data to be downloaded.
Figure 3 outlines our back-end processing pipeline, which is built from standard legacy protocols on a LAMP (Linux, Apache, MySQL, PHP) stack, also including FTP, SSH, SMTP, and high-level scripting (BASH, PHP) that keeps the system lightweight, simple, robust, and easily maintainable. Once the data are uploaded, they're repacked in a zip payload structure to move through the system, first moving into a queue via FTP. Once consumed by the monitor, the payload is validated for completeness. Then, an available computational resource (www.xsede.org or www.marcc.jhu.edu) is identified, and the data are submitted to the cluster's processing queue using an SSH signal. The cluster uses SMTP to signal that the job is submitted, and the monitor then polls for complete jobs. Upon job completion, the resulting data are moved to an FTP server, and the user is notified by email with the URI to retrieve the data. Alternatively, the user can check on the status of the processing at any time via the MRICloud website and retrieve the data from there if they're ready.
To facilitate a programmatic interface to the processing, the RESTful Web API provides a service that can ping the status of the processing, as well as another service for downloading the data. Therefore, a user can batch process and retrieve results without a human in the loop and is notified when the MRI images being processed are completed. An example protocol might be: api_submit data; in a loop, api_job_status every 30 seconds until complete; and api_download result. As with any RESTful Web API, this can be done programmatically in any language that supports the HTTP protocol.
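Under those assumptions (the api_submit, api_job_status, and api_download names come from the example protocol above; the base URL, parameters, and JSON fields are our guesses, not the documented MRICloud API), a batch client reduces to a short polling loop:

import time
import requests

BASE = "https://mricloud.org/api"  # hypothetical endpoint root

def process(zip_path):
    """Submit a payload, poll until complete, and return the result bytes."""
    with open(zip_path, "rb") as f:
        job = requests.post(f"{BASE}/api_submit", files={"data": f}).json()
    job_id = job["job_id"]  # hypothetical field: the pipeline-wide job ID
    while True:
        status = requests.get(f"{BASE}/api_job_status", params={"id": job_id}).json()
        if status["status"] == "complete":  # hypothetical field
            break
        time.sleep(30)  # poll every 30 seconds, as in the example protocol
    return requests.get(f"{BASE}/api_download", params={"id": job_id}).content

with open("segmentation_result.zip", "wb") as out:
    out.write(process("t1_payload.zip"))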
To secure the processing pipeline, SSH is core to data transfer and signaling commands on remote systems. The validation and cluster allocation server uses public and private keys with authorized-key restrictions. The root of the server allows SSHFS to mount a restricted area of data space for processing storage. A user-level SSH public/private key with authorized_keys restrictions is used to signal the cluster for job submission. The user-level account doesn't have direct shell access to the cluster but moves data through the root-level SSHFS mount point.
Figure 4. The actual login window and one of the SaaS interfaces of MRICloud. (a) User registration and authentication are essential for cloud-based services. (b) Currently, six SaaSs have been tested, including brain segmentation, diffusion tensor imaging (DTI) calculation, resting-state functional MRI (rs-fMRI), arterial spin labeling, angio scan parameter calculation, and surface mapping. The figure shows the interface for the brain segmentation based on T1-weighted images.
Step-by-Step Procedure for the Cloud Service
To provide a clear illustration of how the cloud-based SaaS functions, Figure 4 shows the actual steps involved in the image analysis services. The first step is to create an account in the login window (Figure 4a). Once logged in, users have access to several SaaSs, including T1-based brain segmentation, diffusion tensor imaging (DTI) data processing, resting-state fMRI analysis, and arterial spin labeling data processing.
If the T1-based brain segmentation SaaS is chosen, a data upload page appears (Figure 4b), in which users need to choose several options, including the choice of processing servers, image orientations (sagittal or axial), and multiatlas libraries. Currently, the SaaS accepts a specific file format generated by a small program that needs to be downloaded from the MRICloud website. If users want to compare their data with the internal control data being logged within MRICloud, demographic information must also be provided. Users have a choice of two processing servers: the Johns Hopkins University IBM Blade computer, supported by the Institute for Computational Medicine (http://icm.jhu.edu), and the University of California, San Diego, Gordon computer from the US National Science Foundation for the Computational Anatomy Gateway via XSEDE (www.xsede.org). Thus far, the services at the Computational Anatomy Gateway have been supported by the XSEDE grant program, which allows us to provide the MRI SaaS to users free of charge. Our current effort is focused on utilizing publicly available computational resources to make them available for users. Using the consumed SUs and computing resources, we can compare efficiency in terms of computing consumption. SUs can be defined as SU = (wall time/60) × (total CPU number). Given a T1-segmentation pipeline using 16 atlases, the wall time on XSEDE resources would be 32 minutes. The SU and runtime increase with the number of atlases and also depend on the available number of CPUs, as illustrated in Figure 5. On the current MRICloud platform, 45 and 30 atlases are in use for adult and pediatric target images, respectively. The results in Figure 5a highlight the importance of parallelization and the enhanced efficiency gained by employing supercomputing resources. The runtime of the pipeline decreases drastically as more CPUs become available. Figures 5b and 5c demonstrate the pipeline's scalability when a large number of cores/CPUs are available. As long as there is a sufficient number of CPUs available, an increased number of atlases does not lead to an increased runtime.
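As a quick check on the SU formula, a few lines of Python reproduce the 32 to 43 SU range quoted earlier for the 16-atlas pipeline (15 to 20 minutes of wall time across 128 cores); the helper function is our illustration, not an MRICloud API.

# Sanity check of SU = (wall time / 60) * (total CPU number), using the
# 16-atlas figures quoted earlier in the article.
def service_units(wall_minutes, total_cpus):
    return wall_minutes / 60 * total_cpus

print(service_units(15, 128))  # 32.0  -> lower bound of the 32-43 SU range
print(service_units(20, 128))  # ~42.7 -> upper bound of the 32-43 SU range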
[Figure 5: (a) runtime (s) and system units versus CPU number (16 to 256); (b) system units (CPU hours) versus atlas number/CPU number (4/32 to 16/128) on Stampede and Gordon; (c) runtime (s).]

… downloaded or viewed from the same page (Figure 6a). The "view results" option opens a new webpage that allows visual inspection of the segmentation results, which are displayed in three orthogonal views and a 3D surface rendering (Figure 6b). Users can examine the quality of the segmentation with these views; in addition, if the age of the data is specified at data submission, the volume of each defined structure can be compared to age-matched controls based on z-scores or within an age-versus-volume plot. These control data are stored in MongoDB; results on the Web are updated in real time as the control database evolves.
The downloaded files contain information about the volumes and intensities of segmented structures. Currently, we offer atlas version 7a, which has 289 defined structures and a five-level ontological relationship for these structures, as described in our previously published paper.8 This service converts T1-weighted images with more than 1 million voxels to standardized and quantitative matrices with [volume, intensity] × 289 structures.
This T1-based brain segmentation service is linked to other SaaSs provided by MRICloud. For example, rs-fMRI and arterial spin labeling (ASL) services incorporate the structural segmentation results …
use of a cloud service should be included in each project's Institutional Review Board approval. For clinical purposes, it isn't immediately clear whether hospitals would allow their data to migrate to an outside entity, even temporarily, without proper permission.
Our current approach is to develop a small executable program that can read original DICOM from most vendors, which is then distributed to each local client; file conversions are executed from DICOM to two simple files, a raw image matrix and a header file that contains only the matrix dimension information. (De-identification and file standardization are performed on local computers before the data are uploaded to the cloud service.) This executable file needs to be constantly updated and distributed as vendors change their DICOM contents. In this sense, the cloud system requires a lightweight download of a preprocessing executable program and isn't completely free from distribution burdens and local computations. This strategy also indicates that all de-identification is accomplished by users prior to submission of their data to the service, and, therefore, the SaaS is free from HIPAA issues. However, questions remain about the unique signatures embedded in the image. We can assume that highly processed data, such as the volumes of 100 structures, are essentially anonymized data, but we can also argue that unique identifiers associated with imaging features are the purpose of the SaaS. Certainly, at some point we need to define a line where HIPAA is applicable or not, although such a boundary isn't entirely clear. In addition, for science as a service, it would be more beneficial for users if the HIPAA issue were handled on the server side. Another interesting strategy, which we're testing for clinical applications, is to transplant the entire cloud service behind an institutional firewall. This hybrid approach falls between a distribution and a cloud model. It is highly viable because the cloud architecture is portable and transplantation is relatively straightforward, but it loses several advantages, such as access to public supercomputing resources, and multiplies the effort needed to maintain the servers. These issues deserve more discussion for science as a service in the future.
The SaaS model is powerful when technologies are mature: deployment of a new SaaS is, in theory, straightforward if core programs are written without relying on platform-dependent libraries. If we already have local executable files to perform certain types of image analysis, they can be implemented in a computation server and linked to a Web interface for users. However, this process does require communication between program developers and cloud systems, who must agree on exactly what the inputs and outputs are. The Web interface then is designed to meet specific needs.
During this process, however, it is important to realize that two phases in science as a service profoundly affect service design. In the first phase, users must have access to all the service software's parameters to provide scientific freedom and to maximally explore data contents, as well as to evaluate tool efficacy. This process is highly interactive and thus requires extensive user-interaction interfaces that let users visually inspect results and store intermediate files at each step. The cloud approach's performance could degrade if each interaction requires a large amount of data transfer. The Web interface also mandates modern designs to efficiently perform the frequent interactions between local and remote computers, especially for complex visualization and graphics interfaces. In the second phase, the technology matures, tool efficacy is established, and the majority of users start to use the same parameter sets and protocols. Here the cloud system is more efficient, as the information transfer occurs only twice: data upload and results download. One of the frequent questions we receive is, "Can we modify the segmentation results?" Unfortunately, in our current setup, the segmentation files must be downloaded to the local computer and modified by ROI-management software on the local PC. The cloud approach requires a balance between server- and client-side uploads and downloads. We find that our downloadable executable programs such as MriStudio provide advantages in terms of physical public network separation among data storage, visualization, and memory-based compute engines, facilitating user feedback and stepwise quality-control monitoring. However, the scaling arguments, ease of software maintenance and upgrades, and large-scale computation distribution through national computing networks give the cloud solution its own distinct advantages.
In the first phase, it's important to stress that the maturation process takes place both through users' experience in testing and parameter choices and through developers' efforts to revise the software to accommodate user requests for better or newer functionalities. In this period of dynamic updates, high-level programming languages, such as Matlab and IDL, provide an ideal environment for efficient revisions. This could also facilitate open source strategies and user participation in
software development. As the software becomes mature and services are solidified, the final phase of the maturation process should be tested by processing a large amount of data. For example, a simple task, such as skull-stripping, has two modes of failure. In the early phase of tool development, the tool is improved by minimizing the leakage (or removal) of brain definitions (say, 5 percent leakage of the voxels outside the brain), but in the latter phase, as tool performance improves, our interest shifts to occasional failures (5 percent of the population) that are encountered only in a large-scale analysis. At this point, the low computational efficiency of the high-level language starts to become a major obstacle due to low throughput: every time we make a minor modification, we need to wait a week to complete 100 test datasets. At some point, recoding in C++ is inevitable, which can improve computation time by as much as 10,000 times over the original, depending on the algorithms. Considering the nature of the cloud-based SaaS and its position in science as a service, it makes more sense to deploy software using the lower-level language. This is especially important when we utilize national computation facilities because we need to make every effort to maximize the resources. One practical limitation, however, is that it isn't always easy to secure human resources to support these types of efforts. As much as we need the expertise and knowledge of trainees and faculty in academic institutes, some crucial efforts needed to develop a sophisticated cloud and SaaS aren't a subject for academic publications.

Impact of SaaS on Data Sharing
In recent years, data sharing has become an important National Institutes of Health policy, and there are many data available in the public domain, including the Alzheimer's Disease Neuroimaging Initiative (ADNI; www.adni-info.org) for Alzheimer's disease; Pediatric Imaging, Neurocognition, and Genetics (PING; http://pingstudy.ucsd.edu) for normal pediatric data; and the National Database for Autism Research (NDAR; https://ndar.nih.gov) for autism research. What is common to these databases is the availability of raw data with which research communities can apply their own tools to extract biologically or clinically important findings. In these types of public databases, proactive plans and coordinated efforts, as well as funding, are needed to acquire data in a uniform manner, establish a database structure, gather data, and maintain them.
SaaS introduces a very different perspective to database development and sharing because a large number of data would be submitted by users that conform to a relatively uniform image protocol to perform the automated image analysis. This data collection doesn't require coordinated planning or specific funding; users have motivation to acquire images of specified types, submit their data, and have access to the automated segmentation services. The cloud host then has opportunities to build two types of databases: users' raw images and processed data (such as the standardized anatomical feature vectors shown in Figure 1).
This indeed could be a new approach to facilitate efficient knowledge building and sharing. However, several hurdles should also be noted. First, our current SaaS has a rule to erase users' data after 60 days of storage. To retain them, we would need not only a much larger storage space but also permission from users. Storage of anatomical feature vectors, on the other hand, could be less of an issue, as they're much smaller and highly de-identified. In either case, probably the largest limitation is the availability of the associated nonimage information. In the regular data submission, users submit their images without demographic and clinical information. The resultant database would then have only anatomical features, which wouldn't be very useful for many application studies. It's relatively straightforward to build an interface to gather demographic and clinical information as part of a SaaS (see Figure 4b), but the barrier would be the extra effort required of users to compile and input them at the time of data submission. The incentive, therefore, would be building a useful database through SaaS for future data sharing. For example, if the service includes image interpretation (potential diagnoses and their likelihood) based on detailed clinical patient data, users might be willing to make the extra effort to submit additional information associated with the images. For the actual method to distribute data, we currently use the GitHub (https://github.com) repository, which has become a de facto site for data sharing. Our rich atlas resources are available through this channel.

What New Things Can We Do with the Cloud Service?
In the previous sections, we discussed the advantages and limitations of the cloud-based SaaS, highlighting differences from classical distribution models. In this section, we focus on service concepts that are only possible in the cloud platform. The key concept is "knowledge-driven" analysis.
Figure 7. Evolution of atlas-based brain segmentation approaches. In a single-atlas-based approach, only one atlas
is warped to the target image and, at the same time, transfers its presegmented labels to the target image. In the
probabilistic atlas-based approach, multiple atlases are warped to the target image, and a probabilistic map is
generated by averaging the label definitions from all atlases; image intensity information can be incorporated to
determine the final segmentation. The multiatlas-based approach also warps multiple atlases to the target image, but
employs arbitration algorithms (typically, weighting and fusion) to combine the multiple atlas labels to generate the
final segmentation.
to the gray matter. In this way, the probabilistic atlas could teach an algorithm about the anatomical signatures (locations and intensities) of each structure label such that the best labeling accuracy can be achieved. In the multiatlas framework, the process by which a probabilistic map is created is omitted, and multiple atlases are directly registered to the patient image, followed by an arbitration process.18–20 This process opens up many new possibilities for knowledge-based image analysis.
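The arbitration step is easiest to see in its simplest form: majority voting across the warped atlas labels. The following numpy sketch illustrates the idea only; real pipelines use weighted label fusion, and the array shapes here are toy values.

import numpy as np

def majority_vote(warped_labels):
    """warped_labels: (n_atlases, X, Y, Z) integer label volumes already
    registered to the target. Returns the fused (X, Y, Z) label volume."""
    n_labels = int(warped_labels.max()) + 1
    # Count the votes each label receives at every voxel, then take the
    # most frequent label as the final segmentation.
    votes = np.stack([(warped_labels == k).sum(axis=0) for k in range(n_labels)])
    return votes.argmax(axis=0)

toy_atlases = np.random.randint(0, 4, size=(30, 8, 8, 8))  # 30 toy atlases
print(majority_vote(toy_atlases).shape)  # (8, 8, 8)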
For example, in the multiatlas framework, the atlas library can be enriched, altered, or revised easily without creating a population-averaged atlas. The appropriate atlases can be dynamically chosen from a library. The criteria for appropriateness could be nonimage attributes, such as age, gender, race, or diagnosis.21 Image-based attributes, such as intensity, shape, or the amount of atrophy, could be used to determine the contributions of the atlases.12,22,23 By extending this notion, the selection of images from an atlas library can evolve into content-based image retrieval (CBIR),24,25 and if the library is sufficiently large, with various pathological cases and rich clinical information, statistics about the retrieved images, such as diagnosis information, could be generated.
While the multiatlas-based analysis provides interesting and new research frontiers, it also poses unique challenges. First, the algorithm is CPU-intensive. For segmentation and mapping based on a single atlas or a probabilistic atlas, image registration is required only once. However, for 30 atlases, the registration has to be repeated 30 times, followed by another CPU-intensive arbitration process. If the user chooses to select a subset of "appropriate" atlases from a 300-atlas library, further calculation would be needed. This implies the issue of content management for atlas libraries. Because the libraries of data sources are dynamically evolving in quantity and quality with frequent updates, it isn't realistic to distribute the entire library to every user and provide version management. The cloud-based approach provides a high-performance computation environment and centralized management of the atlas libraries,26 therefore enabling advanced multiatlas technologies and applications.

Linkage of Services
The cloud-based SaaS provides unique platforms to link different types of service tools. Many researchers in image analysis communities often make their own programs to analyze their data or assist in the interpretation of MR scans. Many of these tools are highly valuable for these communities, and developers are willing to share them. However, for their programs to be widely adopted, they need to develop user interfaces, distribution channels, and user management systems, such as registration and communications. Based on our experience in developing the MriStudio software family, we know how time-consuming it is to develop new programs as stand-alone software. In the cloud platform, the addition of new SaaSs is straightforward as long as the tools are mature enough to follow the API, which is nothing more than defining input and output parameters. Then, the developers can enjoy the existing infrastructures for supercomputation resources, processing status management, data upload/download functions, and user management, such as registration and notifications. This kind of expandability doesn't have to rely on a specific cloud platform like MRICloud because other developers can create their own cloud platforms and access our SaaS without going through our cloud interface. This has an important implication for future extensions of medical imaging informatics.
In the past, attempts have been made to integrate results from multiple contrasts, multiple imaging modalities, and multiple medical records. These integrative analyses have, however, been hampered by the fact that they need to ensure that data from each modality have already been standardized, quantified, and dimension-reduced. If we use the analogy of building a house, a cloud platform such as MRICloud serves as one of the foundations on which to build vertical columns that correspond to each SaaS. If we come up with a new image analysis tool, it can be integrated into one of the cloud foundations as a new service column. In this context, the cloud platform's role is to provide an environment in which to readily establish new columns. The real power of the cloud strategy is then materialized when a "horizontal service" (corresponding to the roof of the house in this analogy) emerges, spanning not only multiple service columns but also multiple cloud foundations.
In the field of medical records, there are high expectations for the integration of big data associated with available medical records to create a knowledge database and provide personalized medicine through the comparison of the features of individual patients to the knowledge database. This is a typical example of a horizontal service, but if we open the electronic health records currently available in each hospital, we soon realize that the data aren't standardized, structured, quantitative, consistent, or cohesive. The integrative analysis by a horizontal service would become prohibitively difficult if, for example, one aspect of the data were a raw MR image that didn't specify where the brain is within the 8 million voxels (200 × 200 × 200 image dimensions). The success of horizontal services, therefore, hinges on the proliferation of high-quality vertical services. This is somewhat akin to integrative travel services, such as Orbitz, Expedia, and Booking.com, which rely on reservation SaaSs for each hotel, airline, or rental car company.
Vertical SaaSs, established in many different medical applications, have the potential to be linked via third-party horizontal services to perform higher-order integrative analysis, which, in the future, could realize new medical informatics that we haven't yet imagined.

References
7. K. Oishi et al., "Atlas-Based Whole Brain White Matter Analysis Using Large Deformation Diffeomorphic Metric Mapping: Application to Normal Elderly and Alzheimer's Disease Participants," NeuroImage, 19 Jan. 2009, pp. 486–499.
8. A. Djamanakova et al., "Tools for Multiple Granularity Analysis of Brain MRI Data for Individualized Image Analysis," NeuroImage, vol. 101, 2014, pp. 168–176.
9. A.V. Faria et al., "Content-Based Image Retrieval for Brain MRI: An Image-Searching Engine and …
21. P. Aljabar et al., "Multi-atlas Based Segmentation of Brain Images: Atlas Selection and Its Effect on Accuracy," NeuroImage, vol. 46, no. 3, 2009, pp. 726–738.
22. F. Maes et al., "Multimodality Image Registration by Maximization of Mutual Information," IEEE Trans. Medical Imaging, vol. 16, no. 2, 1997, pp. 187–198.
23. M. Wu et al., "Optimum Template Selection for Atlas-Based Segmentation," NeuroImage, vol. 34, no. 4, 2007, pp. 1612–1618.
24. W. Hsu et al., "Context-Based Electronic Health Record: Toward Patient Specific Healthcare," IEEE Trans. Information Technology in Biomedicine, vol. 16, no. 2, 2012, pp. 228–234.
25. H. Müller et al., "A Review of Content-Based Image Retrieval Systems in Medical Applications—Clinical Benefits and Future Directions," Int'l J. Medical Informatics, vol. 73, no. 1, 2004, pp. 1–23.
26. D. Wu et al., "Resource Atlases for Multi-Atlas Brain Segmentations with Multiple Ontology Levels Based on T1-Weighted MRI," NeuroImage, vol. 125, no. 10, 2015, pp. 120–130.

Susumu Mori is a professor in the Department of Radiology at Johns Hopkins University School of Medicine. His research interest is to develop new technologies for brain MRI data acquisition and analyses. Mori has a PhD in biophysics from Johns Hopkins University School of Medicine. He's a Fellow of the International Society of Magnetic Resonance in Medicine. Contact him at [email protected].

Dan Wu is a research associate in the Department of Radiology at Johns Hopkins University School of Medicine. Her research interests include advanced neuroimaging and quantitative brain MRI analysis, especially atlas-based neuroinformatics for clinical data analysis. Wu has a PhD in biomedical engineering from Johns Hopkins University. She's a Junior Fellow of the International Society of Magnetic Resonance in Medicine. Contact her at [email protected].

Can Ceritoglu is a research scientist and software engineer in the Center for Imaging Science at Johns Hopkins University. His research interests include medical image processing. Ceritoglu has a PhD in electrical and computer engineering from Johns Hopkins University. Contact him at [email protected].

… University School of Medicine. Contact him at [email protected].

Anthony Kolasny is an IT architect at Johns Hopkins University's Center for Imaging Science. His research interests include high-performance computing, and he is the JHU XSEDE Campus Champion. Kolasny has an MS in computer science from Johns Hopkins University. He's a professional member of the Society for Neuroscience, Usenix, and ACM. Contact him at [email protected].

Marc A. Vaillant is president and CTO of Animetrics, a software company that provides facial recognition solutions to law enforcement, government, and commercial markets. His research interests include computational anatomy in the brain sciences and machine learning. Vaillant has a PhD in biomedical engineering from Johns Hopkins University. Contact him at [email protected].

Andreia V. Faria is a radiologist and an assistant professor in the Department of Radiology at Johns Hopkins University School of Medicine. Her interests include the development, improvement, and application of techniques to study normal brain development and aging, as well as pathological models. Faria has a PhD in neurosciences from the State University of Campinas. Contact her at [email protected].

Kenichi Oishi is an associate professor in the Department of Radiology at Johns Hopkins University School of Medicine. His research interests include multimodal brain atlases and applied atlas-based image recognition and feature extraction methods for various neurological diseases. Oishi has an MD in medicine and a PhD in neuroscience from Kobe University School of Medicine in Japan. Contact him at [email protected].

Michael I. Miller is the Herschel Seder Professor of Biomedical Engineering and director of the Center for Imaging Science at Johns Hopkins University. He has been influential in pioneering the field of computational anatomy, focused on the study of the shape, form, and connectivity of human anatomy at the morpheme scale. Miller has a PhD in biomedical engineering from Johns Hopkins University. Contact him at [email protected].
SCIENCE AS A SERVICE
Raimond L. Winslow, Stephen Granite, and Christian Jurado | Johns Hopkins University
The electrocardiogram (ECG) is the most commonly collected data in cardiovascular research
because of the ease with which it can be measured and because changes in ECG waveforms
reflect underlying aspects of heart disease. Accessed through a browser, WaveformECG is an open
source platform supporting interactive analysis, visualization, and annotation of ECGs.
The electrocardiogram (ECG) is a measurement of time-varying changes in body surface potentials produced by the heart's underlying electrical activity. It's the most commonly collected data in heart disease research because changes in ECG waveforms reflect underlying aspects of heart disease such as intraventricular conduction, depolarization, and repolarization disturbances,1,2 coronary artery disease,3 and structural remodeling.4 Many studies have investigated the use of different ECG features to predict the risk of coronary events such as arrhythmia and sudden cardiac death; however, it remains an open challenge to identify markers that are both sensitive and specific.
Many different commercial vendors have developed information systems that accept, store, and ana-
lyze ECGs acquired via local monitors. The challenge in applying these systems in clinical research is that
they’re closed and don’t provide APIs by which other software systems can query and access their stored
digital ECG waveforms for further analyses, nor the means for adding and testing novel data-processing
algorithms. They’re designed for use in patient care, rather than for clinical research. Despite the ubiquity
of ECGs in cardiac clinical research, there are no open, noncommercial platforms for interactive manage-
ment, sharing, and analysis of these data. We developed WaveformECG to address this unmet need.
WaveformECG is a Web-based tool for managing and analyzing ECG data, developed as part of the
CardioVascular Research Grid (CVRG) project funded by the US National Institutes of Health’s National
Heart, Lung, and Blood Institute.5 Users can browse their files and upload ECG data in a variety of vendor formats for storage. WaveformECG extracts and stores ECGs as time series; once data are uploaded, a browser can select, view, and scroll through individual digital ECG lead signals. Points and time intervals in ECG waveforms can be annotated using ontology from the Bioportal ontology server operated by the National Center for Biomedical Ontology (NCBO),6 and annotations are stored with the waveforms for later retrieval, enabling features of interest to be marked and saved for others. Users can select groups of ECGs for computational analysis via multiple algorithms, and analyses can be distributed across multiple CPUs to decrease processing time. WaveformECG has also been integrated with the Informatics for Integrating Biology and the Bedside (I2B2) clinical data warehouse system.7 This bidirectional coupling lets users define study cohorts

Figure 1. WaveformECG architecture. End users authenticate through Globus Nexus single sign-on and work in Liferay upload, visualize, analyze, and download portlets. Java libraries for backend data/metadata access and manipulation connect to analysis algorithms (QT screening, QRS detector, QRS score, heart rate variability) exposed as local and remote Apache Axis2 Web services. ECG time series and annotations are stored in OpenTSDB, with database/metadata storage in PostgreSQL, over a file system and Hadoop, ZooKeeper, and HBase.
Figure 2. Upload portal. (a) The listing on the left shows that the user has created the patient006 folder, into which
data will be uploaded. Datasets under “my subjects” are owned by the user. Folders group datasets by subjects,
and progress bars next to the file names in the center of the screen show progress on the upload of each file to the
server. The background queue on the right provides users with a real-time update of progress on dataset processing.
(b) The upload processing flow consists of five parts: server upload (“wait”); storage in LDML and parsing file data
(“transferred”); transfer of time-series data and analysis results to OpenTSDB (“written”); transfer of metadata to
PostgreSQL (“annotated”); and completion (“done”).
of sums, averages, max-min values, statistics, and custom functions.

OpenTSDB provides access to its storage and retrieval mechanisms via RESTful APIs.15 With this capability, other software systems can query OpenTSDB to retrieve ECG datasets. The open source relational database system PostgreSQL16 maintains file-related information and other metadata. PostgreSQL is also used for portal content management (user identities, portal configuration, the Liferay document and media library [LDML], and so on), storage of all uploaded ECG data files in their native format, and other ECG metadata (sampling rate, length of data capture, subject identifier, and so on).
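To make the RESTful retrieval concrete, here is a minimal sketch, in Java, of a query against OpenTSDB's standard /api/query endpoint; the metric and tag names (ecg.lead.signal, subject, lead) and the time window are hypothetical placeholders, not WaveformECG's actual storage schema.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class OpenTsdbQuery {
        public static void main(String[] args) throws Exception {
            // Request raw samples ("aggregator": "none") for one lead of one
            // subject over a short window; times are epoch milliseconds.
            String query = "{"
                + "\"start\": 1420070400000,"
                + "\"end\": 1420070402500,"
                + "\"msResolution\": true,"
                + "\"queries\": [{"
                + "\"metric\": \"ecg.lead.signal\","   // hypothetical metric name
                + "\"aggregator\": \"none\","
                + "\"tags\": {\"subject\": \"patient006\", \"lead\": \"II\"}"
                + "}]}";

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:4242/api/query")) // default OpenTSDB port
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            // The response body is a JSON array of {metric, tags, dps} objects,
            // where dps maps timestamps to sample values.
            System.out.println(response.body());
        }
    }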
Data Upload and Management
WaveformECG can import ECG data in several different vendor formats, including Philips XML 1.03/1.04, HL7aECG, Schiller XML, GE Muse/Muse XML 7+, Norav Raw Data (RDT), and the WaveForm DataBase (WFDB) format used in the Physionet Project.17 In addition to the ECG time series, Philips, Schiller, and GE Muse XML files also contain results from execution of vendor-specific proprietary ECG analysis algorithms, metrics on signal quality, and other data. These data are also extracted and stored.

Within the upload interface (Figure 2a), users can browse their file system to locate folders containing ECG data. Files are selected for upload by clicking the "choose" button or by dragging and dropping files into the central display area (Figure 2a). Clicking the "upload all" button initiates transfer of data from the user's file system to WaveformECG. WaveformECG automatically determines each file's format and follows a multistep procedure for storing and retrieving data (Figure 2b). Completion of these steps is used as an indicator of progress. Progress information is displayed in the right-most portion of the upload interface, under the "background queue" tab. In the first step, the system checks to make sure that all required files have transferred from the local source to the host. While most formats have only one file per dataset, some formats split information across multiple files. Figure 2a shows this for s0027lre, a WFDB-format ECG dataset. s0027lre's data is packaged in three different files, with dataset metadata in the header (.hea) file and time-series data in the other two (.dat and .xyz). In this example, WaveformECG has fully received the .hea file, but the .dat and .xyz file transfers are still in progress. The progress bar for dataset s0027lre is empty, and the phase column of the background queue displays "wait" because these data files are still being transferred. Once each ECG file's transfer to the service is complete, it's stored in its native format in the LDML. Files at this stage of the workflow have a progress bar at 40 percent completion, with "transferred" displayed in the "background queue" area. The folder structure within the LDML corresponds with that of the folders created by the user in the upload interface. WaveformECG displays this folder structure on all screens where the user interacts directly with their uploaded files.

Once transfer is complete, WaveformECG spawns a separate process to extract each ECG time series for storage in OpenTSDB. A single ECG file contains signals from multiple leads, and a time series for each lead signal is extracted and labeled with a unique TSUID. Once this step is complete for all leads of an ECG waveform, the progress bar moves to 60 percent completion, with "written" displayed in the "background queue" column.

Following completion of writing, another background process is spawned to extract ECG waveform analysis results from the files. Each result is labeled with an appropriate ontology term, selected from the NCBO Bioportal Electrocardiography Ontology (http://purl.bioontology.org/ontology/ECG), by storing the ontology ID along with the result. WaveformECG bundles this information, along with the subject identifier, the format of the uploaded ECG dataset, and the start time of the time series itself, and writes labeled analysis results into OpenTSDB. Once this is completed, the progress bar moves to 80 percent, with "annotated" displayed in the "background queue" phase column.

WaveformECG must be able to maintain a connection among the original uploaded ECG files, the stored time-series data, file metadata, analysis results, and manual annotations made to ECG waveforms. To do this, the OpenTSDB TSUID is stored in PostgreSQL. Once this is done, the progress bar moves to 100 percent, and "done" is displayed in the "background queue" phase column.

Data Analysis
ECG analysis algorithms are made available for use in WaveformECG as Web services. The analyze interface (Figure 3a) uses libraries for Web service implementations of ECG analysis algorithms accessed through Apache Axis2.18
Analysis Web services are developed by using Axis2 to communicate with the compiled version of the analysis algorithm. Axis2 is an open source XML-based framework that provides APIs for generating and deploying Web services. It runs on an Apache Tomcat server, rendering it operating-platform-independent. Algorithms developed in the interpreted language Matlab can be compiled using Matlab Compiler (www.mathworks.com/products/compiler/mcr/?requestedDomain=www.mathworks.com) and executed in Matlab Runtime, a stand-alone set of shared libraries that enables the execution of compiled Matlab applications or components on computers that don't have Matlab installed. An XML file defines the service, the commands it will accept, and the acceptable values to pass to it. In a separate administrative portion of the analyze interface, a tool allows administrators to easily add algorithms implemented as Web services to the system. Upon entry of proper algorithm details and parameter information, WaveformECG can invoke an algorithm that the administrator has deployed. This approach simplifies the process of adding new algorithms to WaveformECG.
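As an illustration of that deployment model, the following is a minimal sketch of an Axis2 "POJO" service; the class, method, and parameter names (QtScreeningService, detectQtIntervals, recordId) are hypothetical stand-ins rather than WaveformECG's actual code, and the descriptor shown in the comment is the standard Axis2 services.xml form.

    public class QtScreeningService {
        // Axis2 can expose a plain Java class as a Web service, with each
        // public method becoming an operation. A services.xml descriptor
        // bundled in the service archive names the class, for example:
        //   <service name="QtScreening">
        //     <parameter name="ServiceClass">QtScreeningService</parameter>
        //   </service>
        public double[] detectQtIntervals(String recordId, double filterCutoffHz) {
            // A real implementation would hand recordId's waveform to the
            // compiled analysis code and return the detected QT intervals;
            // this placeholder simply returns an empty result.
            return new double[0];
        }
    }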
Figure 3. Analysis portal. (a) Three datasets with different formats were selected for processing by multiple algorithms. The background queue shows progress in data processing: two datasets have each been processed using eight algorithms, while the third has completed processing by seven. (b) In the analysis process, data and algorithm selection follow a step-by-step workflow: drag and drop the files to analyze into the center pane, click the checkboxes for the algorithms to process them with, and click "start analysis"; the algorithms are then invoked on the selected files and progress is updated until done.

Figure 3a shows the analyze interface and Figure 3b shows the associated processing steps. Users select files or folders from the file chooser on the left; multiple files can be dragged and dropped into the central pane. Placing a file in that pane makes that file available for analysis by one or more algorithms, listed in the bottom center pane. Clicking the checkbox on an algorithm entry instructs the system to analyze the selected files with that algorithm. The checkbox at the top of the algorithm list allows a user to toggle the selection of all the available algorithms. All available algorithms have default settings—some have parameters that can be set via the "options" button, but all parameters set for an algorithm will be applied to all files to be analyzed. Upon selection of the files to be processed and the algorithms with which to process them, the user clicks the "start analysis" button, which creates a thread to handle the processing. The thread dispatches a RESTful call to OpenTSDB to retrieve all the requested data. Depending on the algorithms chosen, the thread writes the data into the formats required by the algorithms (for example, algorithms from the PhysioToolkit19 require that ECG data be in the WFDB file format). The thread then invokes the requested algorithms on the requested data. As long as the analyze screen remains open, the background queue is updated, incrementing the number of algorithms that have finished processing. Upon completion of all selected algorithms for one file, the phase updates to "done" in the background queue.

Figure 4. Multilead visualization interface. In the multilead display for 4 of the 15 leads from a GE Marquette Universal System for Electrocardiography (MUSE) XML upload, the vertical bar in 3 of the graphs represents the cursor location in the first graph. The bars move with the cursor and change focus as the cursor changes graphs.

Analyses of ECGs can provide information on both the heart's normal and pathological functions. The lead V3 ECG waveform in Figure 4 shows body surface potential (μV; ordinate) as a function of time (sec; abscissa) measured over a single heartbeat. The ECG P, Q, R, S, and T waves are labeled. The P wave reflects depolarization of the cardiac atria in response to electrical excitation produced in the pacemaker region of the heart, the sinoatrial node. Onset of the Q wave corresponds to onset of depolarization of the cardiac interventricular septum. The R and S waves correspond to depolarization of the remainder of the cardiac ventricles and the Purkinje fibers, respectively. Ventricular activation time is defined as the time interval between onset of the Q wave and the peak of the R wave. The T wave corresponds to repolarization of the ventricles to the resting state. The time interval between onset of the Q wave and completion of the T wave is known as the QT interval and represents the amount of time over which the heart is partially or fully electrically excited during the cardiac cycle. The time interval between successive R peaks (the RR interval) determines the instantaneous heart rate. Abnormalities of the shape, amplitude, and other features of these waves and intervals can reflect underlying heart disease, and there has been considerable effort in developing algorithms that can be used to automatically analyze ECGs to detect these peaks and intervals. Table 1 lists the algorithms available in the current release of WaveformECG.20,21

Table 1. Algorithms available in the current release of WaveformECG.
sqrs/sqrs2csv (Physionet). Detects onset and offset times of the QRS complex in single leads; the second implementation produces output in CSV format.
wqrs/wqrs2csv (Physionet). Detects onset and offset times of the QRS complex in single leads using the length transform; the second implementation produces output in CSV format.
ihr, sqrs and wqrs implementations (Physionet). Computes successive RR intervals (instantaneous heart rate); requires input from sqrs or wqrs.
pNNx, sqrs and wqrs implementations (Physionet). Calculates time-domain measures of heart rate variability; requires input from sqrs or wqrs.
QT Screening (Yuri Chesnokov and colleagues20). Detects successive QT intervals based upon high- or low-pass filtering of ECG waveforms; works with data in WFDB format.
QRS-Score (David Strauss and colleagues21). Produces the Strauss-Selvester QRS arrhythmia risk score based on certain criteria derived from GE MUSE analysis.
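Client code dispatching work to one of these deployed services can be sketched with Axis2's ServiceClient API; the endpoint URL, namespace, operation, and payload element names below are illustrative placeholders, not the platform's actual service contract.

    import org.apache.axiom.om.OMAbstractFactory;
    import org.apache.axiom.om.OMElement;
    import org.apache.axiom.om.OMFactory;
    import org.apache.axiom.om.OMNamespace;
    import org.apache.axis2.addressing.EndpointReference;
    import org.apache.axis2.client.Options;
    import org.apache.axis2.client.ServiceClient;

    public class AnalysisClient {
        public static void main(String[] args) throws Exception {
            ServiceClient client = new ServiceClient();
            Options options = new Options();
            // Hypothetical endpoint for a deployed QT-screening service.
            options.setTo(new EndpointReference(
                "http://localhost:8080/axis2/services/QtScreening"));
            client.setOptions(options);

            // Build a minimal request payload naming the record to analyze.
            OMFactory factory = OMAbstractFactory.getOMFactory();
            OMNamespace ns = factory.createOMNamespace("http://example.org/ecg", "ecg");
            OMElement request = factory.createOMElement("detectQtIntervals", ns);
            OMElement record = factory.createOMElement("recordId", ns);
            record.setText("s0027lre");
            request.addChild(record);

            OMElement response = client.sendReceive(request); // blocking call
            System.out.println(response);
        }
    }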
Data Visualization
The visualize interface lets users examine and interact with stored ECG data. This feature also provides a mechanism for manually annotating waveforms. When the user selects a file to view in the visualization screen, it initially displays the data as a series of graphs, one for each lead in the dataset (15 leads for the GE MUSE dataset shown in Figure 4). A calibration pulse with a 1-mV amplitude and a 200-msec duration is displayed in the left-most panel. Initially, four leads are displayed, but additional leads can be viewed by grabbing and dragging the window scroll bar located on the right side of the browser display. Whenever the cursor is positioned within a display window, its x-y location is marked by a filled red dot, and time-amplitude values at that location are displayed at the bottom of the panel. Cursor display in all graphs is synchronized so that as the user navigates through one graph, the others update with it. The lead name and the number of annotations for that lead signal are displayed in each graph. File metadata, including subject ID, lead count, sampling rate, and ECG duration, are displayed above the graphs. WaveformECG supports scrolling through waveforms: clicking the "next" button at the top of the display steps through time in increments of 1.2 seconds, and users can jump to a particular time point by entering the time value into the panel labeled "jump to time (sec)."

By clicking on a lead graph, users can expand the view to see the data for that lead in detail, including any annotations that have been entered manually. A list of analysis results on the lead is displayed in a table at the left of the view, and the graph is displayed in the center right. In Figure 5a, WaveformECG displays part of the analysis results extracted from a Philips XML upload. While requesting the analysis results displayed, the visualize interface also checks the data originally returned from OpenTSDB to see if any annotations exist for the time frame displayed and, if so, Dygraphs (http://dygraphs.com), an open source JavaScript charting library, renders the annotation display on the screen.

Figure 5. Visualization. (a) For lead II in a 12-lead ECG in Philips format, the table under "Analysis Results" displays the results of automated data processing by the Philips system used to collect this ECG. In the waveform graph, A denotes a QT interval annotation, with the yellow bar representing the interval itself. This annotation was made manually. The 1 denotes an R peak annotation, also made manually. All interval and point annotations are listed below the graph. (b) In the manual annotation interface, the R peak is highlighted and the information in the center shows the definition returned for that term selection. In addition, there are details about the ECG and the point at which the annotation was made. To create these displays, the visualize interface initiates a RESTful call to OpenTSDB to retrieve the first 2.5 seconds of time-series data associated with all the leads in the file. Dygraphs, an open source JavaScript charting library, generates each of the graphs displayed.

Figure 5a shows examples of the two types of supported waveform annotations: point annotations are associated with a specific ECG time-value pair—in this case, the time-value pair corresponding to the peak of the R wave, labeled with the number 1—and interval annotations are associated with a particular time interval. The user can scroll through the individual lead data using the slider control at the bottom of the display or the navigation buttons. There's also a feature to jump to a specific point in time. Zooming can be performed using the slider bar at the bottom of the screen. To restore the original view of the graph, the user can double-click on it. Manual annotations can be added by clicking in the graph screen. As typing commences, a JavaScript application developed by the NCBO provides a list of terms in the target ontology that match the typed text. The user can then select a term from that list. Upon selection of a term, the lower box in the right-center screen updates with the term and the term definition retrieved from Bioportal. The user can then enter a comment in the text field below that describes any additional information to be included with the annotation. Upon completion of term selection and comment entry, the user clicks the "save" button. In Figure 5b, this button is grayed out because the figure shows the result of clicking on an existing annotation. This lets users delve into the details of existing annotations and see any comments entered previously in the comment box for the annotation.
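The term lookup behind this autocomplete can be reproduced against the NCBO Bioportal REST search service; a minimal sketch, assuming a valid (free) Bioportal API key and that the Electrocardiography Ontology's Bioportal acronym is ECG:

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;

    public class BioportalTermSearch {
        public static void main(String[] args) throws Exception {
            String apiKey = "YOUR_API_KEY"; // Bioportal requires an API key
            String typed = URLEncoder.encode("R wave", StandardCharsets.UTF_8);

            // Search one ontology for classes matching the text typed so far.
            URI uri = URI.create("http://data.bioontology.org/search?q=" + typed
                + "&ontologies=ECG&apikey=" + apiKey);

            HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(uri).GET().build(),
                HttpResponse.BodyHandlers.ofString());
            // JSON result: matching classes with their IDs and definitions,
            // from which an ontology ID can be stored with the annotation.
            System.out.println(response.body());
        }
    }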
Figure 6. Integration with a clinical data warehouse. A split screen shows information sent from I2B2 (left) to WaveformECG.
In the expanded EKG annotations folder, three analysis results can be returned from WaveformECG to I2B2.
RESTful call to OpenTSDB, searching for analysis results linked with files in the Eureka folder. Once found, analysis results along with their ECG ontology IDs are transferred to Eureka, where they are reorganized into a format acceptable for automatic loading into I2B2. A subset of those results can be seen in Figure 6 under the "EKG annotations" folder in the I2B2 "navigate terms" window.

WaveformECG Case Study
Sudden cardiac death (SCD) accounts for 200,000 to 450,000 deaths in the US annually.24 Current screening strategies fail to detect roughly 80 percent of those who die suddenly. The ideal screening method for increased risk should be simple, inexpensive, and reproducible in different settings so that it can be performed routinely in a physician's office, yet be both sensitive and specific. A recent study has shown that features computed from the 12-lead ECG known as the QRS score and QRS-T angle can be used to identify patients with fibrotic scars (determined using late-gadolinium enhancement magnetic resonance imaging) with 98 percent sensitivity and 51 percent specificity.25 Motivated by these findings, we assisted in a large-scale screening of all ECGs obtained over a six-month period at two large hospital systems. The challenges faced in this study were the large number of subjects and ECGs (~35,000) to be managed and analyzed, the use of different ECG instrumentation and thus different data formats at the two sites, and the fact that instrument vendors don't make either of the algorithms to be tested available in their systems. WaveformECG proved to be a powerful platform for supporting this study. The QRS score and QRS-T angle algorithms were implemented and deployed, making it possible for the research team to quickly select and analyze ECGs from different sites. The two ECG-based features were shown to be a useful initial method (a sensitivity of 70 percent and a specificity of 55 percent) for identifying those at risk of SCD in the population of patients having preserved left ventricular ejection fraction (LVEF > 35 percent).

Other physiological time-series data arise in many other healthcare applications. Blood pressure waveforms, peripheral capillary oxygen saturation, respiratory rate, and other physiological signals are measured from every patient in the modern hospital, particularly those in critical care settings. Currently in most hospitals, these data are "ephemeral," meaning they appear on the bedside monitor and then disappear. These data are among
the most actionable in the hospital because they reflect the patient's moment-to-moment physiological functioning. Capturing these data and understanding how they can be used along with other data from the electronic health record to more precisely inform patient interventions has the potential to significantly improve healthcare outcomes. In future work, we will extend WaveformECG to serve as a general-purpose platform for working with other types of physiological time-series data.

Acknowledgments
Development of WaveformECG was supported by the National Heart, Lung and Blood Institute through NIH R24 HL085343, NIH R01 HL103727, and as a subcontract of NIH U54HG004028 from the National Center for Biomedical Ontology.

References
1. B. Surawicz et al., "AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part III: Intraventricular Conduction Disturbances," Circulation, vol. 119, 17 Mar. 2009, pp. e235–e240.
2. P.M. Rautaharju et al., "AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part IV: The ST Segment, T and U Waves, and the QT Interval," Circulation, vol. 119, 17 Mar. 2009, pp. e241–e250.
3. G.S. Wagner et al., "AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part VI: Acute Ischemia/Infarction," J. Am. College Cardiology, vol. 53, 17 Mar. 2009, pp. 1003–1011.
4. E.W. Hancock et al., "AHA/ACCF/HRS Recommendations for the Standardization and Interpretation of the Electrocardiogram: Part V: Electrocardiogram Changes Associated with Cardiac Chamber Hypertrophy," Circulation, vol. 119, 17 Mar. 2009, pp. e251–e261.
5. R. Winslow et al., "The CardioVascular Research Grid (CVRG) Project," Proc. AMIA Summit on Translational Bioinformatics, 2011, pp. 77–81.
6. M.A. Musen et al., "BioPortal: Ontologies and Data Resources with the Click of a Mouse," Proc. Am. Medical Informatics Assoc. Ann. Symp., 2008, pp. 1223–1224.
7. S.N. Murphy et al., "Serving the Enterprise and Beyond with Informatics for Integrating Biology and the Bedside (I2B2)," J. Am. Medical Informatics Assoc., vol. 17, no. 2, 2010, pp. 124–130.
8. D.E. Bild et al., "Multi-ethnic Study of Atherosclerosis: Objectives and Design," Am. J. Epidemiology, vol. 156, 1 Nov. 2002, pp. 871–881.
9. E.B. Lynch et al., "Cardiovascular Disease Risk Factor Knowledge in Young Adults and 10-Year Change in Risk Factors: The Coronary Artery Risk Development in Young Adults (CARDIA) Study," Am. J. Epidemiology, vol. 164, 15 Dec. 2006, pp. 1171–1179.
10. A. Cheng et al., "Protein Biomarkers Identify Patients Unlikely to Benefit from Primary Prevention Implantable Cardioverter Defibrillators: Findings from the Prospective Observational Study of Implantable Cardioverter Defibrillators (PROSE-ICD)," Circulation: Arrhythmia and Electrophysiology, vol. 7, no. 12, 2014, pp. 1084–1091.
11. J.X. Yuan, Liferay Portal Systems Development, Packt Publishing, 2012.
12. R. Ananthakrishnan et al., "Globus Nexus: An Identity, Profile, and Group Management Platform for Science Gateways and Other Collaborative Science Applications," Proc. Int'l Conf. Cluster Computing, 2013, pp. 1–3.
13. B. Sigoure, "OpenTSDB: The Distributed, Scalable Time Series Database," Proc. Open Source Convention, 2010; http://opentsdb.net/misc/opentsdb-oscon.pdf.
14. R.C. Taylor, "An Overview of the Hadoop/MapReduce/HBase Framework and Its Current Applications in Bioinformatics," BMC Bioinformatics, vol. 11, 2010, p. S1.
15. C. Pautasso, RESTful Web Services: Principles, Patterns, Emerging Technologies, Springer, 2014, pp. 31–51.
16. K. Douglas and S. Douglas, PostgreSQL: A Comprehensive Guide to Building, Programming, and Administering PostgreSQL Databases, SAMS Publishing, 2003.
17. G.B. Moody, R.G. Mark, and A.L. Goldberger, "PhysioNet: A Web-Based Resource for the Study of Physiologic Signals," IEEE Eng. Medicine and Biology Magazine, vol. 20, no. 3, 2001, pp. 70–75.
18. D. Jayasinghe and A. Azeez, Apache Axis2 Web Services, Packt Publishing, 2011.
19. G.B. Moody, R.G. Mark, and A.L. Goldberger, "PhysioNet: Physiologic Signals, Time Series and Related Open Source Software for Basic, Clinical, and Applied Research," Proc. Conf. IEEE Eng. Medicine and Biology Soc., 2011, pp. 8327–8330.
20. Y. Chesnokov, D. Nerukh, and R. Glen, "Individually Adaptable Automatic QT Detector," Computers in Cardiology, vol. 33, 2006, pp. 337–341.
21. D.G. Strauss et al., "Screening Entire Health System ECG Databases to Identify Patients at Increased Risk of Death," Circulation: Arrhythmia and Electrophysiology, vol. 6, no. 12, 2013, pp. 1156–1162.
22. A. Post et al., "Semantic ETL into I2B2 with Eureka!," AMIA Summit Translational Science Proc., 2013, pp. 203–207.
23. M. Saeed et al., "Multiparameter Intelligent Monitoring in Intensive Care II: A Public-Access Intensive Care Unit Database," Critical Care Medicine, vol. 39, no. 5, 2011, pp. 952–960.
24. J.J. Goldberger et al., "American Heart Association/American College of Cardiology Foundation/Heart Rhythm Society Scientific Statement on Noninvasive Risk Stratification Techniques for Identifying Patients at Risk for Sudden Cardiac Death: A Scientific Statement from the American Heart Association Council on Clinical Cardiology Committee on Electrocardiography and Arrhythmias and Council on Epidemiology and Prevention," Circulation, vol. 118, 30 Sept. 2008, pp. 1497–1518.
25. D.G. Strauss et al., "ECG Quantification of Myocardial Scar in Cardiomyopathy Patients with or without Conduction Defects: Correlation with Cardiac Magnetic Resonance and Arrhythmogenesis," Circulation: Arrhythmia and Electrophysiology, vol. 1, no. 12, 2008, pp. 327–336.

Raimond L. Winslow is the Raj and Neera Singh Professor of Biomedical Engineering and director of the Institute for Computational Medicine at Johns Hopkins University. His research interests include the use of computational modeling to understand the molecular mechanisms of cardiac arrhythmias and sudden death, as well as the development of informatics technologies that provide researchers secure, seamless access to cardiovascular research study data and analysis tools. Winslow is principal investigator of the CardioVascular Research Grid Project and holds joint appointments in the departments of Electrical and Computer Engineering, Computer Science, and the Division of Health Care Information Sciences at Johns Hopkins University. He's a Fellow of the American Heart Association, the Biomedical Engineering Society, and the American Institute for Medical and Biological Engineers. Contact him at [email protected].

Stephen Granite is the director of database and software development of the Institute for Computational Medicine at Johns Hopkins University. He's also the program manager for the CardioVascular Research Grid Project. Granite has an MS in computer science with a focus in bioinformatics and an MS in business administration with a focus in competitive intelligence, both from Johns Hopkins University. Contact him at [email protected].

Christian Jurado is a software engineer in the Institute for Computational Medicine at Johns Hopkins University. He's also lead developer of WaveformECG for the CardioVascular Research Grid Project. Jurado has a BS in computer science, specializing in Java Web development and Liferay. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.
COMPUTATIONAL CHEMISTRY
Chemical kinetics has played a critical role in understanding phenomena such as global climate change and
photochemical smog, and researchers use it to analyze chemical reactors and alternative fuels. When computing is
applied to the development of detailed chemical kinetic models, it allows scientists to predict the behavior of these
complex chemical systems.
The 1995 Nobel Prize in Chemistry was awarded to Paul J. Crutzen, Mario J. Molina, and F. Sherwood Rowland "for their work in atmospheric chemistry, particularly concerning the formation and decomposition of ozone."1 Molina and Rowland performed calculations predicting that chlorofluorocarbon (CFC) gases being released into the atmosphere would lead to the depletion of the ozone layer. Because the ozone layer absorbs ultraviolet light, its depletion would lead to an increase in ultraviolet light on the Earth's surface, resulting in an increase in skin cancer and eye damage in humans. The subsequent international treaty, the Montreal Protocol on Substances that Deplete the Ozone Layer, was universally adopted and phased out the production of CFCs; it serves as an exemplar of public policy being informed by science.

The underlying calculations used by Molina and Rowland have their basis in chemical kinetics, which concerns the rate at which chemical reactions occur. When a chemical reaction (such as the combustion of methane) takes place, the overall reaction might appear simple—such as CH4 + 2O2 = CO2 + 2H2O—but the actual chemistry is typically much more complex (for details, see the "Related Work in Chemical Kinetics" sidebar). An accurate analysis of the underlying combustion phenomenon requires consideration of all the species (molecules) and elementary reactions, which could number in the hundreds and thousands, respectively. Other applications of kinetics include controlling photochemical smog through emissions regulations on automobiles and factories and the development of alternative fuels for the internal combustion engine. The experimental testing needed for fuel certification is expensive and time-consuming, leading to the development of computational approaches to minimize the experimental space.

The purpose of this article is to acquaint computer scientists with the application of computing in chemical kinetics and to outline future challenges, such as the need to couple kinetics and transport phenomena to obtain more accurate predictions. We focus on gas-phase chemistry; that is, all the chemicals are gases. Figure 1 shows the flow of the chemical kinetic computation. Aspects of this are similar to hardware and software design flows, where the upstream steps have considerable impact on the downstream steps, both in terms of computation time and quality of solution. This article's organization mirrors the steps in the flowchart, with sections on mechanism generation, consistency and completeness analysis, and mechanism reduction.

Figure 1. Mechanism generation flow chart. The kineticist uses chemical insights gained through experiments and theory to develop rate rules, which are used by the generation algorithm to create a new reaction mechanism. The reaction mechanism is checked for consistency and completeness. This is followed by validation procedures. If the mechanism fails the validation procedures—that is, if predictions obtained by running ordinary differential equation (ODE) solvers don't match experimental data—the process must be repeated. Upon passing the validation procedures, the size of the mechanism is reduced, giving a final mechanism. This mechanism can be used along with computational fluid dynamics (CFD) computation to perform accurate system simulations.
Related Work in Chemical Kinetics (sidebar)
For a general reaction aA + bB → cC + dD, the rate can be expressed in terms of any participant:

−(1/a) d[A]/dt = −(1/b) d[B]/dt = (1/c) d[C]/dt = (1/d) d[D]/dt.

The field of chemical kinetics was pioneered by Cato M. Guldberg and Peter Waage, who observed in 1864 that rates are related to the concentrations of the reactants. Typically,

rate ∝ [A]^x [B]^y = k [A]^x [B]^y,

where k is the rate constant. These are then numerically integrated to give, for each species, a description of the concentration variation with time. The resulting predictions are compared to experimental data to assess the proposed reaction mechanism's suitability. Although our example reaction consists of five reactions and five species, a complex reaction can contain thousands of species and tens of thousands of reactions, requiring efficient integration algorithms. A description of the ODE solvers and the electronic structure calculations used to determine rate constants, while crucial to this technology, is beyond this article's scope.
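The integration step the sidebar describes can be illustrated with a toy sketch, not taken from the article: integrating the rate law for a single reaction A + B → products, d[A]/dt = d[B]/dt = −k[A][B], with the explicit Euler method. Production kinetics codes use robust stiff ODE solvers instead.

    public class RateLawEuler {
        public static void main(String[] args) {
            double k = 2.0;          // rate constant, illustrative value (L/mol/s)
            double a = 1.0, b = 0.5; // initial [A], [B] in mol/L
            double dt = 1e-4;        // time step in seconds

            // Explicit Euler on d[A]/dt = d[B]/dt = -k[A][B].
            for (int step = 0; step <= 10_000; step++) {
                if (step % 2_000 == 0) {
                    System.out.printf("t=%.2f s  [A]=%.4f  [B]=%.4f%n", step * dt, a, b);
                }
                double rate = k * a * b;
                a -= rate * dt;
                b -= rate * dt;
            }
        }
    }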
Mechanism Generation
The automated development of a reaction mechanism involves starting with a set of reactants and determining the reactions they participate in, the intermediate species that are generated, and, ultimately, the products that are obtained. This is an inherently iterative process because intermediate species might themselves participate in reactions, resulting in the formation of new species, which might in turn react with other species to generate even more species, and so on. This process can theoretically continue indefinitely, resulting in a combinatorial explosion of species and reactions. In practice, criteria are needed for two reasons: to decide when to terminate the process and to identify the most chemically important reactions and species.
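In outline, generation is a worklist loop; the sketch below represents species by canonical labels and uses a stub, applyReactionMatrices, standing in for the matrix machinery described in the next section. Both names are hypothetical.

    import java.util.HashSet;
    import java.util.Set;

    public class MechanismGenerator {
        // Stub for the BE/R-matrix machinery: apply every reaction matrix to
        // all combinations of species in the pool; return product labels.
        static Set<String> applyReactionMatrices(Set<String> pool) {
            return new HashSet<>(); // placeholder
        }

        // Worklist loop: grow R0 to R0 ∪ R1, then R0 ∪ R1 ∪ R2, and so on,
        // until no new species appear or an iteration cap is hit. A
        // rate-based criterion would additionally prune species formed
        // too slowly to matter chemically.
        static Set<String> generate(Set<String> r0, int maxIterations) {
            Set<String> known = new HashSet<>(r0);
            for (int iter = 0; iter < maxIterations; iter++) {
                Set<String> products = applyReactionMatrices(known);
                products.removeAll(known);     // keep only genuinely new species
                if (products.isEmpty()) break; // fixed point: terminate
                known.addAll(products);
            }
            return known;
        }
    }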
Bond-Electron and Reaction Matrices
We now describe the use of matrices to generate products from a set of reactants and reaction types.2,3 The bond-electron (BE) matrix represents a species and is a variation of the classical adjacency matrix used to represent graphs. Specifically,

■ graph vertices are augmented with labels to denote atoms, such as C for carbon, H for hydrogen, and O for oxygen; and
■ multiple edges are permitted between vertices to account for bond order.

For example, a pair of C atoms can be joined by a single bond, a double bond, or a triple bond; these three cases must be distinguishable. Bond formation is governed by the participating atoms' valences—that is, the number of unpaired electrons in their outermost shells. The valences for C, O, and H are 4, 2, and 1, respectively. A single bond is formed by the contribution of one unpaired electron from each of the two participating atoms.

Element Mij in an n × n BE matrix M of a molecule with n atoms denotes the number of bonds between atoms i and j, when i ≠ j. The diagonal element Mii (typically zero in an adjacency matrix) denotes the number of free electrons of atom i that aren't used in its bonds. The sum of the elements in the ith row then gives the valence of atom i. Figure 2 illustrates the concept.

A reaction (R) matrix is used to capture the bond changes associated with a certain type of reaction. Several well-known reaction types exist—including hydrogen abstraction, β-scission, and recombination—and each has well-defined behavior with respect to the bond changes that occur in the reaction. This is illustrated using hydrogen abstraction—that is, the removal (abstraction) of an H atom from a molecule by a radical (a species with an atom with an unpaired electron in its outermost shell)—as follows:

X–H + Y* → X* + Y–H.

Here, the radical Y* abstracts the H atom from the molecule X–H, giving the molecule Y–H and the radical X*. The bonds associated with three atoms are impacted by this reaction:

■ Xa, the atom in X whose bond with the H atom is broken;
■ the H atom itself; and
■ Yb, the atom in Y with the unpaired electron, which forms a bond with the H atom.

These are reflected in the H-abstraction reaction matrix, in which −1 marks a bond broken (or a free electron consumed) and +1 marks a bond formed (or a free electron created):

        Xa   H   Yb
  Xa   +1  −1    0
  H    −1   0   +1
  Yb    0  +1   −1

Figure 3 illustrates in detail how the products of the reaction in Figure 2 are generated.

Termination and Selection
Let the initial set of reactants be R0. Assume that reaction matrices are applied to all possible combinations of reactants to generate products as described in the previous section. Let the set of new products generated be R1. We now repeat the process on R0 ∪ R1 to generate R2.

This process can be repeated indefinitely. The challenge is twofold: to determine which criteria should be used to terminate this process and to identify how to select chemically significant species and reactions, while leaving chemically insignificant ones out. The rate-based approach,4 which has found favor within the kinetics community, uses
Figure 2. Hydrogen (H) abstraction reaction example, CH4 + OH* → CH3* + H2O, and bond-electron (BE) matrices for all participating species: (a) and (b) the reactants methane and hydroxyl radical, and (c) and (d) the products methyl radical and water. The * denotes an atom with an unpaired electron. Each atom has a label used later to identify it uniquely:

(a) CH4               (b) OH*
      1C 2H 3H 4H 5H        AO BH
  1C   0  1  1  1  1    AO   1  1
  2H   1  0  0  0  0    BH   1  0
  3H   1  0  0  0  0
  4H   1  0  0  0  0
  5H   1  0  0  0  0

(c) CH3*              (d) H2O
      1C 3H 4H 5H           AO BH 2H
  1C   1  1  1  1       AO   0  1  1
  3H   1  0  0  0       BH   1  0  0
  4H   1  0  0  0       2H   1  0  0
  5H   1  0  0  0
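The product-generation arithmetic that Figure 3 walks through, BE(products) = BE(reactants) + R, can be sketched directly. The 3 × 3 submatrices below cover only the three affected atoms, with rows and columns ordered (Xa = 1C, 2H, Yb = AO); this alignment of atoms to rows is assumed here for illustration.

    import java.util.Arrays;

    public class ReactionMatrixDemo {
        public static void main(String[] args) {
            // BE submatrix of the reactants CH4 + OH* restricted to the
            // affected atoms; diagonals hold unpaired (free) electrons.
            int[][] be = {
                {0, 1, 0},  // 1C: no free electrons, bonded to 2H
                {1, 0, 0},  // 2H: bonded to 1C
                {0, 0, 1},  // AO: one unpaired electron (the radical site)
            };
            // H-abstraction R matrix: break Xa-H (-1), form H-Yb (+1);
            // Xa gains a free electron (+1), Yb consumes one (-1).
            int[][] r = {
                {+1, -1,  0},
                {-1,  0, +1},
                { 0, +1, -1},
            };
            int[][] product = new int[3][3];
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 3; j++)
                    product[i][j] = be[i][j] + r[i][j];
            // product is {{1,0,0},{0,0,1},{0,1,0}}: 1C is now a radical site
            // (CH3*) and 2H is bonded to AO (part of H2O). A breadth-first
            // search over the full summed matrix would then split it into
            // connected components, one per product molecule.
            for (int[] row : product) System.out.println(Arrays.toString(row));
        }
    }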
Figure 3. Reaction matrices for the reaction shown in Figure 2. (a) We combine the BE matrices of the reactants CH4 and OH and place boxes around matrix elements that will be impacted by the succeeding steps. (b) The expanded reaction matrix. (c) The result of adding the matrices from (a) and (b), with boxes around the elements that are affected by the addition. (d) Rows and columns are reordered, giving the products CH3 and H2O, by identifying connected components of the graph using, for example, breadth-first search.

Estimating Rate Constants
We now briefly describe how rate constants are estimated. Functional groups are specific groups of atoms within molecules that are responsible for the characteristic chemical behaviors of those molecules. For example, all acids (for example, HCl and H2SO4) contain the H atom, and all alkalies (such as NaOH and KOH) contain the OH grouping. Reaction rate constants and other thermochemical properties that are required to set up the system of ODEs can be estimated from the functional groups that participate in a reaction. Estimates are required because the direct measurement of these quantities is often impractical.

Functional groups are represented in a rooted tree data model,5 in which the root represents a general functional group and its descendants represent more specialized groups. Figure 4 shows a portion of a functional group tree for classifying carbonyls (a carbon atom double-bonded to an oxygen atom). The more specialized the knowledge of functional groups in a reaction, the more accurate the rate constant estimate that can be obtained.

Consistency and Completeness Analysis
Mechanisms generated using these techniques must be checked for consistency and completeness. Due to the large sizes of the mechanisms, kineticists use software tools to verify accuracy and completeness. Tools that automatically classify reactions into
or product bond. A bit set to 0 indicates the bond shouldn't be broken, while a bit set to 1 indicates the bond should be broken. Figure 5c shows a bit string corresponding to the reaction of Figure 5b (in which each bond is labeled to indicate its position in the bit string). In Figure 5c, the bit string indicates that the reactant bond labeled 3 and the product bond labeled 5 must be broken. We then use canonical labels to see whether this break results in the LHS becoming equal to the RHS; in this case, it does, giving us a mapping of cost 2, which is optimal. Note that there are bit patterns (such as 100000000) that don't give LHS = RHS and therefore aren't valid mappings. Also, because the reaction is balanced, breaking all the bonds—that is, a bit pattern with all 1s—guarantees the existence of a mapping.

Of course, we don't know a priori which of the 2^b subsets of b bonds will result in an optimal mapping. A relatively simple, but remarkably effective, approach is to try all the C(b, i) bit patterns with i 1s, with i going from 0 to b, stopping as soon as a mapping is found. In the worst case, as mentioned earlier, this approach is guaranteed to find a mapping only when i = b and the bit pattern consists of all 1s, resulting in an exponential time complexity. In practice, the minimum number of bond changes in a chemical reaction is small. For hydrogen abstraction, this quantity is two, which means that the number of bit patterns examined by the algorithm is bounded by O(b^2), a polynomial.
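That search order can be sketched as follows, with a hypothetical isValidMapping predicate standing in for the canonical-label LHS = RHS test; bit patterns are enumerated by increasing popcount, so the first valid pattern found has the minimum (optimal) number of bond changes.

    public class ArmSearch {
        // Stub for the canonical-label test: does breaking the bonds flagged
        // in pattern make the reactant and product sides identical?
        static boolean isValidMapping(long pattern) {
            return Long.bitCount(pattern) >= 2; // placeholder only
        }

        // Try all C(b, i) patterns with i bits set, for i = 0..b (b < 63).
        static long findOptimalMapping(int b) {
            for (int i = 0; i <= b; i++) {
                long pattern = (i == 0) ? 0 : (1L << i) - 1; // smallest i-bit pattern
                do {
                    if (isValidMapping(pattern)) return pattern; // cost i is optimal
                    pattern = nextWithSamePopcount(pattern);
                } while (pattern != 0 && pattern < (1L << b));
            }
            return (1L << b) - 1; // all 1s always maps for a balanced reaction
        }

        // Gosper's hack: next larger integer with the same number of set bits.
        static long nextWithSamePopcount(long v) {
            if (v == 0) return 0;
            long c = v & -v; // lowest set bit
            long r = v + c;  // carry over the lowest run of 1s
            return (((r ^ v) >> 2) / c) | r;
        }
    }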
bond broken or formed was connected to a hydro-
bit patterns with i 1s, with i going from 0 to b, gen atom using a method from the Mapping class.
stopping as soon as a mapping is found. In the Statement (7) verifies that a hydrogen atom moved
worst case, as mentioned earlier, this approach is from a stable to a radical reactant using a method
guaranteed to find a mapping when i = b when the from the Mapping class. Statements (8) and (9) verify
bit pattern consists of all 1s, resulting in an expo- that exactly one reactant bond was broken and one
nential time complexity. In practice, the minimum product bond was formed using methods from the
number of bond changes in a chemical reaction is Reaction class. Notice that rules (8) and (9) pertain to
small. For hydrogen abstraction, this quantity is bonds broken and formed in the reaction obtained
two, which means that the number of bit patterns using the ARM techniques described earlier.
examined by the algorithm is bounded by O(b2), a The complicated nature of gas-phase reaction
polynomial. systems makes it impractical to devise a set of rules
that classify all reactions. Unclassified reactions are
Reaction Classification important in their own right because they allow
A classified and sorted reaction mechanism can be the kineticist to focus on problems in mechanisms
used to that have failed validation procedures. Our system
was able to determine the classification for about
■ check for completeness in the mechanism, 95 percent of the reactions in a set of benchmark
■ check the consistency of rate coefficient assignments, combustion mechanisms.
■ focus on unclassified reactions when looking
for problems if validation fails, and Mechanism Reduction
■ compare multiple mechanisms that model the After consistency and completeness analysis, the
same phenomena. system of ODEs is solved and concentration-time
profiles are generated. With improvements in com-
Reaction classification is based on rules asso- puting hardware and ODE solver algorithms, these
ciated with the properties of the reaction and its large systems are now solved routinely. The mecha-
species. These rules can be recorded in a rule-based nisms are validated by comparing the predictions
system such as Jess, which allows for rule modifi- with available data. Many of the problems of interest
cation without requiring recompilation of the soft- require coupling of the validated kinetic mechanism
6. J. Crabtree and D. Mehta, "Automated Reaction Mapping," J. Experimental Algorithmics, vol. 13, 2009, article no. 15.
7. T. Kouri et al., "RCARM: Reaction Classification Using ARM," Int'l J. Chemical Kinetics, vol. 45, no. 2, 2013, pp. 125–139.
8. T. Lu and C.K. Law, "A Directed Relation Graph Method for Mechanism Reduction," Proc. Combustion Institute, vol. 30, no. 1, 2005, pp. 1333–1341.
9. S.W. Churchill, "Interaction of Chemical Reactions and Transport. 1. An Overview," Industrial & Eng. Chemistry Research, vol. 44, no. 14, 2005, pp. 5199–5212.

Dinesh P. Mehta is professor of electrical engineering and computer science at the Colorado School of Mines. His research interests include applied algorithms, VLSI design automation, and cheminformatics. Mehta received a PhD in computer and information science from the University of Florida. He's a member of IEEE and the ACM. Contact him at [email protected].

Anthony M. Dean is a professor of chemical engineering and vice president for research at the Colorado School of Mines. His research interests include quantitative kinetic characterization of reaction networks in a variety of systems. Dean received a PhD in physical chemistry from Harvard University. He's a member of the American Chemical Society, the American Institute of Chemical Engineers, and the Combustion Institute. Contact him at [email protected].

Tina M. Kouri is a research and development computer scientist at Sandia National Labs. Her research interests include applied algorithms and cheminformatics. Kouri received a PhD in mathematical and computer sciences from the Colorado School of Mines. She's a member of the ACM. Contact her at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.
CLOUD COMPUTING
Sanjay Sareen | Guru Nanak Dev University, Amritsar, India, and IK Gujral Punjab Technical University, Kapurthala, India
Sandeep K. Sood | Guru Nanak Dev University Regional Campus, Gurdaspur, India
Sunil Kumar Gupta | Beant College of Engineering and Technology, Gurdaspur, India

Automatic detection of an epileptic seizure before its occurrence could protect patients from accidents or even save lives. A framework that automatically predicts seizures can exploit cloud-based services to collect and analyze EEG data from a patient's mobile phone.
Epilepsy is a disorder that affects the brain, causing seizures. During a seizure, a patient could lose consciousness, including while walking or driving a vehicle, which could result in significant injury or death. According to a recent survey, the main causes of death for epileptic patients include sudden unexpected death in epilepsy, drowning, and accidents, which account for 89 percent of total epilepsy-related deaths in Australia.1 Such patients can benefit from an alert before the start of a seizure or emergency treatment when they have a seizure, thus improving their quality of life and safety considerably.

In a clinical study, Brian Litt and colleagues2 observed that an increase in the amount of abnormal electrical activity occurs before a seizure's onset. One of the most important steps to protect the life of an epileptic patient is the early detection of seizures, which can help patients take precautionary measures and prevent accidents. It has also been observed that abnormal electrical activity builds up during the transition from a normal state to an ictal state (mid-seizure); to detect a seizure, the electrical activity in the patient's brain therefore needs to be recorded continuously and efficiently around the clock. Electroencephalogram (EEG) is the most commonly used technique to measure electrical activities in the brain for the diagnosis of epileptic seizures.

In this direction, wireless sensor network (WSN) technology is emerging as one of the most promising options for real-time and continuous monitoring of chronically ill and assisted-living patients remotely, thus minimizing the need for caregivers. One of the important segments of WSN is body sensor networks (BSNs), which record the vital signs of a patient such as heart rate, electrocardiogram (ECG), and EEG. These wearable sensors are placed on the patient's body, and their key benefit is mobility—they enable the patient to move freely inside or outside the home. BSNs generate a huge amount of sensor data that needs to be processed in real time to provide timely help to the patient. Cloud computing provides the ability to store and analyze this rapidly generated sensor data in real time from sensors of different patients residing in different geographic locations. The cloud computing infrastructure integrated with BSNs makes it possible to monitor and analyze the sensor data of large numbers of epileptic patients around the globe efficiently and in real time.3 The cloud service provider is bound to provide an agreed-upon quality of service based on a service-level agreement (SLA), and appropriate compensation is paid to the customer if the required service levels aren't met.4 To protect the patient from accidents when a seizure occurs, ideally, family members could continuously monitor the patient everywhere, which isn't feasible under traditional circumstances. Hence, the main objectives of our proposed system are
■ to detect preictal variations that occur prior to seizure onset so that the patient can be warned in a timely manner before the start of a seizure, and
■ to alert the patient, his or her family members, and a nearby hospital for emergency assistance.

To achieve these objectives, we propose a model in which each patient is registered by entering personal information through a mobile phone, and a unique identification number (UID) is allocated to that person. The data from body sensors in digital form is collected through patients' mobile phones using Bluetooth technology. The fast Walsh-Hadamard transform (FWHT) is used to extract features or abnormalities from the EEG signal; these features are then reduced using higher-order spectral analysis (HOSA) and classified into normal, preictal, and ictal states using a Gaussian process classification algorithm.
Figure 1. The architecture of the proposed cloud-based seizure alert system for epileptic patients. The model integrates a wireless body sensor network, a mobile phone, cloud computing, and Internet technology to predict a seizure in real time irrespective of the patient's geographic location. (In the diagram, EEG sensor data reaches the patient's mobile phone over Bluetooth and the cloud over Wi-Fi/4G, where it passes through data collection, data validation, feature extraction, and feature classification; alert messages go out to a family member and a hospital.)
GPS is used to track the location of patients from their respective mobile phones. Whenever the system detects the preictal state of a patient, an alert message is generated and sent to the patient's mobile phone, a family member, and a nearby hospital, depending upon the location of the patient.

See the "Related Work in Seizure Prediction" sidebar for more information on the use of cloud computing and wireless BSNs to predict epileptic seizures.

Proposed Model
The proposed model consists of target patients, a BSN, data acquisition and transmission, data collection, seizure prediction, and GPS-based location tracking. The BSN consists of wearable EEG sensors placed on different parts of the scalp for capturing EEG signals. The data acquisition and transmission component comprises a smartphone and an Android-based application that capture data from body sensors and send it to the cloud along with the user's personal information, manually entered through the app. The data collection component is used to collect and store raw sensor data in a database and transforms it into a suitable form for further processing and analysis. It contains a cloud storage repository to store patients' personal information and their sensor data. The system assigns each user a UID at the time of registration. The seizure prediction component performs tasks such as data validation, feature extraction, and feature classification. The FWHT and higher-order statistics analyze and extract the feature set from the EEG signal. A Gaussian process classifies the feature set into normal, preictal, and ictal states of seizure. Based on the classification, the system can generate an alert message and send it to the hospital closest to the user's geographic location, a family member, and the actual user. The objective in sending an alert message to users' mobile phones is to encourage them to take precautionary measures to protect themselves from injuries. The GPS-based location-tracking component keeps track of the location of the patients with the help of their mobile phones. Figure 1 demonstrates the design of our proposed system for predicting and detecting seizures.

Data Acquisition and Transmission
The EEG sensor device contains one or more electrodes to detect the voltage of current flowing through the brain neurons from multiple segments of the brain. In our model, we use an Emotiv EPOC headset, which contains 14 sensors placed on the scalp to read signals from different areas of the brain.
The signals are sampled at 200 Hz using an analog-to-digital converter before being sent to a mobile phone. The raw data streams generated by EEG sensors are collected continuously in real time by the patient's own mobile phone using a wireless communication protocol. The mobile phone constitutes a wireless personal area network (WPAN) that receives data from the BSN. Bluetooth is used to transfer the data streams between Bluetooth-enabled devices over short distances. Several sensor devices can be connected to one Bluetooth server device (such as a mobile phone), which acts as a coordinator. An Android-based application collects digital sampled values from the body sensors. The mobile phone transmits the data to the cloud via a suitable communication protocol, such as Wi-Fi, 3G, or 4G networks.

Table 1. Personal attributes of a patient.
Serial number | Attribute | Data type
1 | Social Security Number | Integer
2 | Name | String
3 | Age | Integer
4 | Sex | String
5 | Address | String
6 | Mobile number | Integer
7 | Family member's name | String
8 | Family member's mobile number | Integer

Because the EEG signals are nonstationary and their frequency response varies with time, conventional methods based on time and frequency domains aren't suitable for seizure prediction.
Feature Extraction from EEG Signals
The EEG signal in its original form doesn't provide any information that can be helpful in detecting a seizure. The variation in signal pattern during different identifiable seizure states can be detected by applying an appropriate feature-extraction technique. Inadequate feature extraction might not provide good classification results, even though the classification method is highly optimized for the problem at hand. Several feature-extraction methods based on the time domain, the frequency domain, and wavelet transform (WT) features are available.

The FWHT expands the EEG signal over Walsh functions:

$$y_n = \frac{1}{N}\sum_{i=0}^{N-1} x_i\,\mathrm{WAL}(n,i),$$

where $i = 0, 1, \ldots, N-1$ and $\mathrm{WAL}(n)$ are Walsh functions.

The features extracted from the EEG signal in the form of FWHT coefficients are normalized to remove any possible errors that might occur due to inadequately extracted features. This can be done using the equation

$$np_i = \frac{y_i - \mu}{\sigma}, \quad \forall y_i,\; i = 1, 2, \ldots, n,$$

where $\mu$ and $\sigma$ are the mean and standard deviation, respectively, over all features.
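As an illustration only (the article doesn't give code for this step), the FWHT and the normalization can be sketched in Python; `signal` stands for one 4,096-point EEG epoch:

import numpy as np

def fwht(signal):
    """Iterative fast Walsh-Hadamard transform with 1/N scaling.
    Length must be a power of 2; output is in natural (Hadamard) order,
    so a sequency (Walsh) ordering would need an extra reordering step."""
    a = np.array(signal, dtype=float)
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a / len(a)

def normalize(coeffs):
    """z-score normalization: np_i = (y_i - mu) / sigma over all features."""
    return (coeffs - coeffs.mean()) / coeffs.std()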
Higher-order spectral analysis (HOSA). Seizure prediction of epileptic patients with a higher degree of accuracy and rapid response is a major challenge. In the preictal state, bursts of complex epileptiform discharges occur in the patient's EEG signal. These quantitative EEG changes can be detected by appropriate analysis of EEG signals. Higher-order statistics are widely used for the analysis of EEG and ECG data to diagnose faults in the human body such as tremor, epilepsy, and heart disease.5 However, EEG signals contain significant nonlinear and non-Gaussian features among signal frequency components.6 Existing techniques aren't sufficient in handling these nonlinear and non-Gaussian characteristics. HOSA is used to detect the distribution of the energy contained in a signal, and we use entropy parameters to characterize the irregularity (normal, preictal, and ictal) of the EEG signal. Different statistical characteristics are examined, and the entropy-based parameters listed in Equations 1 through 3 are considered to be the most important and distinctive for seizure state detection. Different entropy values of the normalized bispectrum are evaluated and can be represented mathematically as follows.

Normalized Shannon entropy (E1) is

$$E_1 = -\sum_i p_i \log p_i,$$
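For intuition, the normalized Shannon entropy of a bispectrum (or bicoherence) array can be computed as below; treating the magnitudes as a probability distribution normalized to sum to 1 is an assumption of this sketch, not a detail given in the article:

import numpy as np

def normalized_shannon_entropy(bispectrum):
    """E1 = -sum_i p_i log p_i, with p_i the normalized magnitudes."""
    mag = np.abs(bispectrum).ravel()
    p = mag / mag.sum()
    p = p[p > 0]                      # avoid log(0) for empty bins
    return -np.sum(p * np.log(p))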
Although several classification techniques are available for classifying EEG signals, we adopted the Gaussian process technique based on the Laplace approximation. We made this choice due to the fact that it can be applied to very large databases. In this technique, clustering is generally used prior to classifiers to prepare a training dataset for classifiers.

The Gaussian process classifier is used to model the three states of seizure class probabilities and is given by the equation

$$p(Y_i \mid f_i) = \exp(f_{i,c}) \left( \sum_{j=1}^{c} \exp\left(Y_i^T f_i\right) \right)^{-1},$$

where $f_i = [f_{i,1}, \ldots, f_{i,c}]^T$ is a vector of the latent function values related to data point $i$, and $Y_i = [y_{i,1}, \ldots, y_{i,c}]^T$ is the corresponding target vector, which has one entry for the correct class for observation $i$ and zero entries otherwise.
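The equation is, in effect, a softmax over the latent function values; a minimal sketch, assuming the latent vector has already been fitted via the Laplace approximation (the expensive step, omitted here):

import numpy as np

def class_probabilities(f_i):
    """Softmax of latent values f_i (length-c vector) -> class probabilities
    for the three seizure states (normal, preictal, ictal)."""
    z = f_i - f_i.max()               # subtract max for numerical stability
    expz = np.exp(z)
    return expz / expz.sum()

# Example: latent values favoring the preictal class
print(class_probabilities(np.array([0.2, 1.7, -0.4])))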
GPS-Based Location Tracking
The objective of location tracking is to identify the patient's location to provide him or her with immediate treatment whenever a seizure occurs. The mobile phone's GPS function is used to track the patient's location, which is sent to the cloud through the Internet. An alert message is generated before the triggering of the seizure and is sent to the patient's mobile phone, as well as to family members and a nearby hospital.
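A minimal sketch of the alert fan-out; the hospital registry, the `send_sms` callback, and the patient record fields are all hypothetical names for illustration:

import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two GPS fixes, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def dispatch_alerts(patient, hospitals, send_sms):
    """On a preictal classification, alert the patient, a family member,
    and the hospital nearest the patient's last GPS fix."""
    lat, lon = patient["last_fix"]
    nearest = min(hospitals,
                  key=lambda h: haversine_km(lat, lon, h["lat"], h["lon"]))
    msg = f"Seizure warning for patient {patient['uid']} at ({lat:.4f}, {lon:.4f})"
    for number in (patient["mobile"], patient["family_mobile"], nearest["mobile"]):
        send_sms(number, msg)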
Experimental Results and Performance Analysis
We conducted different experiments to analyze and classify EEG signals. Our objective was to identify the preictal state so as to provide alerts to the patient before the seizure actually occurs. The EEG recordings used in this experiment were collected from five patients at a sampling rate of 173.61 Hz by using surface electrodes placed on the skull. Each set (A–E) contains 100 files, and each file consists of 4,096 values of one EEG time series in ASCII code. The first four sets (A–D) were obtained from nonepileptic patients. The last set (E) was recorded from an epileptic patient who had seizure activity. Therefore, our experimental data set contains a total of 500 single-channel EEG epochs (windows), out of which 400 are of nonepileptic patients and 100 are of an epileptic patient. Each EEG epoch is 23.6 s long. The recordings were captured using a 128-channel amplifier and converted into digital form at a sampling rate of 173.61 Hz and 12-bit analog/digital resolution. Table 2 shows some details of EEG recordings related to nonepileptic and epileptic patients.8

We analyzed the EEG signals using Matlab and its toolboxes. We performed our experiments on an Intel i5 CPU at 2.40 GHz with 2 Gbytes of memory, running Windows 7. Our experiment performed the following tasks:

■ EEG signal decomposition,
■ bispectral analysis,
■ feature extraction based on entropy,
■ feature classification,
■ performance analysis on Amazon Elastic Compute Cloud (EC2), and
■ performance comparison.

EEG Signal Decomposition
In the first stage, we applied the FWHT to decompose the signal. We extracted the discriminating features in terms of the frequency and spectral domain. We applied Algorithm 1 to each patient's EEG data file, each of which contains 4,096 points and generates 8,192 coefficients. Figure 2 represents the original EEG signal and its FWHT coefficients for a nonepileptic and an epileptic patient.

One of the major problems of seizure state characterization is in identifying whether the process is Gaussian or linear. In our experiment, the Hinich test is applied for detecting the nonskewness and linearity of the process.9 Different statistical parameters such as mean, variance, skewness, Gaussianity, and so on, based on FWHT coefficients, are evaluated in Table 3.
Figure 2. Original EEG signal and fast Walsh-Hadamard transform (WHT) coefficients: (a) nonepileptic patient and (b) epileptic patient. The fast WHT coefficients extract the discriminating features of the EEG signal, such as epileptic spikes.
Figure 3. Nonparametric bispectrum, estimated via the direct (FFT) method: (a) nonepileptic patient and (b) epileptic patient. The bispectrum is capable of retrieving the higher-order cumulants of a signal even in the presence of artifacts (FFT = fast Fourier transform).
Bispectral Analysis
Bispectral analysis is a powerful tool for detecting interfrequency phase coherence, which is used for characterization of the EEG signal's different states. The bispectrum is computed for each dataset to perform in-depth analysis of features using the HOSA toolbox in Matlab.10 For this purpose, the normalized point (np) is calculated for each FWHT coefficient (y) by using the equation

$$np = \frac{y - \mathrm{mean}(y)}{\mathrm{std}(y)},$$

where the functions mean() and std() are used to calculate the mean and standard deviation, respectively, of each FWHT coefficient.

The bispectrum is computed by applying the direct FFT method from the HOSA toolbox to each normalized point. A data vector matrix of size 256 × 256 is obtained. Figure 3 shows the bispectrum of a nonepileptic and an epileptic patient.

The bicoherence, the normalized form of the bispectrum, is estimated using the direct FFT method in the HOSA toolbox. We used bicoherence to quantify the quadratic phase coupling in EEG signals, which is very useful in detecting nonlinear coupling in the time series for the characterization of different seizure states. Figure 4 represents the bicoherence of a nonepileptic and an epileptic patient.
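A simplified stand-in for the toolbox's direct (FFT-based) estimator, averaging the triple product X(f1)X(f2)X*(f1+f2) over segments; the segment length and the absence of windowing are simplifications of this sketch:

import numpy as np

def bispectrum_direct(signal, nfft=256, seg_len=256):
    """Direct (FFT-based) bispectrum estimate, averaged over segments."""
    segs = [signal[i:i + seg_len]
            for i in range(0, len(signal) - seg_len + 1, seg_len)]
    B = np.zeros((nfft, nfft), dtype=complex)
    for s in segs:
        X = np.fft.fft(s, nfft)
        for f1 in range(nfft // 2):
            for f2 in range(nfft // 2):
                B[f1, f2] += X[f1] * X[f2] * np.conj(X[(f1 + f2) % nfft])
    return B / max(len(segs), 1)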
Feature Extraction Based on Entropy
In this stage of the experiment, we determine the best features relevant to three seizure states (normal, preictal, and ictal) by evaluating different kinds of entropy from the bicoherence. In the seizure recognition, we considered three classes: normal, preictal, and ictal. Hence, we computed three different sets of entropy values for the recognition of the different seizure states. Table 4 shows the mean values of the different seizure states for the three selected features, computed on the basis of third-order polyspectra for the five patients.
Figure 4. Bicoherence, estimated via the direct (FFT) method: (a) nonepileptic patient and (b) epileptic patient. The bicoherence is used to retrieve different types of entropy values to characterize the different seizure states.
[Two figures (graphics only; captions not recoverable): a scatter of the preictal and ictal classes against log energy entropy (E2) and normalized Shannon entropy (E1), and panels plotting total no. users against time (in minutes).]

We then analyzed performance on Amazon EC2 as a function of the number of patients. We performed a comparative evaluation on a desktop computer and Amazon EC2 with different sets of patients, starting from 5,000 and increasing up to 50,000. Figure 7 shows the execution time to process and classify the EEG data. Results show that the time required for the computation of EEG data on Amazon EC2 is reduced significantly compared to the desktop computer. We also tested several other classification models in Weka 3.6 and compared their performance with our proposed Gaussian process. Table 5 shows the summary of statistics of the different classification models tested in Weka 3.6.

Figure 7. Comparative performance of EEG signal analysis on the Amazon EC2 cloud and a desktop computer (series for 15,000, 30,000, and 45,000 patients; execution time versus no. patients in thousands). The time required for the analysis of EEG data on Amazon EC2 is significantly lower than on the desktop computer.

Table 6 shows the results of classification accuracy for the Gaussian process classifier. The classifier is able to classify normal, preictal, and ictal states with an accuracy of 84.20 percent, 86.40 percent, and 89.00 percent, respectively.

Next, we calculated the classification accuracy of detecting the preictal state versus a non-preictal state using the three statistical measures of sensitivity, specificity, and accuracy.17,18 The accuracy of each classification algorithm was tested in Weka 3.6; Table 7 shows the sensitivity, specificity, and accuracy scores. The proposed Gaussian process classification algorithm provides a high sensitivity of 83.6 percent and a high accuracy of 85.1 percent over all other classification models. Moreover, the Gaussian process classifier has a larger area under the receiver operating characteristic (ROC) curve than the other models. It's clear from Table 7 that the Gaussian classifier achieves the highest classification accuracy of 85.1 percent and justifies its use in our proposed system.
Table 6. Classification accuracy of the GP classifier with entropy features (E1, E2, E3).
Categories | No. instances | No. correctly classified instances | Correct classification (%)
Normal | 50,000 | 42,100 | 84.2
Preictal | 50,000 | 43,200 | 86.4
Ictal | 50,000 | 44,500 | 89.0
Table 7. Detailed accuracy of Gaussian and other models for EEG signal classification.
Classification model | Sensitivity (%) | Specificity (%) | Accuracy (%) | Receiver operating characteristic area
Gaussian process | 83.6 | 16.3 | 85.1 | 0.984
Multilayer perceptron | 78.5 | 21.5 | 80.3 | 0.928
Linear regression | 71.7 | 28.3 | 77.4 | 0.892
Least median of squares regression | 26.6 | 73.4 | 25.2 | 0.464
The proposed framework uses higher-order statistics to detect and extract features representing the nonlinearity of the EEG signal; hence, its use isn't limited to a specific part of the brain.

12. H. Yan et al., "A Multilayer Perceptron-Based Medical Decision Support System for Heart Disease Diagnosis," Expert Systems with Applications, vol. 30, no. 2, 2006, pp. 272–281.
13. D.M. Bates and D.G. Watts, Nonlinear Regression: …
HYBRID SYSTEMS
Cole Freniere, Ashish Pathak, Mehdi Raessi, and Gaurav Khanna | University of Massachusetts Dartmouth
Amazon's Elastic Compute Cloud (EC2) service could be an alternative computational resource for running MPI-
parallel, GPU-accelerated, multiphase-flow simulations. The EC2 service is competitive with a benchmark cluster
in a certain range of simulations, but there are some performance limitations, particularly in the GPU and cluster
network connection.
Since the 1980s, the US National Science Foundation (NSF) has funded supercomputers for use by scientific researchers and engineers, but continuing this practice today involves many challenges. An interim NSF report published in 2014 made it clear that the high cost of high-end facilities and shrinking NSF resources are compounded by the fact that the computing needs of scientists and engineers are becoming more diverse.1 For example, data analytics is a rapidly growing field that brings with it completely different computing requirements than conventional scientific and engineering simulations. For optimal application performance, a certain system structure is desired, and different disciplines tend to have different optimal systems. Some applications, for example, are shifting from conventional CPUs to heterogeneous parallelized architectures that include GPUs. Cloud computing could be a potential solution to meet these expanding computing needs. Although cloud computing services could be transformative for some fields, there's a high level of uncertainty about the cost tradeoffs, and the options must be evaluated carefully.1

A case in support of cloud computing is that if it's used as an alternative to constructing and maintaining a local high-performance computing cluster (HPCC), it would relieve institutions and companies from the drudgery and cost of building and maintaining their own local HPCCs. Instead, they can simply set up an account and run an application instantly for a fee, with no financial overhead for installation or maintenance. Moreover, a cloud service can offer great flexibility—HPC users can outsource their lower-profile jobs to cloud servers and reserve the most critical ones for their local clusters. An additional benefit to using cloud computing is that various machine configurations can be expeditiously tested and explored for benchmarking purposes, which can lead to more appropriate decisions for those planning on building their own HPCC.

However, are the cloud services available today prepared to meet the needs of HPC applications? Is using the cloud a viable alternative to localized, conventional supercomputers? Amazon Web Services (AWS) is one of the most prevalent vendors in the cloud computing market,2 and its computing service, Amazon Elastic Compute Cloud (EC2), offers a variety of virtual computers (www.ec2instances.info). In recent years, several new services tailored toward HPC applications have been released; AWS seems to be an appropriate cloud computing provider to evaluate whether cloud computing is ready for HPC applications. The first work that evaluated Amazon's EC2 service for an HPC application ran coupled atmosphere-ocean climate models and performed standard benchmark tests.3
That work highlighted that the performance was significantly worse in the cloud than at dedicated supercomputer centers and was only competitive with low-cost cluster systems. The poor performance occurred because latencies and bandwidths were inferior to dedicated centers; the authors recommended that the interconnect network be upgraded to systems such as Myrinet or InfiniBand to be desirable for HPC use. Peter Zaspel and Michael Griebel4 evaluated AWS for their heterogeneous CPU-GPU parallel two-phase flow solver, similar to the solver we present in this article. This work concluded that the cloud was well prepared for moderately sized computational fluid dynamics (CFD) problems for up to 64 cores or 8 GPUs, and that it was a viable and cost-effective alternative to mid-sized parallel computing systems. However, if the cloud cluster was increased to more than eight nodes, network interconnect problems followed. In 2012, Piyush Mehrotra and coworkers5 of the NASA Ames Research Center compared Amazon's performance to their renowned Pleiades supercomputer. For single-node tests, AWS was highly competitive with Pleiades, but for large core counts, it was significantly slower because the Ethernet connection didn't compare well with Pleiades' InfiniBand network. The authors concluded that Amazon's computers aren't suitable for tightly coupled applications, where fast communication is paramount.

Many other studies conducted standard benchmark tests on AWS to compare it to a conventional HPCC and reached similar conclusions. Zach Hill and Marty Humphrey6 concluded that AWS's ease of use and low cost make it an attractive option for HPC, but not for tightly coupled applications. Keith Jackson and coworkers7 ran their own application in addition to standard benchmark tests, and also concluded that AWS isn't suited for tightly coupled applications. Yan Zhai and coworkers8 included a variety of benchmark tests, application tests, and a highly detailed breakdown of the costs associated with the two alternatives, producing a more positive evaluation of AWS than most other studies, but with an admission that the cloud isn't ideal for codes that require many small messages between parallel processes. Aniruddha Marathe and coworkers9 ran benchmark tests and developed a pricing model to evaluate AWS as an alternative to a local cluster on a case-by-case basis but didn't use this model to present quantitative economic results. Overall, the general conclusions regarding cloud computing for HPC applications have evolved over time as the market has developed.

Our work is concerned with evaluating AWS for a GPU-accelerated, multiphase-flow solver—a 3D parallel code. Keeping in mind that Amazon's services are rapidly evolving and that new hardware options are constantly being added, the key question addressed is the following: Is outsourcing HPC workloads to the AWS cloud a viable alternative to using a local, purpose-built HPCC? This question is answered from the perspective of our own research group; broader recommendations are made for other HPC users. Not surprisingly, the answer to this question depends on many factors. We believe this work is the first to comprehensively test the g2.2xlarge GPU instance of AWS for multiphase-flow simulations; it's not limited to just standard benchmark tests.

Amazon Web Services
The user can manage cloud services with the AWS management console via a Web browser or the command line. AWS offers more than 40 different services, but the only ones necessary for our tests were the EC2 service for virtual computer rental and the Simple Storage Service (S3) for data storage. EC2 offers computers (known as instances) with a variety of hardware specifications—the most basic instance is a single-core CPU with 1 Gbyte of RAM, priced at US$0.013 per hour, and the most expensive instance consists of 32 cores with 104 Gbytes of RAM, priced at $6.82 per hour (www.ec2instances.info).

The user must select an Amazon Machine Image (AMI), which includes the operating system and software loaded onto the instance. Several default and community AMIs built by other customers are available. Because default AMIs are very bare-boned, it may be necessary for the user to install several libraries and other software on the instance to run specific applications—we spent considerable time properly configuring the instance for our application. However, once the instance is set up to the user's liking, a new AMI can be saved from that machine and can be used as a template to easily create more instances in the future. This is a critical feature when building clusters of instances.

Building a Cluster in the Cloud
For instances to communicate over the same network, they must be launched into the same placement group, which ensures that the requested machines are physically located in the same computing facility. Three notable tools are available:

■ Cloud Formation Cluster (CfnCluster) is offered by AWS for cluster creation, but it's currently limited in its configuration options. Only …
Figure 1. The average (a) latency and (b) bandwidth for various message sizes between two nodes on each cluster (AWS cluster versus UMD HPCC benchmark cluster). Note that the horizontal axis is logarithmic base 2 and the vertical is logarithmic base 10.
Point-to-point bandwidth tests. Figure 1b shows the communication bandwidth between two nodes. The maximum sustained bandwidth rates for AWS and the benchmark cluster are 984 Mbytes/s and 25.6 Gbytes/s, respectively. The bandwidth is 15 to 25 times lower on Amazon, illustrating the difference between the Ethernet connection in the AWS placement group and the InfiniBand connection on the benchmark cluster. Contrary to the latency tests, the bandwidth tests show that AWS suffers more at larger message sizes.

Collective latency tests. The graphs for the collective latency tests aren't presented in this article for brevity; the results are actually similar to the point-to-point tests. For an eight-node cluster on Amazon, the collective test MPI_alltoall approaches latencies of 700 μs, while on the benchmark cluster, it's 80 μs. Such large latencies drastically slow down the flow solver when quantities across multiple processes are collected and summed.
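Benchmarks of this kind can be reproduced with a simple ping-pong loop between two ranks; a hedged sketch using mpi4py (the article's own measurements presumably came from a standard benchmark suite; this is only illustrative):

# Run with: mpiexec -n 2 python pingpong.py
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
reps = 100

for size in [2 ** k for k in range(1, 23)]:        # 2 bytes .. 4 Mbytes
    buf = np.zeros(size, dtype=np.uint8)
    comm.Barrier()
    t0 = time.perf_counter()
    for _ in range(reps):
        if rank == 0:
            comm.Send(buf, dest=1); comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0); comm.Send(buf, dest=0)
    dt = (time.perf_counter() - t0) / (2 * reps)    # one-way time
    if rank == 0:
        print(f"{size} bytes: latency {dt*1e6:.1f} us, "
              f"bandwidth {size/dt/1e6:.1f} Mbytes/s")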
Connection Speed over the Internet from a Local Machine to Instances
For our purposes, it was convenient to simply secure copy (scp) the data directly from Amazon's virtual machines to ours, rather than using S3. The bandwidth fluctuated between 1 and 7 Mbytes/s, which is a reasonable connection. There could be some cases in which the data must persist past the lifetime of the instance, for example, if the output data can't be copied to a local server as quickly as the application produces it.

Performance of MPI-Parallel GPU-Accelerated Code
We tested the flow solver's performance on AWS by simulating a rigid, solid wedge free-falling through air and impacting a water surface.10 The time spent in communication between devices is termed communication overhead, and in the context of weak and strong scaling, we determined it for both CPU-GPU and CPU-CPU communication for various cluster sizes. Typically, when the flow solver is running on a conventional HPCC, about 10 to 25 percent of the execution time is spent just transferring data from the CPU to GPU, and 5 to 10 percent is spent transferring data between CPUs through MPI-parallel calls. Thus, any decrease in communication speed on AWS can have a significant impact on overall execution time.

GPU Performance
GPU speed is of great importance and drastically affects execution time. The GPU on the g2.2xlarge instance was found to be about 25 percent slower than the benchmark cluster's GPU. This impediment plays a large role in the results for overall AWS performance.

Strong Scaling
The simulation tested for strong scaling required nearly all of the 15 Gbytes of memory offered by a single g2.2xlarge instance. As Figure 2 shows, the AWS cluster is 25 to 40 percent slower than the benchmark cluster. Note that the speedup is reported relative to one node on the benchmark cluster on a logarithmic scale. For low node counts, the AWS cluster is competitive with the benchmark cluster and is merely 25 percent slower than the benchmark. However, as node count increases, AWS doesn't fare as well: the performance is 30 percent slower than the UMD HPCC for clusters with two nodes or more. The solver's general behavior for strong scaling is as follows: increasing the number of processes for a fixed problem size means less memory per process, that is, the number of cells per process decreases. Memory transfer between processes is directly related to the number of cells per process, and communication time between processes is proportional to the memory that must be transferred. Hence, strong scaling has the advantage of reducing the workload and communication time per process, but it has the disadvantage of requiring a large network size.
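A back-of-the-envelope illustration of that trade-off (the cubic-block domain decomposition and single-layer halo are assumptions of this sketch, not details from the article; the 36 million grid points come from the sample calculation later in the article):

def strong_scaling_estimate(total_cells, nodes):
    """Cells per process and an O(n^(2/3)) proxy for halo-exchange volume."""
    procs = 8 * nodes                      # 8 cores (MPI ranks) per node
    cells = total_cells / procs            # workload shrinks with scaling
    halo = 6 * cells ** (2 / 3)            # faces of a roughly cubic block
    return cells, halo

for n in (1, 2, 4, 8):
    cells, halo = strong_scaling_estimate(36e6, n)
    print(f"{n} node(s): {cells:,.0f} cells/process, "
          f"~{halo:,.0f} halo cells to communicate per exchange")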
[Figure 2 (caption not recoverable): strong-scaling speedup relative to one UMD HPCC (benchmark cluster) node versus no. nodes (8 cores and 1 GPU per node), for the AWS cluster and the benchmark cluster.]

Figure 3. Strong scaling CPU-GPU communication overhead time in seconds after 1,000 iterations of the pressure solver. Error bars indicate the maximum and minimum data points, and plotted points are averages. The vertical axis is logarithmic base 2.

CPU-GPU communication. A single node consists of one CPU (eight processors) and one GPU card. All the pressure field data for eight CPU processes are transferred between the CPU and GPU twice during each iteration of the pressure solver. The pressure solver was set to iterate 1,000 times, and the communication time was determined by modifying the code to either allow communication between the CPU and GPU or not at all. This isolated the communication time between the CPU and GPU. Figure 3 shows the results for communication time in logarithmic scale as the cluster is scaled up. Recognizing that scaling up the cluster decreases the number of cells per process, CPU-GPU communication time decreases accordingly. This behavior is observed for both the benchmark and AWS, implying that transfer performance from the CPU across the PCIe bus to the GPU is similar for these tests.

For the benchmark cluster, CPU-CPU communication starts off at 2.7 seconds and drops to less than 1 second very consistently for all subsequent cluster sizes. On the other hand, on AWS, the overhead starts off lower than the benchmark, at 1.3 seconds, but when a second node is added, it steps up dramatically to 4.6 seconds. It's interesting to note that AWS communication time increases with the addition of a second node, whereas the UMD HPCC communication time decreases. The key difference is that the addition of a second node on the AWS cluster requires the use of the Ethernet network, which negatively impacts performance. Another shortfall of AWS is that the performance of its Ethernet network is highly variable, which is visible in the error bars in Figure 4. Even though the CPU-CPU communication time on AWS is higher than on the benchmark cluster, the difference isn't significant for this particular application because the time spent in communication is relatively small compared to the total execution time.

Weak Scaling
Figure 5 shows the results of the weak scaling tests. Note that the scaling is presented relative to one UMD HPCC node. The AWS cluster is 25 to 45 percent slower overall than the benchmark cluster. As the number of nodes increases, AWS becomes progressively slower than the UMD HPCC. For example, AWS is 25 percent slower than the UMD HPCC for single-node test cases, but for high node counts, it becomes 45 percent slower. The 25 percent deficit for AWS for one node is because the GPU is inherently less powerful than the benchmark cluster's. However, the increased deficit with large cluster size is due to the slower Ethernet network.
[Figures 4 and 5 (captions not recoverable): CPU-CPU communication overhead and weak scaling relative to one UMD HPCC node, each plotted versus no. nodes (8 cores and 1 GPU per node) for the AWS cluster and the UMD HPCC (benchmark cluster).]
Table 1. Cost comparison between the benchmark local cluster (first two rows) and the AWS cloud HPC with on-demand, one-year, and three-year reservations. The benchmark is considered with and without electricity and maintenance costs.
 | Total cost | Equivalent cost per node-hour | Useful life
Without electricity or maintenance | $6,000 | $0.137 | 5 years
…
Most of the solver's time is spent in CPU-GPU communication and general computations, so the slow network connection doesn't pose as much of a problem as it presented in previous studies.

Cost Analysis
Cost is an important factor in our evaluation of AWS as an alternative to a local, conventional HPCC. Comparing the two alternatives on an hourly or total cost basis doesn't lead to an immediately obvious conclusion because there are many variables that can affect the outcome. We used our local cluster for the cost analysis, and the results can be considered a case study. When building a local HPCC, the upfront cost is very large, but the investment is relatively long term because it can last several years. In addition to the electricity cost, researchers using a local cluster might need to support IT professionals for maintenance services on the cluster. These additional costs over the cluster's lifetime can become significant compared to the cluster's upfront cost. Therefore, we'll make the cost analysis and comparison with AWS both with and without these additional costs in the following sections.

Purchasing cloud services is a fundamentally different approach to doing business. No maintenance or installation is required, and the upfront cost can be eliminated entirely by using an "on-demand" payment method that charges the user by rounding up to the nearest hour of usage time. Customers can commit to a certain amount of reserved hours and pay an upfront cost that will reduce the total cost compared to the on-demand payment method. Pricing on AWS depends mainly on the following factors:

■ compute time, which is the most expensive factor and depends on the instance type as well as usage tier (on-demand or reserved);
■ number of nodes;
■ amount and duration of data storage in the cloud; and
■ amount of data that is transferred from AWS to the Internet.

AWS offers three usage tiers: on-demand, one-year reserved, and three-year reserved. Note that a reserved instance will use the same physical machine for the reservation period. Table 1 presents a cost comparison between the benchmark local cluster and AWS for each usage tier. The benchmark local cluster is considered with and without the additional costs associated with electricity and maintenance by IT professionals. In this analysis, we approximated that the electricity cost per node is $3,200 for a period of five years. We also approximated that 30 percent of a full-time IT professional's time is spent on local cluster maintenance, which would result in $1,600 in maintenance cost per node for a five-year period.

Integration of Performance with Cost
Next, we narrow down our price analysis to price per unit of useful computational work. In other words, how many simulations can be completed on a cost basis? This type of analysis, admittedly, could be highly variable: it depends on the cluster size and the simulation, as well as the cluster's hardware specifications. For the majority of test cases shown in Figures 2 and 5, AWS was about 40 percent slower than the benchmark cluster. Consequently, simulations require roughly 40 percent more time to complete on AWS than on the benchmark cluster. To account for this, a weighting factor is applied to the results in Table 1, resulting in the "weighted cost per unit of work" shown in Table 2.
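The weighting itself is a one-liner; a sketch (the example AWS rate is hypothetical, and the 1.4 factor reflects the roughly 40 percent longer AWS runtimes reported above):

def weighted_cost_per_unit_work(cost_per_node_hour, slowdown=1.4):
    """Cost per node-hour scaled by how much longer each simulation takes."""
    return cost_per_node_hour * slowdown

# Example: a hypothetical AWS rate of $0.65/node-hour becomes an
# effective ~$0.91/node-hour once the slowdown is priced in.
print(weighted_cost_per_unit_work(0.65))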
Breakdown of Total Cost
The total cost associated with running the test case simulation on AWS can be modeled by Equation 1. The cost of data storage is $0.03/(Gbyte-month) for both EC2 block storage and S3, while the cost of data transfer is $0.09/Gbyte:
$$\text{Cost} = \text{(EC2)} + \text{(data storage)} + \text{(data transfer)} = (p \times t_1 \times n) + (4.175 \times 10^{-5} \times t_2 \times x_1) + (0.09 \times x_2), \quad (1)$$

where $p$ is the price of the instance ($/node-hour), $t_1$ is compute time (hours), $n$ is the number of nodes, $4.175 \times 10^{-5}$ is the price of data storage ($/Gbyte-hour), $t_2$ is the duration of data storage (hours), $x_1$ is the amount of data stored (Gbytes), 0.09 is the price of data transferred to the Internet ($/Gbyte), and $x_2$ is the amount of data transferred (Gbytes).
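Equation 1 is straightforward to encode and check against the sample calculation that follows; in this sketch, the $0.65/node-hour g2.2xlarge on-demand rate and the 120-hour storage duration are inferred or assumed, not stated in the article:

def aws_cost(p, t1, n, t2, x1, x2):
    """Equation 1: EC2 compute + data storage + data transfer, in dollars."""
    ec2 = p * t1 * n                    # $/node-hour * hours * nodes
    storage = 4.175e-5 * t2 * x1        # $/Gbyte-hour * hours * Gbytes
    transfer = 0.09 * x2                # $/Gbyte * Gbytes
    return ec2 + storage + transfer

# Sample calculation: 4 nodes for 106 hours, 16 Gbytes stored (~120 hours,
# assumed) and 16 Gbytes transferred -> roughly $275 + $0.08 + $1.44.
print(aws_cost(p=0.65, t1=106, n=4, t2=120, x1=16, x2=16))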
Sample Calculation
The computational domain in the test case studied here consisted of 36 million grid points, which required 60 Gbytes of RAM distributed across four nodes. The simulation time was 106 hours on AWS and 71 hours on the benchmark cluster. On AWS, 16 Gbytes of data were stored and transferred from the on-demand instances, which translates into $275 for EC2, $0.08 for data storage, and $1.44 for data transfer.

Clearly, EC2 is by far the largest contributor. On the benchmark cluster, the simulation cost is $39 when the electricity cost and maintenance are neglected and $70 when included. In both cases, running the simulation on the local cluster costs less than on AWS.

A local cluster that sits partly idle means that some nodes are paid for but aren't completing useful work. This increases the "weighted cost per unit of work" quantity, which can be quantified by the percentage of a local cluster's utilization. If the local cluster's percent utilization is below a critical value, then using AWS would be more cost-effective. Table 3 presents this critical value for the various AWS pricing options. For example, with the electricity and maintenance costs included, if the local cluster is utilized 27 percent or less, then the AWS on-demand option is more cost-effective. It should be mentioned that if the utilization of a local cluster is expected to be low, users could pool the resource with other local computational research groups, effectively subsidizing the cost and raising the utilization. It's important to note that it's less likely that reserved instances would have 100 percent utilization than on-demand instances, but the calculations for reserved instances are included with 100 percent utilization for consistency.

The percent utilization of our local cluster is much higher than the percentages shown in Table 3. Therefore, AWS isn't a cost-effective option compared to our local cluster. The only AWS option that becomes relatively competitive when the costs associated with electricity and maintenance are included is the three-year reserved instance.
11. A. Pathak and M. Raessi, "A Three-Dimensional Volume-of-Fluid Method for Reconstructing and Advecting Three-Material Interfaces Forming Contact Lines," J. Computational Physics, vol. 307, 2016, pp. 550–573.
12. A.J. Chorin, "Numerical Solution of the Navier-Stokes Equations," Mathematics of Computation, vol. 22, 1968, pp. 745–762.
13. S. Codyer, M. Raessi, and G. Khanna, "Using Graphics Processing Units to Accelerate Numerical Simulations of Interfacial Incompressible Flows," Proc. ASME Fluid Engineering Conf., 2012, pp. 625–634.

Cole Freniere is pursuing an MS in mechanical engineering at the University of Massachusetts Dartmouth. His research interests include renewable energy, fluid dynamics, and HPC. Specifically, he's interested in the application of advanced computational simulations to aid in the design of ocean wave energy converters. Contact him at [email protected].

Mehdi Raessi (corresponding author) is an assistant professor in the Mechanical Engineering Department at the University of Massachusetts Dartmouth. His research interests include computational simulations of multiphase flows with applications in energy systems (renewable and conventional), material processing, and microscale transport phenomena. Raessi has a PhD in mechanical engineering from the University of Toronto. Contact him at [email protected].

Gaurav Khanna is an associate professor in the Physics Department at the University of Massachusetts Dartmouth. His primary research project is related to the coalescence of binary black hole systems using perturbation theory and estimation of the properties of the emitted gravitational radiation. Khanna has a PhD in physics from Penn State University. He's a member of the American Physical Society. Contact him at [email protected].
COMPUTER SIMULATIONS
Editors: Barry I. Schneider, [email protected] | Gabriel A. Wainer, [email protected]
Core-collapse supernova explosions come from stars more massive than 8 to 10 times the mass of the sun. Ten core-collapse supernovae explode per second in the universe—in fact, automated astronomical surveys discover multiple events per night, and one or two explode per century in the Milky Way. Core-collapse supernovae outshine entire galaxies in photons for weeks and output more power in neutrinos than the combined light output of all other stars in the universe, for tens of seconds. These explosions pollute the interstellar medium with the ashes of thermonuclear fusion. From these elements, planets form and life is made. Supernova shock waves stir the interstellar gas, trigger or shut off the formation of new stars, and eject hot gas from galaxies. At their centers, a strongly gravitating compact remnant, a neutron star or a black hole, is formed.

As the name alludes, the explosion is preceded by the collapse of a stellar core. At the end of its life, a massive star has a core composed mostly of iron-group nuclei. The core is surrounded by an onion-skin structure of shells dominated by successively lighter elements. Nuclear fusion is still ongoing in the shells, but the iron core is inert. The electrons in the core are relativistic and degenerate. They provide the lion's share of the pressure support stabilizing the core against gravitational collapse. In this, the iron core is very similar to a white dwarf star, the end product of low-mass stellar evolution. Once the iron core exceeds its maximum mass (the so-called effective Chandrasekhar mass of approximately 1.5 to 2 solar masses [M⦿]), gravitational instability sets in.
…then quickly loses energy by work done breaking up infalling iron-group nuclei into neutrons, protons, and alpha particles. The copious emission of neutrinos from the hot (T ≈ 10 MeV ≈ 10^11 K) gas further reduces energy and pressure behind the shock. … balances the pressure behind the shock.

The supernova mechanism must revive the stalled shock to drive a successful core-collapse supernova explosion. Depending on the structure of the progenitor star, this must occur within one to a few seconds of core bounce. Otherwise, continuing accretion pushes the protoneutron star over its maximum mass (approximately 2 to 3 M⊙), which results in the formation of a black hole and no supernova explosion. Figure 1 provides a schematic of the core-collapse supernova phenomenon and its outcomes.

Figure 1. Schematic of core collapse and its simplest outcomes: core collapse to a protoneutron star (PNS); a stalled shock at ≈400 km; then, within τ ≈ 1 to a few seconds, either a core-collapse supernova explosion (shock revived) or black hole formation (shock not revived). The image shows SN 1987A, which exploded in the Large Magellanic Cloud. (© Anglo-Australian Observatory.)

If the shock is successfully revived, it must travel through the outer core and the stellar envelope before it breaks out of the star and creates the spectacular explosive display observed by astronomers on Earth. This could take more than a day for a red supergiant star (such as Betelgeuse, a 20 M⊙ star in the constellation Orion) or just tens of seconds for a star that has been stripped of its extended hydrogen-rich envelope by a strong stellar wind or mass exchange with a companion star in a binary system.

The photons observed by astronomers are emitted extremely far from the central regions, and they carry information on the overall energetics, the explosion geometry, and the products of the explosive nuclear burning triggered by the passing shock wave. They can, however, only provide weak constraints on the inner workings of the supernova. Direct observational information on the supernova mechanism can be gained only from neutrinos and gravitational waves that are emitted directly in the supernova core. Detailed computational models are required for gaining theoretical insight and for making predictions that can be contrasted with future neutrino and gravitational-wave observations from the next core-collapse supernova in the Milky Way.

Supernova Energetics and Mechanisms
Core-collapse supernovae are "gravity bombs." The energy reservoir from which any explosion mechanism must draw is the gravitational energy released in the collapse of the iron core to a neutron star: approximately 3 × 10^53 erg (3 × 10^46 J), a mass-energy equivalent of approximately 0.15 M⊙c^2.
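As a quick sanity check on these numbers (my arithmetic, not from the article), in CGS units:

    # 0.15 solar masses converted to rest-mass energy, E = 0.15 * M_sun * c^2.
    M_sun = 1.989e33            # solar mass in g
    c = 2.998e10                # speed of light in cm/s
    E = 0.15 * M_sun * c**2
    print(f"{E:.2e} erg")       # ~2.7e53 erg, consistent with the ~3e53 erg quoted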
A fraction of this tremendous energy is stored initially as heat (and rotational kinetic energy) in the protoneutron star…
…radii in a supernova core. In the dense protoneutron star, neutrinos are trapped and in equilibrium with matter. Their radiation field is isotropic. They gradually diffuse out and decouple from matter at the neutrinosphere (the neutrino equivalent of the photosphere). This decoupling is gradual and marked by the transition of the angular distribution into the forward (radial) direction. In the outer decoupling region, neutrino heating is expected to occur, and the heating rates are sensitive to the angular distribution of the radiation field.9 Eventually, at radii of a few hundred kilometers, the neutrinos have fully decoupled and are free streaming. Neutrino interactions with matter (and thus the decoupling process) are very sensitive to neutrino energy, since weak-interaction cross-sections scale with the square of the neutrino energy.

This is why neutrino transport needs to be multigroup, with typically a minimum of 10 to 20 energy groups covering supernova neutrino energies of 1 to O(100) MeV. Typical mean energies of electron neutrinos are around 10 to 20 MeV. Energy exchanges between matter and radiation occur via the collision terms in the Boltzmann equation. These are stiff sources/sinks that must be handled time-implicitly with (local) backward-Euler methods. The neutrino energy bins are coupled through frame-dependent energy shifts. Neutrino-matter interaction rates are usually precomputed and stored in dense multidimensional tables within which simulations interpolate.

Full 6+1-D general-relativistic Boltzmann neutrino radiation-hydrodynamics is exceedingly challenging and so far hasn't been possible to include in core-collapse supernova simulations, but 3+1-D (1D in space, 2D in momentum space),13 5+1-D (2D in space, 3D in momentum space),9 and static 6D simulations14 have been carried out.

Most (spatially) multidimensional simulations treat neutrino transport in some dimensionally reduced approximation. The most common is an expansion of the radiation field into angular moments. The nth moment of this expansion requires information about the (n+1)th moment (and in some cases, the (n+2)th moment as well). This necessitates a closure relation for the moment at which the expansion is truncated. Multigroup flux-limited diffusion evolves the 0th moment (the radiation energy density). The flux limiter is the closure that interpolates between diffusion and free streaming. The disadvantages of this method are its very diffusive nature (it washes out spatial variations of the radiation field), its sensitivity to the choice of flux limiter, and the need for time-implicit integration (involving global matrix inversion) due to the stability properties of the parabolic diffusion equation. Two-moment transport is the next better approximation, solving equations for the radiation energy density and momentum (that is, the radiative flux) and requiring a closure that describes the radiation pressure tensor (also known as the Eddington tensor). This closure can be analytic and based on the local values of energy density and flux (the M1 approximation). Alternatively, some codes compute a global closure based on the solution of a simplified, time-independent Boltzmann equation. The major advantage of the two-moment approximation is that its advection terms are hyperbolic and can be handled with standard time-explicit finite-volume methods of computational hydrodynamics, and only the local collision terms need time-implicit updates.

There are now implementations of multigroup two-moment neutrino radiation-hydrodynamics in multiple 2D/3D core-collapse supernova simulation codes.12,15,16 This method could be sufficiently close to the full Boltzmann solution (in particular, if a global closure is used) and appears to be the way toward massively parallel long-term 3D core-collapse supernova simulations.

Neutrino oscillations. Neutrinos have mass and can oscillate between flavors. The oscillations occur in a vacuum but can also be mediated by neutrino-electron scattering (the Mikheyev-Smirnov-Wolfenstein [MSW] effect) and neutrino-neutrino scattering. Neutrino oscillations depend on neutrino mixing parameters and on the neutrino mass eigenstates (the magnitudes of the mass differences are known but not their signs). Observation of neutrinos from the next galactic core-collapse supernova could help constrain the neutrino mass hierarchy.17

MSW oscillations occur in the stellar envelope. They're important for the neutrino signal observed in detectors on Earth, but they can't influence the explosion itself. The self-induced (via neutrino-neutrino scattering) oscillations, however, occur at the extreme neutrino densities near the core. They offer a rich phenomenology that includes collective oscillation behavior of neutrinos.17 The jury's still out on their potential influence on the explosion mechanism. Collective neutrino oscillation calculations (essentially solving coupled Schrödinger-like equations) are computationally intensive.17
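To make the stiff-source treatment described above concrete, here is a minimal sketch (my own construction, not taken from any production code) of a local backward-Euler update for a single zone and energy group, assuming a simple linear relaxation toward a hypothetical equilibrium energy density E_eq with a hypothetical opacity kappa:

    # Hypothetical linear relaxation model: dE/dt = c * kappa * (E_eq - E).
    # Backward Euler evaluates the source at the new time level, so the
    # update stays stable even when dt >> 1/(c*kappa) (the stiff regime).
    def implicit_collision_update(E, E_eq, kappa, c, dt):
        """Solve E_new = E + dt*c*kappa*(E_eq - E_new) for E_new (one zone, one group)."""
        return (E + dt * c * kappa * E_eq) / (1.0 + dt * c * kappa)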
These calculations are currently performed independently of core-collapse supernova simulations and don't take into account feedback on the stellar plasma. Fully understanding collective oscillations and their impact on the supernova mechanism will quite likely require that neutrino oscillations, transport, and neutrino-matter interactions are solved for together in a quantum-kinetic approach.18

Equation of state and nuclear reactions. The EOS is essential for the (M)HD part of the problem and for updating the matter thermodynamics after neutrino-matter interactions. Baryons (protons, neutrons, alpha particles, heavy nuclei), electrons, positrons, and photons contribute to the EOS. Neutrino momentum transfer contributes an effective pressure that is taken into account separately because neutrinos are not everywhere in local thermodynamic equilibrium with the stellar plasma. In different parts of the star, different EOS physics applies.

At low densities and temperatures below approximately 0.5 MeV, nuclear reactions are too slow to reach nuclear statistical equilibrium. In this regime, the mass fractions of the various heavy nuclei (isotopes, in the following) must be tracked explicitly. As the core collapses, the gas heats up and nuclear burning must be tracked with a nuclear reaction network, a stiff system of ODEs. Solving the reaction network requires the inversion of sparse matrices at each grid point. Depending on the number of isotopes tracked (ranging typically from O(10) to O(100)), nuclear burning can be a significant contributor to the overall computational cost of a simulation. The EOS in the burning regime is simple: all isotopes can essentially be treated as noninteracting ideal Boltzmann gases. Often, corrections for Coulomb interactions are included. Photons and electrons/positrons can be treated everywhere as ideal Bose and Fermi gases, respectively. Because electrons will be partially or completely degenerate, computing the electron/positron EOS involves the FLOP-intensive solution of Fermi integrals. Because of this, their EOS is often included in tabulated form.

At temperatures above 0.5 MeV, nuclear statistical equilibrium holds. This greatly simplifies things, since now the electron fraction Ye (the number of electrons per baryon; because of macroscopic charge neutrality, Ye is equal to Yp, the number fraction of protons) is the only compositional variable. The mass fractions of all other baryonic species can be obtained by solving Saha-like equations for compositional equilibrium. At densities below approximately 10^10 to 10^11 g cm^-3, the baryons can still be treated as ideal Boltzmann gases (but including Coulomb corrections).

The nuclear force becomes relevant at densities near and above 10^10 to 10^11 g cm^-3. It is an effective quantum manybody interaction of the strong force, and its detailed properties presently aren't known. Under supernova conditions, matter will be in NSE in the nuclear regime, and the EOS is a function of density, temperature, and Ye. Starting from a nuclear force model, an EOS can be obtained in multiple ways,19 including direct Hartree-Fock manybody calculations, mean field models, or phenomenological models (such as the liquid-drop model).

Typically, the minimum of the Helmholtz free energy is sought and all thermodynamic variables are obtained from derivatives of the free energy. In most cases, EOS calculations are too time-consuming to be performed during a simulation. As in the case of the electron/positron EOS, large, densely spaced nuclear EOS tables (more than 200 Mbytes must be stored by each MPI process) are precomputed, and simulations efficiently interpolate in (log ρ, log T, Ye) to obtain thermodynamic and compositional information.
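As an illustration of that lookup, here is a minimal sketch with an assumed table layout; the grid ranges, resolution, and placeholder data below are hypothetical, not those of any published EOS:

    import numpy as np
    from scipy.interpolate import RegularGridInterpolator

    # Assumed (log rho, log T, Ye) grid; a real nuclear EOS table would be
    # read from disk and hold many thermodynamic quantities, not random data.
    log_rho = np.linspace(3.0, 15.0, 200)    # log10 density [g cm^-3]
    log_T   = np.linspace(-2.0, 2.0, 150)    # log10 temperature [MeV]
    ye      = np.linspace(0.05, 0.55, 50)    # electron fraction
    table   = np.random.rand(200, 150, 50)   # placeholder for, say, log10 pressure

    eos = RegularGridInterpolator((log_rho, log_T, ye), table)
    print(eos([[10.0, 0.5, 0.3]]))           # interpolate at one (rho, T, Ye) point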
Multidimensionality
Stars are, at zeroth order, gas spheres. It's thus natural to start with assuming spherical symmetry in simulations—in particular, given the very limited compute power available to the pioneers of supernova simulations. After decades of work, it now appears clear that detailed spherically symmetric simulations robustly fail at producing explosions for stars that are observed to explode in nature. Spherical symmetry itself could be the culprit because symmetry is clearly broken in core-collapse supernovae:

■ Observations show that neutron stars receive "birth kicks," giving them typical velocities of O(100) km s^-1 with respect to the center of mass of their progenitors. The most likely and straightforward explanation for these kicks is that highly asymmetric explosions lead to neutron star recoil, owing to momentum conservation.
■ Deep observations of supernova remnants show that the innermost supernova ejecta exhibit low-mode asphericity similar to the geometry of the shock front shown in Figure 2.
■ Analytic considerations as well as 1D core-collapse simulations show that the protoneutron star and the region behind the stalled shock where neutrino heating takes place are both…
Figure 7. Slices from four semiglobal 3D simulations of neutrino-driven convection with parameterized neutrino cooling and heating, carried out in a 45° wedge. The color map is the specific entropy; blue colors mark low-entropy regions, red corresponds to high entropy. Only the resolution is varied. The wedge marked "ref." is the reference resolution (Δr = 3.8 km, Δθ = Δφ = 1.8°) that corresponds to the resolution of present global 3D detailed radiation-hydrodynamics core-collapse supernova simulations; the other wedges are 2, 6, and 12 times finer. Note how low resolution favors large flow features and how the turbulence breaks down to progressively smaller features with increasing resolution. This figure includes simulations up to 12 times the reference resolution that were run on 65,536 cores of Blue Waters. (Rendered by David Radice, Caltech.)
Now, the Reynolds stress is dominated by turbulent fluctuations at the largest physical scales: a simulation that has more kinetic energy in large-scale motions will explode more easily than a simulation that has less. This realization readily explains recent findings by multiple simulation groups, namely, that 2D simulations appear to explode more readily than 3D simulations.21,22 This is likely a consequence of the different behaviors of turbulence in 2D and 3D. In 2D, turbulence transports kinetic energy to large scales (which is unphysical), artificially increasing the turbulent pressure contribution. In 3D, turbulence cascades energy to small scales (as it should and is known experimentally), so a 3D supernova will generally have less turbulent pressure support than a 2D supernova.

Another recent finding by multiple groups is that simulations with lower spatial resolution appear to explode more readily than simulations with higher resolution. There are two possible explanations for this, and it is likely that they play hand in hand: one, low resolution creates a numerical bottleneck in the turbulent cascade, artificially trapping turbulent kinetic energy at large scales where it can contribute most to the explosion; and two, low resolution also increases the size of numerical perturbations that enter through the shock and from which buoyant eddies form. The larger these seed perturbations are, the stronger is the turbulent convection and the larger is the Reynolds stress.

The qualitative and quantitative behavior of turbulent flow is very sensitive to numerical resolution. This can be appreciated by looking at Figure 7, which shows the same 3D simulation of neutrino-driven convection at four different resolutions, spanning a factor of 12 from the reference resolution that is presently used in many 3D simulations and which underresolves the turbulent flow. As resolution is increased, turbulent flow breaks down to progressively smaller features. What also occurs, but cannot be appreciated from a still figure, is that the intermittency of the flow increases as the turbulence is better resolved. This means that flow features are not persistent but quickly appear and disappear through nonlinear interactions of turbulent eddies. In this way, the turbulent cascade can be temporarily reversed (this is called backscatter in turbulence jargon), creating large-scale intermittent flow features similar to what is seen at low resolution. The role of intermittency in neutrino-driven turbulence and its effect on the explosion mechanism remain to be studied.

A key challenge for 3D core-collapse supernova simulations is to provide sufficient resolution so that kinetic energy cascades away from the largest scales at the right rate. Resolution studies suggest that this could require between 2 to 10 times the resolution of current 3D simulations. A 10-fold increase in resolution in 3D corresponds to a 10,000 times increase in computational cost (a factor of 10^3 from the three spatial dimensions and another factor of 10 from the correspondingly smaller time step). An alternative could be to devise an efficient subgrid model that, if included, provides the correct rate of energy transfer to small scales. Work in that direction is still in its infancy in the core-collapse supernova context.

Making Magnetars: Resolving the Magnetorotational Instability
The magnetorotational mechanism relies on the presence of an ultra-strong (10^15 to 10^16 G) global, primarily toroidal, magnetic field around the protoneutron star. Such a strongly magnetized protoneutron star is called a protomagnetar. It has been theorized that the MRI7 could generate a very strong local magnetic field that could be transformed into a global field by a dynamo process. While appealing, it was not at all clear that this is what happens.
…departure from current code design and major code development efforts. Several supernova groups are exploring new algorithms, numerical methods, and parallelization paradigms. Discontinuous Galerkin (DG) finite elements24 have emerged as a promising discretization approach that guarantees high numerical order while minimizing the amount of subdomain boundary information that needs to be communicated between processes. In addition, switching to a new, more flexible parallelization approach will likely be necessary to prepare supernova codes (and other computational astrophysics codes solving similar equations) for exascale machines. A prime contender being considered by supernova groups is task-based parallelism, which allows for fine-grained dynamical load balancing and asynchronous execution and communication. Frameworks that can become task-based backbones of future supernova codes already exist, such as Charm++ (http://charm.cs.illinois.edu/research/charm), Legion (http://legion.stanford.edu/overview), and Uintah (http://uintah.utah.edu); a schematic illustration follows.
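The sketch below conveys the idea only, using Python's standard library as a stand-in for Charm++, Legion, or Uintah: independent tasks are queued, workers pick them up as they become free, and results are consumed asynchronously as they complete:

    from concurrent.futures import ProcessPoolExecutor, as_completed

    def advance_patch(patch_id, dt):
        """Stand-in for one grid-patch update whose cost varies from patch to patch."""
        # ... hydrodynamics/transport update for this patch would go here ...
        return patch_id

    if __name__ == "__main__":
        patches = range(64)
        with ProcessPoolExecutor() as pool:
            # Each patch update is an independent task; idle workers pick up
            # whatever is queued, so expensive patches don't stall cheap ones.
            futures = [pool.submit(advance_patch, p, 1e-6) for p in patches]
            for fut in as_completed(futures):
                fut.result()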
Acknowledgments
I acknowledge helpful conversations with and help from Adam Burrows, Sean Couch, Steve Drasco, Roland Haas, Kenta Kiuchi, Philipp Mösta, David Radice, Luke Roberts, Erik Schnetter, Ed Seidel, and Masaru Shibata. I thank the Yukawa Institute for Theoretical Physics at Kyoto University for hospitality while writing this article. This work is supported by the US National Science Foundation (NSF) under award numbers CAREER PHY-1151197 and TCAN AST-1333520, and by the Sherman Fairchild Foundation. Computations were performed on NSF XSEDE under allocation TG-PHY100033 and on NSF/NCSA Blue Waters under NSF PRAC award number ACI-1440083. Movies of simulation results can be found on www.youtube.com/SXSCollaboration.

References
1. H.A. Bethe and J.R. Wilson, "Revival of a Stalled Supernova Shock by Neutrino Heating," Astrophysical J., vol. 295, Aug. 1985, pp. 14–23.
2. C.D. Ott et al., "General-Relativistic Simulations of Three-Dimensional Core-Collapse Supernovae," Astrophysical J., vol. 768, May 2013, article no. 115.
3. H.-T. Janka, "Explosion Mechanisms of Core-Collapse Supernovae," Ann. Rev. Nuclear and Particle Science, vol. 62, Nov. 2012, pp. 407–451.
4. P. Mösta et al., "Magnetorotational Core-Collapse Supernovae in Three Dimensions," Astrophysical J. Letters, vol. 785, Apr. 2014, article no. L29.
5. G.S. Bisnovatyi-Kogan, "The Explosion of a Rotating Star as a Supernova Mechanism," Astronomicheskii Zhurnal, vol. 47, Aug. 1970, p. 813.
6. J.M. LeBlanc and J.R. Wilson, "A Numerical Example of the Collapse of a Rotating Magnetized Star," Astrophysical J., vol. 161, Aug. 1970, pp. 541–551.
7. S.A. Balbus and J.F. Hawley, "A Powerful Local Shear Instability in Weakly Magnetized Disks. I—Linear Analysis. II—Nonlinear Evolution," Astrophysical J., vol. 376, July 1991, pp. 214–233.
8. A. Wongwathanarat, H. Janka, and E. Müller, "Hydrodynamical Neutron Star Kicks in Three Dimensions," Astrophysical J. Letters, vol. 725, Dec. 2010, pp. L106–L110.
9. C.D. Ott et al., "Two-Dimensional Multiangle, Multigroup Neutrino Radiation-Hydrodynamic Simulations of Postbounce Supernova Cores," Astrophysical J., vol. 685, Oct. 2008, pp. 1069–1088.
10. E.F. Toro, Riemann Solvers and Numerical Methods for Fluid Dynamics, Springer, 1999.
11. T.W. Baumgarte and S.L. Shapiro, Numerical Relativity: Solving Einstein's Equations on the Computer, Cambridge Univ. Press, 2010.
12. T. Kuroda, T. Takiwaki, and K. Kotake, "A New Multi-Energy Neutrino Radiation-Hydrodynamics Code in Full General Relativity and Its Application to Gravitational Collapse of Massive Stars," Astrophysical J. Supplemental Series, vol. 222, Feb. 2016, article no. 20.
13. M. Liebendörfer et al., "Supernova Simulations with Boltzmann Neutrino Transport: A Comparison of Methods," Astrophysical J., vol. 620, Feb. 2005, pp. 840–860.
14. K. Sumiyoshi et al., "Multidimensional Features of Neutrino Transfer in Core-Collapse Supernovae," Astrophysical J. Supplemental Series, vol. 216, Jan. 2015, article no. 5.
15. E. O'Connor and S.M. Couch, "Two Dimensional Core-Collapse Supernova Explosions Aided by General Relativity with Multidimensional Neutrino Transport," submitted to Astrophysical J., Nov. 2015; arXiv:1511.07443.
16. L.F. Roberts et al., "General Relativistic Three-Dimensional Multi-Group Neutrino Radiation-Hydrodynamics Simulations of Core-Collapse Supernovae," submitted to Astrophysical J., Apr. 2016; arXiv:1604.07848.
17. A. Mirizzi et al., "Supernova Neutrinos: Production, Oscillations and Detection," La Rivista del Nuovo Cimento, vol. 39, Jan. 2016, pp. 1–112.
18. A. Vlasenko, G.M. Fuller, and V. Cirigliano, "Neutrino Quantum Kinetics," Physical Rev. D, vol. 89, no. 10, 2014, article no. 105004.
19. A.W. Steiner, M. Hempel, and T. Fischer, "Core-Collapse Supernova Equations of State Based on Neutron Star Observations," Astrophysical J., vol. 774, Sept. 2013, article no. 17.
20. S.M. Couch et al., "The Three-Dimensional Evolution to Core Collapse of a Massive Star," Astrophysical J. Letters, vol. 808, July 2015, article no. L21.
21. E.J. Lentz et al., "Three-Dimensional Core-Collapse Supernova Simulated Using a 15 M⊙ Progenitor," Astrophysical J. Letters, vol. 807, July 2015, article no. L31.
22. S.M. Couch and C.D. Ott, "The Role of Turbulence in Neutrino-Driven Core-Collapse Supernova Explosions," Astrophysical J., vol. 799, Jan. 2015, article no. 5.
23. P. Mösta et al., "A Large-Scale Dynamo and Magnetoturbulence in Rapidly Rotating Core-Collapse Supernovae," Nature, vol. 528, no. 7582, 2015, pp. 376–379; www.nature.com/nature/journal/v528/n7582/full/nature15755.html.
24. J.S. Hesthaven and T. Warburton, Nodal Discontinuous Galerkin Methods: Algorithms, Analysis, and Applications, 1st ed., Springer, 2007.

Christian D. Ott is a professor of theoretical astrophysics in the Theoretical Astrophysics Including Cosmology and Relativity (TAPIR) group of the Walter Burke Institute for Theoretical Physics at Caltech. His research interests include astrophysics and computational simulations of core-collapse supernovae, neutron star mergers, and black holes. Ott received a PhD in physics from the Max Planck Institute for Gravitational Physics and Universität Potsdam. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.
LEADERSHIP COMPUTING
Editors: James J. Hack, [email protected] | Michael E. Papka, [email protected]
"Molecular machines," composed of protein components, consume energy to perform specific biological functions. The concerted actions of the proteins trigger many of the critical activities that occur in living cells. However, like any machine, the components can break (through various mutations), and then the proteins fail to perform their functions correctly.

It's known that malfunctioning proteins can result in a host of diseases, but pinpointing when and how a malfunction occurs is a significant challenge. Very few functional states of molecular machines are determined by experimentalists working in wet laboratories. Therefore, more structure-function information is needed to develop an understanding of disease processes and to design novel therapeutic agents.

The research team of Benoît Roux, a professor in the University of Chicago's Department of Biochemistry and Molecular Biology and a senior scientist in Argonne National Laboratory's Center for Nanoscale Materials, relies on an integrative approach to discover and define the basic mechanisms of biomolecular systems—an approach that relies on theory, modeling, and running large-scale simulations on some of the fastest open science supercomputers in the world.

Computers have already changed the landscape of biology in considerable ways; modeling and simulation tools are routinely used to fill in knowledge gaps from experiments, helping design and define research studies. Petascale supercomputing provides a window into something else entirely: the ability to calculate all the interactions occurring between the atoms and molecules in a biomolecular system…
Figure 1. Interaction of cytoplasmic domains in the calcium pump of sarcoplasmic reticulum (states E1, E1–2Ca2+–ATP, E1P–2Ca2+–ADP, E2, E2–Pi, and E2P). These six states have been structurally characterized and represent important intermediates along the reaction cycle. The blue domain, shown in surface representation, is called the phosphorylation domain (P). The red and green domains, shown as Cα traces, are called actuator (A) and nucleotide binding (N) domains, respectively. The red and green patches in the P domain are interacting with residues in A and N domains, respectively. Two residues are considered to be in contact if at least one pair of non-hydrogen atoms is within 4 Å of each other. (Image: Avisek Das, University of Chicago, used with permission.)
…simulation capabilities and take advantage of the machine's features, such as high processor counts or advanced chips, to evolve the system for longer and longer periods of time.

Roux and his team used a premier MD simulation code, called NAMD, that combines two advanced algorithms: the swarm-of-trajectory string method and multidimensional umbrella sampling. NAMD, which was first developed at the University of Illinois at Urbana-Champaign by Klaus Schulten and Laxmikant Kale, is a program used to carry out classical simulations of biomolecular systems. It's based on the Charm++ parallel programming system and runtime library, which provides infrastructure for implementing highly scalable parallel applications. When combined with a machine-specific communication library (such as PAMI, available on Blue Gene/Q), the string method can achieve extreme scalability on leadership-class supercomputers.

ALCF staff provided maintenance and support for NAMD software and helped coordinate and monitor the jobs running on Mira, ALCF's 10-Pflops IBM Blue Gene/Q.

ALCF computational scientist Wei Jiang has been actively collaborating with Roux's team since 2012, as part of Mira's Early Science Program. Jiang worked with IBM's system software team on early stage porting and optimization of NAMD on the Blue Gene/Q architecture. He's also one of the core developers of NAMD's multiple copy algorithm, which is the foundation for multiple INCITE projects that use NAMD.
Jiang, who has a background in computational biology, considers the recent work a significant breakthrough. "Only in the third year of the project did we begin to see real progress," he said. "The first and second year of an INCITE project is often accumulated experience."

Acknowledgments
An award of computer time was provided by the US Department of Energy's Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program. This research used resources of the Argonne Leadership Computing Facility, which is a US DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.

Laura Wolf is a science writer and editor for Argonne National Laboratory. Her interests include science communication.

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.
VISUALIZATION CORNER

Many application fields produce large amounts of multidimensional data. Simply put, these are datasets where, for each measurement point (also called data point, record, sample, observation, or instance), we can measure many properties of the underlying phenomenon. The resulting measurement values for all data points are usually called variables, dimensions, or attributes. A multidimensional dataset can thus be described as an n × m data table having n rows (one per observation) and m columns (one per dimension). When m is larger than roughly 5, such data is called high-dimensional. Such datasets are common in engineering (think of manufacturing specifications, quality assurance, and simulation or process control); medical sciences and e-government (think of electronic patient dossiers [EPDs] or tax office records); and business intelligence (think of large tables in databases).

While storing multidimensional data is easy, understanding it is not. The challenge lies not so much in having a large number of observations but in having a large number of dimensions. Consider, for instance, two datasets A and B. Dataset A contains 1,000 samples of a single attribute, say, the birthdates of 1,000 patients in an EPD. Dataset B contains 100 samples of 10 attributes, say, the amounts of 10 different drugs distributed to 100 patients. The total number of measurements in the two datasets is the same (1,000). Yet, understanding dataset A is quite easy, and it typically involves displaying either a (sorted) bar chart of its single variable or a histogram showing the patients' age distribution, as in the sketch that follows.
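Here is a small sketch of that single-attribute analysis, with synthetic ages standing in for the patient data:

    import matplotlib.pyplot as plt
    import numpy as np

    # 1,000 synthetic patient ages (a stand-in for the dataset-A attribute).
    ages = np.random.default_rng(1).normal(50, 15, size=1000).clip(0, 100)
    plt.hist(ages, bins=20)
    plt.xlabel("age")
    plt.ylabel("number of patients")
    plt.show()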
In contrast, understanding dataset B can be very hard—for example, it might be necessary to examine the correlations of any pair of two dimensions of the 10 available ones.

In this article, we discuss projections, a particular type of tool that allows the efficient and effective visual analysis of multidimensional datasets. Projections have become increasingly interesting and important tools for the visual exploration of high-dimensional data. Compared to other techniques, they scale well in the number of observations and dimensions, are intuitive, and can be used with minimal effort. However, they need to be complemented by additional visual mechanisms to be of maximal added value. Also, as they've been originally developed in more formal communities, they're less known or accessible to mainstream scientists and engineers. We provide here a compact overview of how to use projections to understand high-dimensional data, present a classification of projection techniques, and discuss ways to visualize projections. We also comment on the advantages of projections as opposed to other visualization techniques for multidimensional data, and illustrate their added value in a complex visual analytics workflow for machine learning applications in medical science.

Exploring High-Dimensional Data
Before outlining solutions for exploring high-dimensional data, we need to outline typical tasks that must be performed during such exploration. These can be classified into observation-centric tasks (which address questions focusing on observations) and dimension-centric tasks (which address questions focusing on the dimensions). Observation-centric tasks include finding groups of similar observations and finding outliers (observations that are very different from the rest of the data). Dimension-centric tasks include finding sets of dimensions that are strongly correlated and dimensions that are mutually independent. There exist also tasks that combine observations and dimensions, such as finding which dimensions make a given group of observations different from the rest of the data. Several visual solutions exist to address (parts of) these tasks, as follows. More details on these and other visualization techniques for high-dimensional data appear elsewhere.1,2

Tables
Probably the simplest method is to display the entire dataset as an n × m table, as we do in a spreadsheet. Sorting rows on the values in a given column lets us find observations with minimal or maximal values for that column and then read all their dimensions horizontally in a row. Visually scanning a sorted column lets us see the distribution of values of a given dimension.

But while spreadsheet views are good for showing detailed information, they don't scale to datasets having thousands of observations and tens of dimensions or more. To address such scalability, table lenses refine the spreadsheet idea: they work much like zooming out of the drawing of a large table, thereby reducing every row to a row of pixels. Rather than showing the actual textual cell content, cell values are now drawn as horizontal pixel bars colored and scaled to reflect data values. As such, columns are effectively reduced to bar graphs. Using sorting, we can now view the variation of dimension values for much larger datasets. However, reasoning about the correlation of different dimensions isn't easy using table lenses.

Scatterplots
Another well-known visualization technique for multidimensional data is a scatterplot, which shows the distribution of all observations with respect to two chosen dimensions i and j. Finding correlations, correlation strengths, and the overall distribution of data values is now easy. To do this for m dimensions, a so-called m × m scatterplot matrix can be drawn, showing the correlation of each dimension i with each other dimension j. However, reasoning about observations is hard now—an observation is basically a set of m^2 points, one in each scatterplot in the matrix. Also, scatterplot matrices don't scale well for datasets having more than roughly 8 to 10 dimensions.

Parallel Coordinates
A third solution for visualizing multidimensional data is parallel coordinates. Here, each dimension is shown as a vertical axis, thus the name parallel coordinates. Each observation is shown as a fractured line that connects the m points along these axes corresponding to its values in all the m dimensions. Correlations of dimensions (shown by adjacent axes) can now be spotted as bundles of parallel line segments; inverse correlations are shown by a typical x-shaped line-crossing pattern. Yet, parallel coordinates don't scale well beyond 10 to 15 dimensions. Also, they might require careful ordering of the axes to bring dimensions that one wants to compare close to each other in the plot. A sketch of the scatterplot matrix and the parallel-coordinates plot follows.
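Here is a brief sketch of both plot types; the library choice (pandas/matplotlib) and the synthetic data are mine, not prescribed by the article:

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    from pandas.plotting import parallel_coordinates, scatter_matrix

    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
    df["group"] = np.where(df["a"] + df["b"] > 0, "high", "low")

    scatter_matrix(df[list("abcd")], figsize=(6, 6))      # m x m scatterplot matrix
    plt.figure()
    parallel_coordinates(df, "group", cols=list("abcd"))  # one vertical axis per dimension
    plt.show()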
Multidimensional Projections
Projections take a very different approach to visualizing high-dimensional data. Think of the n data points in an m-dimensional space.
The dataset can then be conceptually seen as a point cloud in this space. If we could see in m dimensions, we could then (easily) find outliers as those points that are far from all other points in the cloud and find important groups of similar observations as dense and compact regions in the point cloud.

However, we can't see in more than three dimensions. Note also that a key ingredient of performing the above-mentioned tasks is reasoning in terms of distances between the points in m dimensions. Hence, if we could somehow map, or project, our point cloud from m to two or three dimensions, keeping the distances between point-pairs, we could do the same tasks by looking at a 2D or 3D scatterplot. Projections perform precisely this operation, as illustrated by Figure 1. Intuitively, they can be thought of as reducing the unnecessary dimensionality of the data (the original m dimensions), keeping the inherent dimensionality (that which encodes distances, or similarities, between points). Additionally, we can color-code the projected points by the values of one dimension, to get extra insights.

Figure 1. From a multivariate data table to a projection. Projections can be thought of as reducing the unnecessary dimensionality of the data (the original m dimensions) while keeping the inherent dimensionality (that which encodes distances, or similarities, between points).

There are two main use cases for projections. The first is to reduce the number of dimensions by keeping only one dimension from a set of dimensions which are strongly correlated, or by dropping dimensions along which the data has a very low variance. Essentially, this preserves patterns in the data (clusters, outliers) but makes its usage simpler, as there are fewer dimensions to consider next. The simplified dataset can next be used instead of the original one in various processing or analysis tasks. The second use case involves reducing the number of dimensions to two or three, so that we can visually explore the reduced dataset. In contrast to the first case, this usually isn't done by dropping dimensions but by creating two or three synthetic dimensions along which the data structure is best preserved. We next focus on this latter use case.

Projection Techniques
Many different techniques exist to create a 2D or 3D projection, and they can be classified according to several criteria, as follows.

Dimension versus distance. The dimension versus distance classification looks at the type of information used to construct a projection. Distance-based methods use only the distances, or similarities, between m-dimensional observations. Typical distances here are Euclidean and cosine; thus, the projection algorithm's input is an n × n distance matrix between all observation pairs. Such methods are also known as multidimensional scaling (MDS) because they intuitively scale the m-dimensional distances to 2D distances. Technically, this is done by optimizing a function that minimizes the so-called aggregated normalized stress, or summed difference between the inter-point distances in m dimensions and 2D, respectively; a small sketch of this measure follows.
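The sketch below computes one common variant of the normalized stress; exact normalizations differ slightly across MDS formulations:

    import numpy as np
    from scipy.spatial.distance import pdist

    def normalized_stress(X_high, X_2d):
        """Sum of squared differences between m-D and 2D inter-point
        distances, normalized by the squared m-D distances."""
        d_high = pdist(X_high)   # condensed pairwise distances in m dimensions
        d_low  = pdist(X_2d)     # pairwise distances in the 2D projection
        return np.sum((d_high - d_low) ** 2) / np.sum(d_high ** 2)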
that they don’t require the original dimensions—a (small) subset of observations, called representatives,
dissimilarity matrix between observations is suf- from the initial dataset and then projecting these by
ficient and extremely useful in cases where we can using a high-accuracy method. This isn’t expensive,
measure the similarities in some data collections but as the number of representatives is small. Finally,
don’t precisely know which attributes (dimensions) the remaining observations close to each representa-
explain those similarities. The main disadvantage of tive are fit around the position of the representative’s
MDS methods is that they require storing (and ana- projection. This is cheaper, simpler, and also more
lyzing) an n × n distance matrix. For n being tens of accurate than using a global technique. Intuitively,
thousands of observations, this can be very expen- think of our Earth example as splitting the ball sur-
sive.3 Several MDS refinements have been proposed, face into several small patches and projecting these
such as ISOMAP,4 Pivot MDS,5 and Fastmap,6 to 2D. When such patches have low curvature, fit-
which can compute projections in (near) linear time ting them to a 2D surface is easier than if we were to
to the number of observations. project the entire ball at once. Good local methods
In contrast, dimension-based methods use as in- include PLMP9 and LAMP.10 Using representatives
put the actual m dimensions of all observations. For has another added value: users can arrange these as
datasets having many more observations than dimen- desired in 2D, thereby controlling the projection’s
sions (n much larger than n), this gives considerable overall shape with little effort.
savings. However, we now need to have access to the
original dimension values. Arguably the best known Distance versus neighborhood preserving. A final classi-
method in this class is principal component analysis fication looks into what a projection aims to preserve.
(PCA), whose variations are also known under the When it’s important to accurately assess the similar-
names of singular value decomposition (SVD) or ity of points, distance preservation is preferred. All
Karhunen-Loève transform (KLT).7 Intuitively put, projection techniques listed above fall into this class.
the idea of 2D PCA is to find the plane, in m dimen- However, as we’ve seen, getting a good distance pres-
sions, on which the projections of the n observations ervation for all points can be hard. When the number
have the largest spread. Visualizing these 2D projec- of dimensions is very high, the Euclidean (straight-
tions will then give us a good way of understanding line) distances between all point-pairs in a dataset
the actual variance of the data in m dimensions.8 tend to become very similar, so accurately preserving
While simple and fast, PCA-based methods work such distances has less value. In such cases, it’s often
well only if the observations are distributed close to a better to preserve neighborhoods in a projection—this
planar surface in m dimensions. To understand this, way, the projection can still be used to reason about
consider a set of observations uniformly distributed the groups and outliers existing in the high-dimen-
on the surface of the Earth (a ball in 3D). When pro- sional dataset. Actually, the depiction of groups could
jecting these, PCA will effectively squash the ball to get even clearer because the projection algorithm has
a planar disk, projecting diametrically opposed ob- more freedom to place observations in 2D, as long as
servations on the ball’s surface to the same location, the nearest neighbors of a point in 2D are the same
meaning the projection won’t preserve distances. as those of the same point in m dimensions. The best-
What we actually want is a projection that acts much known method in this class is t-stochastic neighbor
as a map construction process, where the Earth’s sur- embedding (t-SNE), which is used in many applica-
face is unfolded to a plane, with minimal distortions. tions in machine learning, pattern recognition, and
data mining, and has a readily usable implementation
Global versus local. The global versus local classifica- (https://lvdmaaten.github.io/tsne).
____________________
tion looks at the type of operation used to construct a
projection. Global methods define a single mapping, Type of data. Most projection methods handle
which is then applied for all observations. MDS and quantitative dimensions, whose values are typically
PCA methods fall in this class. The main disadvan- continuously varying over some interval. Examples
tage of global methods is that it can be very hard are temperature, time duration, speed, volume, or
to find a single function that optimally preserves financial transaction values. However, projection
distances of a complex dataset when projecting it techniques such as multiple correspondence analy-
(as in the Earth projection example). Another dis- sis (MCA) can also handle categorical data (types)
advantage is that computing such a global mapping or mixed datasets of quantitative and categorical
can be expensive (as in the case of classical MDS). data. A good description of MCA and related tech-
Local methods address both these issues, selecting a niques is given by Greenacre.11
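A brief usage sketch, here via scikit-learn's t-SNE implementation as an alternative to the linked reference code:

    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.rand(500, 50)     # 500 observations, 50 dimensions (synthetic)
    X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)  # n x 2 layout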
Type of data. Most projection methods handle quantitative dimensions, whose values are typically continuously varying over some interval. Examples are temperature, time duration, speed, volume, or financial transaction values. However, projection techniques such as multiple correspondence analysis (MCA) can also handle categorical data (types) or mixed datasets of quantitative and categorical data. A good description of MCA and related techniques is given by Greenacre.11
Figure 2. Projection visualizations with (a) thumbnails, (b) biplot axes, (c) and (d) axis legends, and (e) key local dimensions. (The panels are annotated with x and y axis legends, minimum and maximum values, and the dimension selected for color mapping; in (c), male and female groups separate along the gender dimension, and in (d), dimension 7 is the H− mass abundance.)
The Projection Explorer is a very good place to start working with projections in practice.12 This tool implements a wide range of state-of-the-art projection techniques that can handle hundreds of thousands of observations with hundreds of dimensions and provides several visualizations to interactively customize and explore projections. The tool is freely downloadable from http://infoserver.lcad.icmc.usp.br/infovis2/Tools.

Visualizing Projections
The simplest and most widespread way to visualize a projection is to draw it as a scatterplot. Here, each point represents an observation, and the 2D distance between points reflects the similarities of the observations in m dimensions. Points can be also annotated with color, labels, or even thumbnails to explain several of their dimensions.

Figure 2a shows this for a dataset where observations are images. The projection shows image thumbnails, organized by similarity. We can easily see here that our image collection is split into two large groups; we can get more insight into the composition of the groups by looking at the thumbnails.

However, in many cases, there's no easy way to draw a small thumbnail-like depiction of all the m attributes of an observation. Projections will then show us groups and outliers, but how do we explain what these mean? In other words, how do we put the dimension information back into the picture? Without this, the added value of a projection is limited.
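As a minimal sketch of such a scatterplot-based projection view, assuming hypothetical arrays P (precomputed n × 2 projection coordinates) and X (the original n × m data):

    import matplotlib.pyplot as plt

    def plot_projection(P, X, dim=0):
        """Draw projected points, color-coded by one original dimension."""
        plt.scatter(P[:, 0], P[:, 1], c=X[:, dim], cmap="viridis", s=10)
        plt.colorbar(label=f"dimension {dim}")
        plt.show()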
There are several ways of explaining projec- tical (y) axes of the plot mean. For a projection, this
tions. By far the simplest, and most common, is to isn’t easy because these axes don’t straightforwardly
color code the projection points by the value of a map to data dimensions but, rather, to combina-
user-chosen dimension. If we next see strong col- tions of dimensions. Luckily, we can compute the
or correlations with different point groups in the contribution of each of the original m data dimen-
projection, we can explain these in terms of the sions to the spread of points along the projection’s
selected dimension’s specific values or value rang- x and y axes. Next, we can visualize these contri-
es. However, if we have tens of dimensions, using butions by standard bar charts (see Figure 2c): for
each one to color code the projection is tedious at each dimension, the x and y axis legends show a
best. Moreover, it could be that no single dimen- bar indicating how much that dimension is visible
sion can explain why certain observations are simi- on the x and y axes. Long bars, thus, indicate di-
lar. Tooltips can be shown at user-chosen points, mensions that strongly contribute to the spread of
which does a good job explaining a few outliers one points along the horizontal and vertical directions.
by one, but it doesn’t work if we want to explain a Figure 2c shows how this works: the dataset con-
large number of points together. tains 583 patient records, each having 10 dimen-
One early way to explain projections is to draw so-called biplot axes.13 For PCA projections and variants, lines indicate the directions of maximal variation in the 2D space of all m dimensions. Intuitively put, biplot axes generalize the concept of a scatterplot, where we can read the values of two dimensions along the x and y axes, to the case where we have m dimensions. Moreover, strongly correlated dimensions appear as nearly parallel axes, and independent dimensions appear as nearly orthogonal axes. Finally, the relative lengths of the axes indicate the relative variation of the respective dimensions. Biplots can also be easily constructed for any other projection, including 3D projections that generate a 3D point cloud rather than a 2D scatterplot.14 In such cases, the biplot axes need not be straight lines. Figure 2b shows an example of biplot axes for a dataset containing 2,814 abstracts of scientific papers. Each observation (abstract) has nine dimensions, indicating the frequencies of the nine most used technical terms in all abstracts. The projection, created using a force-based technique, places points close to each other if the respective abstracts are similar. Labels can be added to the axes to indicate their identity and their signs (extremities associated with minimum and maximum values). The curvature of the biplot axes tells us that the projection is highly nonlinear—intuitively, we can think that the nine-dimensional space gets distorted when squashed into the resulting 3D space. This is undesirable because reading the values of the dimensions along such curved axes is hard.
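For a linear projection such as PCA, straight biplot axes can be computed directly from the component loadings. The sketch below (our illustration, again assuming scikit-learn and the Iris data) draws each dimension's axis as an arrow whose direction and relative length reflect how that dimension maps onto the two projection axes:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    data = load_iris()
    X = StandardScaler().fit_transform(data.data)

    pca = PCA(n_components=2)
    P = pca.fit_transform(X)

    plt.scatter(P[:, 0], P[:, 1], s=10, alpha=0.4)
    scale = np.abs(P).max()
    for j, name in enumerate(data.feature_names):
        # Column j of components_ gives dimension j's biplot axis direction.
        dx, dy = pca.components_[0, j], pca.components_[1, j]
        plt.arrow(0, 0, scale * dx, scale * dy, color='red', head_width=0.08)
        plt.annotate(name, (scale * dx, scale * dy))
    plt.show()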
Still, interpreting biplot axes can be challenging, especially when we have 10 or more variables, as we get too many lines drawn in the plot. Moreover, most users are accustomed to interpreting a point cloud as a Cartesian scatterplot—that is, they want to know what the horizontal (x) and vertical (y) axes of the plot mean. For a projection, this isn't easy because these axes don't straightforwardly map to data dimensions but, rather, to combinations of dimensions. Luckily, we can compute the contribution of each of the original m data dimensions to the spread of points along the projection's x and y axes. Next, we can visualize these contributions by standard bar charts (see Figure 2c): for each dimension, the x and y axis legends show a bar indicating how much that dimension is visible on the x and y axes. Long bars, thus, indicate dimensions that strongly contribute to the spread of points along the horizontal and vertical directions. Figure 2c shows how this works: the dataset contains 583 patient records, each having 10 dimensions describing patients' gender, age, and eight blood measurements. The projection shows two clusters placed side by side.

How do we explain these? In the x axis legend, we see a tall orange bar, which tells us that this dimension (gender) is strongly responsible for the points' horizontal spread. If we color the points by their gender value, we see that, indeed, gender explains the clusters. Axis legends can also be used for 3D projections, as in Figure 2d, which shows a 3D projection of a 200,000-sample dataset with 10 dimensions coming from a simulation describing the formation of the early universe.14 As we rotate the 3D projection, the bars in the axis legends change lengths and are sorted from longest to shortest, indicating the best-visible dimensions from a given viewpoint (dimensions 5 and 7, in our case). A third legend (Figure 2d, top right) shows which dimensions we can't see well in the projection from the current viewpoint. These dimensions vary strongly along the viewing direction, so we shouldn't use the current viewpoint to reason about them. Biplot axes can also be inspected to get more detail. For example, we see that the projection's saddle shape is mainly caused by variable 7 and that the spike outlier is caused by a combination of dimensions 5 and 6. This interactive viewpoint manipulation of 3D projections effectively lets us create an infinite set of 2D scatterplot-like visualizations on the fly.
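One simple proxy for such axis legends (the exact measure behind Figure 2c may differ) is the absolute correlation between each original dimension's values and each projected coordinate, drawn as two bar charts:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.manifold import TSNE

    data = load_iris()
    X = data.data
    P = TSNE(n_components=2, random_state=0).fit_transform(X)

    # Contribution proxy: |correlation| of each dimension with each 2D axis.
    m = X.shape[1]
    contrib = np.array([[abs(np.corrcoef(X[:, j], P[:, k])[0, 1])
                         for k in (0, 1)]
                        for j in range(m)])

    fig, (ax_x, ax_y) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    ax_x.bar(range(m), contrib[:, 0]); ax_x.set_title('x-axis legend')
    ax_y.bar(range(m), contrib[:, 1]); ax_y.set_title('y-axis legend')
    for ax in (ax_x, ax_y):
        ax.set_xticks(range(m))
        ax.set_xticklabels(data.feature_names, rotation=45, ha='right')
    plt.tight_layout(); plt.show()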
Both biplot axes and axis legends explain a projection globally. If well-separated groups of points are visible, we can't directly tell which variables are responsible for their appearance without visually correlating the groups' positions with annotations, which can be tedious. Local explanations address this by explicitly splitting the projection into groups of points that admit a single (simple) explanation, depicting this explanation atop
Figure 3. Projection visualized with (a) distance-centric methods and (b) through (e) observation-centric methods. The ideal projection behavior is shown with red diagonal lines. The figures in each scatterplot show the aggregated normalized stress, telling us that LAMP is generally better than the other two studied projections.
here only three hot spots, meaning that the fourth one in Figure 3b wasn't caused by false neighbors. Figure 3d shows errors created by missing neighbors—that is, points close in m dimensions but far in 2D. The missing neighbors of the selected point of interest are connected by lines, which are bundled to simplify the image. The discrepancy between the 2D and original distances is also color coded on the points themselves. In this image, we see that the missing neighbors of the selected point are quite well localized on the other side of the projection. This typically happens when a closed surface in m dimensions is split by the projection to be embedded in 2D. Finally, Figure 3e shows, for a selected group of points, all the points that are closer in m dimensions to a point in the group than to any other point but closer to points outside that group in 2D. This lets us easily see if groups that appear in the projection are indeed complete or if they actually miss members.
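Both kinds of errors can be quantified straight from the two distance matrices. The sketch below gives one common formulation of the aggregated normalized stress shown in Figure 3, plus a simple k-nearest-neighbor notion of missing neighbors; both formulas are our reading of the figure, not code from the authors:

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    def normalized_stress(X, P):
        """One common form: sum (d_ij - p_ij)^2 / sum d_ij^2, where d are
        pairwise distances in m dimensions and p their 2D counterparts."""
        d = pdist(X)
        p = pdist(P)
        return np.sum((d - p) ** 2) / np.sum(d ** 2)

    def missing_neighbors(X, P, i, k=10):
        """Indices among point i's k nearest neighbors in m dimensions
        that are no longer among its k nearest neighbors in 2D."""
        Dm = squareform(pdist(X))
        D2 = squareform(pdist(P))
        nn_m = set(np.argsort(Dm[i])[1:k + 1])  # skip i itself
        nn_2 = set(np.argsort(D2[i])[1:k + 1])
        return sorted(nn_m - nn_2)

Drawing bundled lines from a selected point i to each index returned by missing_neighbors then yields pictures in the spirit of Figure 3d.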
Using Projections in Visual Analytics Workflows
So far, we've shown how we can construct projections, check their quality, and visually annotate them to explain the contained patterns. But how are projections used in complex visual analytics workflows? The most common way is to visually explore them while searching for groups, and when such groups appear, to use tools like the ones presented so far to explain them in terms of dimensions and dimension values.2 This is often done in data mining and machine learning.

We illustrate this with a visual analytics workflow for building classifiers for medical diagnosis.18 The advent of low-cost, high-accuracy imaging devices has enabled both doctors and the public to generate large collections of skin lesion images. Dermatologists want to automatically classify these into benign (moles) and potentially malignant (melanoma), so they can focus their precious time on analyzing the latter.
Figure 4. Using projections to build and refine classifiers in supervised machine learning.
For this, image classifiers can be used: each skin image is described in terms of several dimensions, or features, such as color histograms, edge densities and orientations, texture patterns, and pigmentation. Next, dermatologists manually label a training dataset of images as benign or malignant, using it to train a classifier so it becomes able to label new images. Other applications of machine learning include algorithm optimization, designing search engines, and predicting software quality.

Designing good classifiers is a long-standing problem in machine learning and is often referred to as the "black art" of classifier design.19 The problem is threefold: understanding discriminative features; understanding which observations are hard to classify and why; and selecting and designing features to improve classification accuracy. Projections can help all these tasks, via the workflow in Figure 4. Given a set of input observations, we first extract features that are typically known to capture their essence (step 1). This yields a high-dimensional data table with observations as rows and features as columns. We also construct a small training set by manual labeling. Next, we want to determine how easy the classification problem ahead of us will be. For this, we project the training set and color observations by class labels (step 2). If the classes we wish to recognize are badly separated, it makes little sense to spend energy on designing and testing a classifier, since we seem to have a poor feature choice (step 4). We can then interactively select the desired class groups in the projection and see which features discriminate them best,18 repeating the cycle with a different feature subset (step 5). If, however, classes are well separated in the projection (step 3), our features discriminate them well, so the classification task isn't too hard. We then proceed to design, train, and test the classifier (step 6). If the classifier yields a good performance, we're done: we have a production-ready system (step 7). If not, we can again use projections to see which are the badly classified observations (step 8), which features are responsible for this (step 9), and engineer new features that separate these better (step 10). In this workflow, projections serve two key tasks: predicting the ease of building a good classifier ahead of the actual construction (T1), thereby saving us from designing a classifier with unsuitable features, and showing which observations are misclassified and their feature values (T2), thereby helping us design better features in a targeted way.
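The sketch below walks through steps 1, 2, 6, and 8 of Figure 4 on a stand-in tabular dataset (the skin-lesion features aren't public, so scikit-learn's breast-cancer data substitutes for them; the classifier choice is likewise ours):

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.manifold import TSNE
    from sklearn.model_selection import train_test_split

    # Step 1: a precomputed feature table, observations x features.
    data = load_breast_cancer()
    X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target,
                                              random_state=0)

    # Step 2: project the training set and color by class label.
    P = TSNE(n_components=2, random_state=0).fit_transform(X_tr)
    plt.scatter(P[:, 0], P[:, 1], c=y_tr, cmap='coolwarm', s=10)
    plt.title('Well-separated classes suggest the features are adequate')
    plt.show()

    # Step 6: if separation looks good, design, train, and test a classifier.
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print('test accuracy:', clf.score(X_te, y_te))

    # Step 8: project the test set and ring the misclassified observations.
    P_te = TSNE(n_components=2, random_state=0).fit_transform(X_te)
    wrong = clf.predict(X_te) != y_te
    plt.scatter(P_te[:, 0], P_te[:, 1], c=y_te, cmap='coolwarm', s=10)
    plt.scatter(P_te[wrong, 0], P_te[wrong, 1], facecolors='none',
                edgecolors='k', s=60)
    plt.show()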
Projections are an emerging instrument for the visual exploration of large high-dimensional datasets. Complemented by suitable visual explanations, they're intuitive, easy to use, visually compact, and easy to learn for users familiar with scatterplots. Recent technical developments allow their automatic computation from large datasets in seconds, helping users avoid complex parameter settings or needing to understand the underlying technicalities. As such, they're part of the visual data scientist's kit of indispensable tools.

But as projections become increasingly useful and usable, several new challenges have emerged. Users require new ways to manipulate a projection to improve its quality in specific areas, to obtain the best-tuned results for their datasets and problems. Developers require consolidated implementations of projections that they can integrate into commercial-grade applications such as Tableau. And last but not least, users and scientists require more examples of workflows showing how projections can be used in visual analytics sensemaking to solve problems in increasingly diverse application areas.

References
1. S. Liu et al., "Visualizing High-Dimensional Data: Advances in the Past Decade," Proc. EuroVis–STARs, 2015, pp. 127–147.
2. C. Sorzano, J. Vargas, and A. Pascual-Montano, "A Survey of Dimensionality Reduction Techniques," 2014; http://arxiv.org/pdf/1403.2877.
3. W.S. Torgerson, "Multidimensional Scaling of Similarity," Psychometrika, vol. 30, no. 4, 1965, pp. 379–393.
4. J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, no. 5500, 2000, pp. 2319–2323.
5. U. Brandes and C. Pich, "Eigensolver Methods for Progressive Multidimensional Scaling of Large Data," Proc. Graph Drawing, Springer, 2007, pp. 42–53.
6. C. Faloutsos and K.-I. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets," SIGMOD Record, vol. 24, no. 2, 1995, pp. 163–174.
7. K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990.
8. I.T. Jolliffe, Principal Component Analysis, Springer, 2002.
9. F.V. Paulovich, C.T. Silva, and L.G. Nonato, "Two-Phase Mapping for Projecting Massive Data Sets," IEEE Trans. Visualization and Computer Graphics, vol. 16, no. 6, 2010, pp. 1281–1290.
10. P. Joia et al., "Local Affine Multidimensional Projection," IEEE Trans. Visualization and Computer Graphics, vol. 17, no. 12, 2011, pp. 2563–2571.
11. M. Greenacre, Correspondence Analysis in Practice, 2nd ed., CRC Press, 2007.
12. P. Pagliosa et al., "Projection Inspector: Assessment and Synthesis of Multidimensional Projections," Neurocomputing, vol. 150, 2015, pp. 599–610.
13. M. Greenacre, Biplots in Practice, CRC Press, 2007.
14. D. Coimbra et al., "Explaining Three-Dimensional Dimensionality Reduction Plots," Information Visualization, vol. 15, no. 2, 2015, pp. 154–172.
15. R. da Silva et al., "Attribute-Based Visual Explanation of Multidimensional Projections," Proc. EuroVA, 2015, pp. 134–139.
16. F.V. Paulovich et al., "Semantic Wordification of Document Collections," Computer Graphics Forum, vol. 31, no. 3, 2012, pp. 1145–1153.
17. R.M. Martins et al., "Visual Analysis of Dimensionality Reduction Quality for Parameterized Projections," Computers & Graphics, vol. 41, 2014, pp. 26–42.
18. P.E. Rauber et al., "Interactive Image Feature Selection Aided by Dimensionality Reduction," Proc. EuroVA, 2015, pp. 54–61.
19. P. Domingos, "A Few Useful Things to Know about Machine Learning," Comm. ACM, vol. 55, no. 10, 2012, pp. 78–87.

Renato R.O. da Silva is a PhD student at the University of São Paulo, Brazil. His research interests include multidimensional projections, information visualization, and high-dimensional data analytics. Contact him at [email protected].

Paulo E. Rauber is a PhD student at the University of Groningen, the Netherlands. His research interests include multidimensional projections, supervised classifier design, and visual analytics. Contact him at [email protected].

Alexandru C. Telea is a full professor at the University of Groningen, the Netherlands. His research interests include multiscale visual analytics, graph visualization, and 3D shape processing. Telea received a PhD in computer science (data visualization) from the Eindhoven University of Technology, the Netherlands. Contact him at [email protected].

Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.
Computers in Cars
by Charles Day

The major role that computational devices play in cars became dramatically apparent last September, when the Environmental Protection Agency announced the results of its investigation into Volkswagen. The EPA discovered that the German automaker had installed software in some of its diesel-engine cars that controlled a system for reducing the emission of environmentally hostile nitrogen oxides—but only during an emissions test. On the open road, the car belched nitrogen oxides, unbeknownst to its driver.

Computers have been stealthily controlling cars for decades. My second car, a 1993 Honda Civic hatchback, had a computational device—an engine control unit—whose microprocessor received data from sensors in and around the engine. On the basis of those data, the ECU would consult preprogrammed lookup tables and adjust actuators that controlled and optimized the mix of fuel and air, valve timing, idle speed, and other factors. This combination of ECU and direct fuel injection not only reduced emissions and boosted engine efficiency, it was also less bulky and mechanically simpler than the device it replaced, the venerable carburetor.
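At its heart, such an ECU is table interpolation in a loop. Here's a toy sketch of the idea, with invented numbers and axes rather than anything from Honda's firmware: base injection duration is looked up from a grid indexed by engine speed and load, interpolating between grid points:

    import numpy as np

    rpm_axis = np.array([800, 1600, 2400, 3200, 4000])   # engine speed (rpm)
    load_axis = np.array([0.2, 0.4, 0.6, 0.8, 1.0])      # relative engine load

    # Preprogrammed base injection durations in milliseconds (made up).
    fuel_ms = np.array([
        [1.8, 2.4, 3.1, 3.9, 4.8],
        [2.0, 2.7, 3.5, 4.4, 5.4],
        [2.3, 3.0, 3.9, 4.9, 6.0],
        [2.6, 3.4, 4.4, 5.5, 6.7],
        [3.0, 3.9, 5.0, 6.2, 7.5],
    ])  # rows follow load_axis, columns follow rpm_axis

    def injection_ms(rpm, load):
        """Bilinear lookup: interpolate along rpm within each load row,
        then along load between the resulting values."""
        per_row = [np.interp(rpm, rpm_axis, row) for row in fuel_ms]
        return float(np.interp(load, load_axis, per_row))

    # For example, cruising at 2,000 rpm and 50 percent load:
    print(injection_ms(2000.0, 0.5))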
Unfortunately, however, the trend for computers in cars is toward greater complexity, not simplicity. Consider another Honda, the second-generation Acura NSX, which went on sale earlier this year. The supercar's hybrid power train consists of a turbocharged V6 engine mated to three electric motors: one each for the two front wheels and one for the two rear wheels. An array of sensors, microprocessors, and actuators ensures that all three motors are optimally deployed during acceleration, cruising, and braking.

And talking of braking, the NSX's brake pedal isn't actually mechanically connected to the brakes. Rather, it activates a rheostat, which controls the brakes electronically. To preserve the feel of mechanical braking, a sensor gauges how much hydraulic pressure to push back on the driver's foot.

In Formula One racing, the proliferation of computer control has led to an arms race among manufacturers, which reached its apogee in 1993. Thanks in part to its computer-controlled anti-lock brakes, traction control, and active suspension, the Williams FW15C won 10 of the season's 16 races. The sport's governing body responded by restricting electronic aids. By the 2008 season, all cars were compelled to use the same standard ECU. The 23-year-old Williams FW15C retains a strong claim to being the most technologically sophisticated Formula One car ever built.

Computers aren't confined to supercars or racing cars. The July issue of Consumer Reports ranked cars' infotainment systems, with Cadillac's being among the worst. Owners reported taking months, even years, to master its user interface. "This car REALLY needs a co-pilot with an IT degree," one despairing owner told the magazine. And this past May, USA Today reported that consumer complaints about vehicle software problems filed with the National Highway Traffic Safety Administration (NHTSA) jumped 22 percent in 2015 compared with 2014. Recalls blamed on software rose 45 percent.

I'm not against computers in cars. Rather, I worry that their encroachment will become so complete that consumers like me will be deprived of the choice to buy a car that lacks such fripperies as a remote vehicle starter system, rear vision camera, head-up display, driver seat memory, lane departure warning system, and so on. I worry, too, that even as the NHTSA records more software problems, it's also considering whether to mandate computer-controlled safety features.
S o although I wouldn’t turn down an Acura NSX, I’d rather drive one of its ances-
tors, the Honda S800 roadster, circa 1968.
Charles Day is Physics Today’s editor in chief. The views in this column are his own and not nec-
essarily those of either Physics Today or its publisher, the American Institute of Physics.