Vance et al. - 2024 - Big Data in Earth Science Emerging Practice and Promise
The five tenets of big data and the environments where they hold the most promise for new discoveries. [Figure created by Kate Culpepper, NOAA/NOS/US IOOS Program]

READ THE FULL ARTICLE AT https://doi.org/10.1126/science.adh9607

1NOAA/US Integrated Ocean Observing System (IOOS), Silver Spring, MD 20910, USA. 2NASA Jet Propulsion Laboratory, California Institute of Technology, Pasadena, CA 91109, USA. 3Esri, Redlands, CA 92373, USA.
*Corresponding author. Email: [email protected]
†These authors contributed equally to this work.

Over the past decade, improvements in the number and resolution of Earth- and satellite-based sensors and finer-resolution climate and meteorological models have resulted in an explosion in the volume of Earth science data. These data have enabled scientists to explore complex phenomena and Earth processes at unprecedented scales and resolutions and have fueled the development of new tools and techniques for data analysis, visualization, and interpretation. As a result, some have argued that big data represent a fourth leg of scientific research, distinct from theory, experimentation, and computation (1). However, despite its transformative impacts on Earth science, a clear and concise definition of big data is elusive (2). Big data are traditionally defined by five V’s: volume, velocity, veracity, variety, and value. These describe data that are large, arrive quickly, may be of mixed reliability or accuracy, are in a number of formats, and have high value. Used as early as the 1990s, the term big data described how improvements in sensors, models, data management, and computing resources made it possible to gather and analyze ever-larger datasets (3). The term has coevolved, and will continue to coevolve, with the ever-changing advancements in technology and datafication (4) of our natural and human systems. Big data can also be thought of as data that are (i) deep, in that there are simply a large number of measurements; (ii) wide, because to understand a problem, data from a wide variety of sources or sensors are needed, and data may be collected over large spatial or temporal domains; and (iii) unruly, in that the data, for example data from social media, are not in a consistent format. The definition of these terms will always be relative and will evolve as technology changes, especially with improvements in computing power and onboard processing of data. As an example of deep data, starting in the 1970s, the World Ocean Circulation Experiment (WOCE) hydrographic surveys took 24 to 36 salinity, temperature, and dissolved oxygen samples from the surface to the bottom every 30 nautical miles (~55 km) (http://woceatlas.tamu.edu/printed/SOA_WOCE.html). Argo floats now collect these variables and additional measurements using unmanned floats and collect 400 or more profiles a day from 3800 floats across the entire ocean (https://argo.ucsd.edu/about/). Wide data include projects such as the regional cabled array on the Juan de Fuca Ridge (https://interactiveoceans.washington.edu/about/regional-cabled-array/), which has deployed an array of 150 sensors across much of a 500 km–by–150 km tectonic plate. Data can be transmitted at up to 240 gigabytes per second, and two-way communication allows real-time control of instruments. Unruly data include a variety of crowdsourced data, such as weather conditions described in social media posts as opposed to data provided by trained weather spotters.

For this Review, we define Big Earth Data as massive (relative to commonly available datasets) amounts of diverse, complex, and continuously accumulating data generated from heterogeneous sources that require advanced (relative to commonly available computing resources) and potentially novel computational and analytical tools to extract meaningful insights and knowledge about the Earth system. Earth science is a vast and complex field that studies Earth’s physical, chemical, and

raw data to information products, and the emergence of digital twins.

How did we get here, who actually uses big data, and how?
The development of new sensors able to gather more data for longer time periods and the ability to measure totally new variables have contributed to ever-larger Earth science datasets (Fig. 1). Drifting buoys and floats are widely deployed, and better communications allow buoys to send data to shore in near real time (5). Autonomous vehicles go on extended missions gathering data, and tagged animals send back streams of environmental and behavioral data. Passive acoustic data show how animals use sound and the impacts of anthropogenic sounds (6). Models cover larger areas at finer spatial scales and can be run for longer durations because of improvements in computing, such as device scaling, high-performance, and cloud computing generally (7, 8). The greater acceptance of citizen and crowdsourced data has increased the amount of unruly data that can contribute to research on climate change and other phenomena (9). Crowdsourced data, such as personal weather stations, have expanded to include bathymetric data, data from watches and body-worn sensors, social media posts about weather and oceanic phenomena, and sensors on trucks reporting weather and road conditions. New analytical methods, such as artificial intelligence (AI), neural networks, and machine learning, make use of these big datasets to study everything from predicting seafloor carbon (10) to seismology (11).

The need for big data was understood as early as 2016 by Schnase et al. (12), who argued that the big questions and projects in climate science, such as the Intergovernmental Panel on Climate Change (IPCC), are tackling problems that require big data for solutions. More recently, big data have been used to support research
toward the UN Sustainable Development Goals (SDGs), such as climate action (SDG 13) and life below water (SDG 14) (13). Comprehensive reviews cover the applications of big data in geophysics (14) and biology (15) and the use of big data and AI (16). The European Marine Board’s Future Science Brief on Big Data in Marine Science (17) identifies continued sensor development; infrastructure for data collection, processing, and archiving; near–real-time data transmission; and long-term funding as core recommendations.

Determining who is actually using big data can be a challenge. Ideally, papers using big data would formally cite data DOIs, both to enable tracking of data usage and as a way to associate datasets and the researchers responsible for curating and publishing them with their use and citation in traditional peer-reviewed publications (18). Another way to extract trends is to look at the use of data from large repositories and project archives. The creation of world data centers amid the International Geophysical Year gave a real push to creating big datasets (19). The TOGA/TAO/TRITON buoy array maintains a list of more than 1000 publications citing their data since the array’s inception in 1986 (https://www.pmel.noaa.gov/gtmba/tao-journal-publications). NCAR’s Research Data Archive maintains a similar listing for their datasets at https://rda.ucar.edu/resources/metrics/, which also includes more than 1000 citations for their datasets. NOAA’s Open Data Dissemination Program has assembled a list of papers citing their data, which shows 56 publications using their 14 most popular datasets. The Ocean Observatories Initiative lists 333 papers using data from their cabled arrays (https://ooipublications.whoi.edu/biblio). Although the NASA archives do not list citations, the Physical Oceanography Distributed Active Archive Center (PO.DAAC) does provide data on the volume of downloads. In 2022, 11 terabytes per day were downloaded, and the archive grew by 2 terabytes per day (https://www.earthdata.nasa.gov/eosdis/system-performance-and-metrics/eosdis-annual-metrics-reports).

Tools for discovering, analyzing, and visualizing big data are rapidly developing, with both commercial and open-source solutions available. Platforms like Google Earth Engine (GEE), a cloud-based geospatial analysis platform; Microsoft’s Planetary Computer, a dedicated computing platform for Earth science data analysis; and Esri’s Living Atlas of the World, a comprehensive collection of geospatial datasets and applications, reduce the challenges faced by Earth scientists transitioning to data science roles (20, 21). Open-source

Fig. 1. Increases in the spatial granularity of climate models and the number of Earth- and space-based sensors. The right axis shows the increase (green) in the number of active satellites beginning with the launch of Oscar 7 in 1974. The left axis shows the rapid increase (blue) in the number of ocean observations (casts) archived in the NOAA/IODE World Ocean Database (WOD). WOD data before 1974 are not shown. The inset depicts the rapid increase in the granularity of a representative climate model, CMIP.

Surface hydrology
Surface waters in the form of rivers, lakes, reservoirs, streams, and wetlands are critical to human survival, yet are increasingly scarce as a result of water quantity and water quality issues (23). Surface water presents specific data challenges because of the amount of water on Earth and its ephemeral nature. Flowing surface waters can range in width from a few meters to many kilometers, and their extent can change across timescales ranging from minutes to decades. Big data have enabled more accurate and complete delineations of surface water, a more detailed understanding of surface water dynamics, and informed models that more accurately quantify terrestrial water budgets. A necessary first step in understanding surface waters is delineating their spatial extent and generating a digital hydrographic map (24). MERIT Hydro (http://hydro.iis.u-tokyo.ac.jp/~yamadai/MERIT_Hydro/) and HydroSHEDS (https://www.hydrosheds.org/) are commonly used digital hydrographic products derived from 90- and 30-m digital elevation models (DEMs), respectively. Both products suffer from unavoidable data gaps and inconsistencies in the underlying data sources.

The HydroSHEDS product is being re-engineered on the basis of improved DEMs generated from data from the TanDEM-X mission. The TanDEM-X dataset, which has a resolution of 12 m at the equator, serves as the foundation for HydroSHEDS v2. This enhanced global DEM incorporates advanced preprocessing techniques to preserve the high-resolution details of the DEM. These techniques include “an infill of invalid and unreliable elevation values, an automatic coastline delineation
refined with manual corrections, an AI-based water detection algorithm, and a modification of elevation data in urban and vegetated areas for improved evaluation of the flow of water.” Additionally, the hydrologically preconditioned DEM and the water body mask derived from the TanDEM-X dataset undergo further processing with “refined hydrological optimization and correction algorithms to derive flow direction and flow accumulation maps” (25). The promise is a globally consistent, high-resolution digital hydrographic map.

Surface waters are dynamic and generate costly and deadly floods. Combining a spatially granular (10-m resolution) map of the height above the nearest drainage (HAND)

A three-dimensional ocean
Understanding the ocean as a three-dimensional (3D) system has long been a goal of researchers. Although satellites can provide a synoptic view, they do not see into the ocean depths. Models can represent the full 3D ocean, but they rely on observations both as inputs and to verify results. Big data can be the result of long-term campaigns, such as the California Cooperative Oceanic Fisheries Investigations (CalCOFI), which has been observing the California current ecosystem since 1949 (https://calcofi.org/). The desire to better understand El Niño and its effects on climate led to the deployment of the TOGA/TAO arrays of buoys in the tropical Pacific starting in 1985 (26). This is an early example of gathering detailed and extensive observations resulting in big data. Although the data are carefully formatted and quality controlled, they are wide in time and space and deep in that they measure a large number of variables. TAO data have been used to study many phenomena in the tropical Pacific, for example to better understand El Niño and its effects, to look at thermal structures in the ocean, to look at variations in wind and rain in the tropical Pacific, and to conduct basin-wide studies of sea surface temperature.

More recently, new technologies, such as Argo floats, surface and underwater autonomous vehicles, and the Ocean Observatories Initiative (OOI), gather and transmit large amounts of data, often in real time, to provide a 3D view of previously data-poor regions (Fig. 2). These data have supported discoveries in everything from cross-shelf transport of water (27) to sensing whale calls (28) and ship traffic (29) to studying heat flux from hydrothermal vents (20) and seismology (11).

Models of the ocean (30) have become more detailed, can cover larger areas, and increasingly include assimilation of observed data (31). Models are both the product of big data–supported boundary conditions, including detailed bathymetric grids and forcing data from river flows and wind patterns, and the producer of big data outputs. Improvements in mod-

measurements of the heat inputs to hurricanes, and improved modeling of oil spills and the dispersion of larval fish. These models have contributed to improved forecasting of hurricanes, the siting of marine protected areas, oil spill response, and improved understanding of marine ecosystems. As model outputs become ever larger, there are questions about exactly what outputs should be archived and what can better be recreated for specific needs (32).

In biology, researchers have been able to better understand the growth and extent of harmful algal blooms (HABs) using large datasets of images of plankton. In a modern version of the Continuous Plankton Recorder (33), the Imaging FlowCytobot (IFCB) collects large numbers of images of plankton. The images

Fig. 2. Increase in the volume of ocean temperature measurements. (A) The number of temperature observations at all depths collected in 1995. An emphasis on near-coast data collection and oversampling along ship routes is apparent. (B) The number of temperature observations collected in 2022, demonstrating the rapid expansion of ocean observing systems, such as the Argo float network. Data source: NOAA/IODE World Ocean Database (WOD).
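A density map of the kind shown in Fig. 2 is, at its core, a count of observations per grid cell. The sketch below shows one way such counts might be computed, assuming a regular global latitude-longitude grid and hypothetical input arrays `lats` and `lons`. The function name `count_observations`, the 1° default cell size, and the sample positions are illustrative choices, not the method or data used to produce the figure.

```python
import numpy as np


def count_observations(lats, lons, cell_deg=1.0):
    """Count observations in each cell of a regular global lat/lon grid.

    Returns an integer array of shape (n_lat, n_lon). Row 0 is the
    southernmost band; column 0 starts at 180 degrees west.
    """
    n_lat = int(round(180.0 / cell_deg))
    n_lon = int(round(360.0 / cell_deg))
    lat_edges = np.linspace(-90.0, 90.0, n_lat + 1)
    lon_edges = np.linspace(-180.0, 180.0, n_lon + 1)
    # histogram2d bins the (lat, lon) pairs against the two sets of edges.
    counts, _, _ = np.histogram2d(lats, lons, bins=[lat_edges, lon_edges])
    return counts.astype(int)


# Hypothetical cast positions (degrees north, degrees east), not real data.
lats = np.array([10.5, 10.7, -45.5])
lons = np.array([-120.2, -120.9, 30.5])

counts = count_observations(lats, lons)
print(counts.shape, counts.sum())
```

Comparing two such arrays built from different years of positions would expose the kind of contrast the two panels describe, for example heavy sampling along ship routes in one year against broader coverage in another.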
health of plankton populations. Environmental DNA (eDNA) measurements can rapidly gather large amounts of data on the species present in a location, the presence of undetected species, genetic relationships between populations, and the general biodiversity of an area or region (35).

Atmospheric modeling: Global climate models
Global climate models (GCMs) are complex mathematical representations of Earth’s climate system that incorporate the physical, chemical, and biological processes governing the atmosphere, oceans, land surfaces, and ice (36). Since their inception in the early 1960s, GCMs have evolved significantly in terms of complexity, resolution, and the inclusion of

(ii) long-distance domestic moves are more prevalent compared with local or international moves, (iii) slow-onset changes like droughts drive increased migration more than rapid-onset changes like floods, and (iv) the severity of climate shocks affects migration in a nonlinear manner, influenced by the dominance of either capability or vulnerability channels.

The growing frequency and intensity of severe weather events can have significant economic impacts on agricultural activities and can threaten local and global food security (34). Using climate data from the ERA5 (https://www.ecmwf.int/en/research/climate-reanalysis) atmospheric reanalysis model (2000 through August 2018) and 82,000 crop yield reports, the authors could explain 65% of historical yield anomalies using

and modeled analysis-ready data (https://www.noaa.gov/information-technology/open-data-dissemination, https://www.earthdata.nasa.gov/eosdis/cloud-evolution). These massive data collections proximate to massive computer resources hold great promise for enhancing the sharing, reproducibility, replicability, and collaborative nature of research. These concepts are foundational to operationalizing the larger goal of open science: “the principle and practice of making research products and processes available to all, while respecting diverse cultures, maintaining security and privacy, and fostering collaborations, reproducibility and equity” (49).

Earth information products
DTs of the Earth system are intended to establish highly accurate digital representations of the Earth system so that they can improve our understanding of the impacts of climate change and extreme weather events and potentially help us better assess potential socioeconomic and health impacts. The DT concept has proven to be effective in various commercial sectors (58, 59). Figure 3 provides a high-level representation of a DT for the Earth system as an integrated information system, where each oval contributes to the overall DT goals. The NASA Advanced Information Systems Technology (AIST) program’s website (https://esto.nasa.gov/aist) summarizes the primary goals of DTs of Earth: (i) to provide a continuous and accurate representation

involve possible mitigation options as well as acquiring additional data and analysis.

DTs present multidisciplinary, multivariate data challenges, that is, the five Vs of big data. The expectation for a DT of Earth is to mirror the Earth science system to not only understand the current condition of our environment or climate but also to automatically analyze changes in our environment and autonomously acquire new data to improve its prediction and forecast (60). For a DT of the Earth system to be useful, it must accurately represent the interactions or forcing between the subsystems. The accuracy of a DT is highly reliant on the quality of the data and the analysis that it incorporates. AI plays a vital role in the overall DT architecture. Our rapidly growing collec-

actions and causes and effects. It could involve various remote sensing and in situ measurements. Depending on the type of analysis (i.e., global or regional), it would also require different resolutions of data. Working with large collections of data requires new methods in managing data to promote parallel computing of the data.
2) Assimilation and numerical models. The GCM section presented the complexity of developing and running large-scale numerical model simulations. DTs require ongoing updates of the model runs with the latest state of the Earth system to generate the most accurate forecast.
3) Advanced AI. DTs require capabilities to identify the relevant data, analysis, and mod-
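The assimilation requirement described for DTs, in which model runs are repeatedly updated with the latest observed state, can be illustrated with a scalar, Kalman-style analysis update. This is a textbook simplification offered as a sketch under stated assumptions, not the scheme used by any system cited in this Review. The function name `assimilate` and the example numbers are hypothetical.

```python
def assimilate(forecast, forecast_var, observation, obs_var):
    """Blend a model forecast with an observation, weighting each by its
    error variance (scalar Kalman-style analysis step).

    Returns the updated (analysis) value and its error variance.
    """
    # The gain approaches 1 when the forecast is much less certain
    # than the observation, and 0 in the opposite case.
    gain = forecast_var / (forecast_var + obs_var)
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


# Hypothetical example: a sea surface temperature forecast of 15.0 degrees C
# (error variance 4.0) is corrected by an observation of 17.0 degrees C
# (error variance 1.0).
analysis, analysis_var = assimilate(15.0, 4.0, 17.0, 1.0)
print(analysis, analysis_var)
```

With these numbers the analysis lands closer to the observation than to the forecast, and its variance is smaller than either input variance. In a cycling system the analysis would seed the next model run, and the update would repeat as new observations arrive.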
Fig. 3. High-level representation of the elements of a DT for the Earth system. The cyclic framework signifies the continuous flow of information within the DTs that bridge the physical and the digital representations. The arrows illustrate the general information flow between the elements: (1) The harmonized, analysis-optimized data management solution for fast access and analysis. (2a) AI plays an essential role in DTs, including identifying the relevant data, analysis, and numerical models as well as resource management. AI formalizes the process to learn from the past to improve the accuracy of future predictions. AI-based models require ongoing training and continuous validation. (2b) Advanced, physics-based models are essential to forecast the environment’s reactions. Like AI-based models, numerical models require continuous assimilation and validation. (3) One of the promises of DTs of Earth is to deliver actionable predictions. Actions could include mitigation recommendations, new observations, and analysis to improve our understanding. (4) New observations could include retasking Earth-observing instruments, deploying unmanned vehicles, acquiring data from in situ sensors, and on-demand value-added product generations, etc. (5) The acquired or processed data are made available to the DT to be incorporated for improving analysis and predictions.

Conclusions
This Review focuses on three subdisciplines of Earth science (hydrography, oceanography, and climate science) to illustrate the profound and ongoing impacts of big data on the broader discipline. Big data have facilitated advancements in understanding surface water, such as more accurate delineations of riparian networks, improved flood predictions, and comprehensive water supply and demand assessments. In oceanography, big data have enabled researchers to achieve a deeper understanding of the 3D nature of the ocean, leading to discoveries in areas ranging from cross-shelf water transport to whale behavior. The improved spatial and temporal granularity of GCMs has resulted in insights into climate processes and interactions. The new insights enabled through big data come at a cost. The size of big data presents challenges to scientific replicability and reproducibility and to data sharing and management. Ensuring that these datasets meet the requirements of the FAIR/CARE (65) principles and bridge the gap between data and information is a continuing effort. Traditional incentives for researchers to publish papers need to be supplemented by investing time in properly curating datasets. These challenges are outweighed by the promise that big data bring for a deeper, more comprehensive un-

Sciences. Izv. Russ. Acad. Sci., Phys. Solid Earth 58, 1–29 (2022). doi: 10.1134/S1069351322010037
15. E. Aronova, K. S. Baker, N. Oreskes, Big science and big data in biology: From the international geophysical year through the international biological program to the long term ecological research (LTER) Network, 1957–Present. Hist. Stud. Nat. Sci. 40, 183–224 (2010). doi: 10.1525/hsns.2010.40.2.183
16. E. Verdu, Y. V. Nieto, N. Saleem, Big data and artificial intelligence in earth science: Recent progress and future advancements. Acta Geophys. 71, 1373–1375 (2023). doi: 10.1007/s11600-023-01051-2
17. L. Guidi et al., Big Data in Marine Science, Zenodo (2020); https://doi.org/10.5281/zenodo.3755793.
18. Data Citation Synthesis Group, Joint Declaration of Data Citation Principles, M. Martone, Ed. (FORCE11, 2014); https://doi.org/10.25490/a97f-egyk.
19. F. L. Korsmo, The Origins and Principles of the World Data Center System. Data Sci. J. 8, IGY55–IGY65 (2010). doi: 10.2481/dsj.SS_IGY-011
20. L. M. Bouwer et al., Eds., Integrating Data Science and Earth Science, Springer Briefs in Earth System Sciences (Springer, 2022). doi: 10.1007/978-3-030-99546-1_1
37. V. Eyring et al., Overview of the Coupled Model Intercomparison Project Phase 6 (CMIP6) experimental design and organization. Geosci. Model Dev. 9, 1937–1958 (2016). doi: 10.5194/gmd-9-1937-2016
38. R. Séférian et al., Tracking Improvement in Simulated Marine Biogeochemistry Between CMIP5 and CMIP6. Curr. Clim. Change Rep. 6, 95–119 (2020). doi: 10.1007/s40641-020-00160-0; pmid: 32837849
39. M. D. Priestley et al., An overview of the extratropical storm tracks in CMIP6 historical simulations. J. Clim. 33, 6315–6343 (2020). doi: 10.1175/JCLI-D-19-0928.1
40. D. Carvalho, A. Rocha, X. Costoya, M. deCastro, M. Gómez-Gesteira, Wind energy resource over Europe under CMIP6 future climate projections: What changes from CMIP5 to CMIP6. Renew. Sustain. Energy Rev. 151, 111594 (2021). doi: 10.1016/j.rser.2021.111594
41. D. J. Kaczan, J. Orgill-Meyer, The impact of climate change on migration: A synthesis of recent empirical insights. Clim. Change 158, 281–300 (2020). doi: 10.1007/s10584-019-02560-0
42. D. Beillouin, B. Schauberger, A. Bastos, P. Ciais, D. Makowski, Impact of extreme weather conditions on European crop production in 2018. Phil. Trans. R. Soc. B 375, 20190510
58. L. James, Digital twins will revolutionise healthcare: Digital twin technology has the potential to transform healthcare in a variety of ways – improving the diagnosis and treatment of patients, streamlining preventative care and facilitating new approaches for hospital planning. Eng. Technol. 16, 50–53 (2021). doi: 10.1049/et.2021.0210
59. M. Farsi, A. Daneshkhah, A. Hosseinian-Far, H. Jahankhani, Eds., Digital Twin Technologies and Smart Cities (Springer, 2020).
60. A. Fuller, Z. Fan, C. Day, C. Barlow, Digital Twin: Enabling Technologies, Challenges and Open Research. IEEE Access 8, 108952–108971 (2020). doi: 10.1109/ACCESS.2020.2998358
61. T. Huang et al., “Big Data Smart: Federated Earth System Digital Twins,” paper TH1.R10.2, International Geoscience and Remote Sensing Symposium (IGARSS), Pasadena, CA, 16 to 21 July 2023.
62. T. Huang et al., “Applications of Open-Source Digital Twins Framework for Wildfire and Air Quality,” paper WE2.R9.4, International Geoscience and Remote Sensing Symposium (IGARSS), Pasadena, CA, 16 to 21 July 2023.
63. R. Rodriguez-Suquet et al., “The SCO-FloodDAM Project Toward a Digital Twin for Flood Detection, Prediction and Flood Risk Assessments,” paper WE2.R9.3, International Geoscience and Remote Sensing Symposium (IGARSS), Pasadena, CA, 16 to 21 July 2023.
64. P. Bauer, B. Stevens, W. Hazeleger, A Digital Twin of Earth for the Green Transition. Nat. Clim. Chang. 11, 80–83 (2021). doi: 10.1038/s41558-021-00986-y
65. Findable, Accessible, Interoperable and Reusable (FAIR) and Collective Benefit, Authority to Control, Responsibility and Ethics (CARE).

ACKNOWLEDGMENTS
We thank M. Wengren, R. Doel, N. Merati, and three anonymous reviewers for helpful comments and suggestions. We thank B. Voss of the NOAA Central Library and K. Szura of NOAA’s Open Data Dissemination Program for bibliographic research and for locating papers citing or using big data. Author contributions: All authors contributed equally. Competing interests: The authors declare that they have no competing interests. License information: Copyright © 2024 the authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original US government works. https://www.science.org/about/science-licenses-journal-article-reuse

Submitted 20 June 2023; accepted 23 January 2024
10.1126/science.adh9607