Computational Intelligent
Data Analysis for
Sustainable Development
Edited by
Ting Yu, Nitesh V. Chawla, and Simeon Simoff
Chapter 1

Computational Intelligent Data Analysis for Sustainable Development: An Introduction and Overview
CONTENTS
1.1 Introduction to Sustainable Development
1.2 Introduction to Computational Intelligent Data Analysis
1.2.1 Process of Computational Intelligent Data Analysis
1.3 Computational Intelligent Data Analysis for Sustainable Development
1.3.1 Spatio-Temporal Data Analysis
1.3.2 Integrated Sustainability Analysis
1.3.3 Computational Intelligent Data Analysis for Climate Change
1.3.4 Computational Intelligent Data Analysis for Biodiversity and Species Conservation
1.3.5 Computational Intelligent Data Analysis for Smart Grid and Renewable Energy
1.3.6 Computational Intelligent Data Analysis for Sociopolitical Sustainability
1.4 Conclusion and Research Challenges
References
[Figure: The three pillars of sustainability. Economic sustainability: profit, cost savings, economic growth, research & development. Sociopolitical sustainability: standard of living, education, community, equal opportunity. Environmental sustainability: natural resource use, environmental management, pollution prevention.]
1.2 INTRODUCTION TO COMPUTATIONAL
INTELLIGENT DATA ANALYSIS
Over the past 50 years, we have witnessed the creation of high-capacity dig-
ital data storage and powerful CPUs (central processing units) to store and
process millions and millions of bytes of data. Pens and paper have been
replaced by computers; days of mindless calculation have been replaced
by a command to the machine, which then effortlessly, accurately, effec-
tively, and instantaneously carries out the calculation. The popularization
of computers has enabled modern data analysts to type in a few symbols to complete tasks that previously took days to perform.
A possible definition of data analysis is the process of computing various
summaries and derived values from the given collection of data [8]. No one
sets out simply to analyze data. One always has some objective in mind:
one wants to answer certain questions. These questions might be high-level
general questions, perhaps exploratory; or the questions might be more
specifically confirmatory. Orthogonal to the exploratory/confirmatory
distinction, we can also distinguish between descriptive and inferential
analysis. A descriptive (or summarizing) analysis is aimed at making a
statement about the dataset at hand. In contrast, an inferential analysis
is aimed at trying to draw conclusions that have more general validity.
Often, inferential studies are based on samples from a population, the aim
of which is to try to make some general statements about some broader
aspect of that population, most (or some) of which has not been observed.
In practice, data analysis is an iterative process (Figure 1.2). After a goal
(or a question) is defined to indicate the success of the data analysis, the
relevant data is collected. One studies the data, examining it using some
analytic techniques. One might decide to look at it another way, perhaps
modifying it in the process by transformation or partitioning, and then go
back to the beginning and apply another data analytics tool. This can
go round and round many times.
Computational intelligent data analysis computerizes this iterative data analysis process by removing tedious and mindless calculations, but it also goes beyond this scope. It aims
to design algorithms to solve increasingly complex data analysis prob-
lems in changing environments. It is the study of adaptive mechanisms
to enable or facilitate the data analysis process in complex and changing
environments. These mechanisms include paradigms that exhibit an abil-
ity to learn or adapt to new situations, to generalize, abstract, discover, and
associate [9]. In simple terms, it is the study of how to make computers do
things usually associated with human excellence [10]. To a great extent,
the ambition of totally autonomous data mining has now been abandoned
[11]. Computational intelligent data analysis does not completely replace
a human being.
New computing and data storage technologies not only enhance tradi-
tional data analysis tools, but also change the landscape of data analysis
by raising new challenges. Because computer techniques can store and process the latest data and information with millisecond delays, computational intelligent data analysis addresses a new problem: how do we efficiently and correctly incorporate the latest data, update models, adapt to new circumstances, and ultimately provide sufficient evidence to make timely and correct decisions?
Another motivation for computational intelligent data analysis has been
the inability of conventional analytical tools to handle, within reasonable
time limits, the quantities of data that are now being stored. Computational
intelligent data analysis is thus being seen as a useful method of providing
some measure of automated insight into the data being collected. However,
it has become apparent that while some useful patterns can be discovered
and the discipline has had a number of notable successes, the potential
for either logical or statistical error is extremely high. As a result, much
of the computational intelligent data analysis is, at best, a set of suggested
topics for further investigation [12]. The high unreliability of the results is
a major concern to many applications—for example, financial investment.
In these fields, the decision must be made prudently and transparently to
avoid any catastrophe. Trial and error is unacceptable.
Computational intelligent data analysis has its origins in statistics and
machine learning. As a study, it is not a haphazard application of statisti-
cal and machine learning tools, and not a random walk through the space
of analytic techniques, but rather a carefully planned and considered pro-
cess of deciding what will be most useful and revealing. The process of
data analysis can be considered as having two distinct forms: (1) the ana-
lytical (or modeling) approach in which the real world is modeled in a
mathematical manner, from which predictions can be computed in some
way, and (2) pattern matching, or the inductive approach in which predic-
tion is made based on experience [12].
Whereas statistical induction starts with the latter and aims to translate the process into one that is predominantly the former, machine learning largely takes place in the latter. Given a set of observations, machine-
learning algorithms form the null hypothesis space and search this large
hypothesis space to find the optimal hypothesis. This is as close as one can
get to achieving the underlying true hypothesis, which may or may not be
in the hypothesis space. The fallacy of induction comes into play when the
hypothesis developed from the observations resides in a different part of
the space from the true solution and yet it is not contradicted by the avail-
able data. For example, we could build a regression model to relate one
variable to several potential explanatory variables, and perhaps obtain a
very accurate predictive model without having any claim or belief that the
model in any way represented the causal mechanism.
In most machine-learning algorithms, the null hypothesis space is infinite. Machine learning is therefore essentially the task of searching this space efficiently to find the part of it that not only best fits the observations, but also makes correct predictions for new observations. The latter is critical in achieving the balance between bias and variance. In terms of mathematical optimization, it is important to find the global optimum instead of a local optimum. However, the design of the search criteria is even more important than the optimization process: the criteria directly determine whether the final solution achieves the balance of bias and variance needed to avoid overfitting.
The application of machine-learning methods to large databases is called
data mining or knowledge discovery [13].
In the computational intelligent data analysis community, the ana-
lyst works via more complex and sophisticated, even semiautomatic,
data analysis tools. Given a clearly defined criterion (e.g., sum of squared
errors), one can let the computer conduct a much larger search than could
have been conducted manually. The program has become a key part of the
analysis and has moved the analyst’s capabilities into realms that would be
impossible unaided. However, one challenge is to find the clearly defined
criteria—sometimes not one but a set—to represent the aim of the analy-
sis. The perspective from which the analyst instructs a program to go and
do the work is essentially a machine-learning perspective.
Machine-learning algorithms are critical to a range of technologies,
including Web search, recommendation systems, personalized Internet
advertising, computer vision, and natural language processing. Machine
learning has also made significant impacts on the natural sciences, for
example, biology; the interdisciplinary field of bioinformatics has facili-
tated many discoveries in genomics and proteomics [14].
In contrast, modern statistics is almost entirely driven by the notions
of hypotheses and models. Prior knowledge is often required to specify a
set of null hypotheses and alternative hypotheses and the structure of
a model. The data is then used to refute a hypothesis, to improve the model to better reflect the target problem, and to estimate the coefficients via the calibration process. The nature of machine learning is that
illustration of the effort that is necessary to ensure good and accurate data
so that effective answers can be obtained in data analysis. However, no
matter what analysis tool is being used, the principle is the same: garbage
in, garbage out. In many cases, a large dataset is one that has many cases
or records. Sometimes, however, the word “large” can refer to the num-
ber of variables describing each record. The latter introduces the curse of
dimensionality [16]. Artificial neural networks and kernel methods map input data into a high-dimensional feature space to convert a nonlinear problem into a linear approximation. The price of such a mapping is
the curse of dimensionality. Dimension reduction is a set of techniques to
combine or transform the input data to eliminate less important features.
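As a small illustration of dimension reduction, the following Python sketch (toy data and parameter choices invented) applies principal component analysis, one common such technique, to combine 50 input variables into 5 derived features:

```python
import numpy as np

# Toy dataset: 200 records, each described by 50 variables (invented data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Principal component analysis via the singular value decomposition:
# centre the data, then find the directions of largest variance.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Keep only the k most important directions; each record is now described
# by k derived features instead of the 50 original variables.
k = 5
X_reduced = Xc @ Vt[:k].T   # shape (200, 5)
```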
In scientific communities, the word “experiment” describes an investi-
gation in which (some of) the potentially influential variables can be con-
trolled. Typically, it is not possible to control all the potentially influential
variables. To overcome this, subjects (or objects) are randomly assigned to
the classes defined by those variables that one wishes to control.
In summary, there are two main players in computational intelligent data analysis: (1) computational intelligent data analysis techniques and (2) human analysts. The techniques must be efficient enough to process large amounts of data, accurate enough to find the best solution, and reliable enough to resist noise. Human analysts need to understand the problems, define the criteria, choose the right techniques, and analyze the outcomes.
continuous, rapid data records. Also, due to the data collection process,
multifrequency time series often coexist, such as daily, weekly, monthly,
and yearly time series. It is relatively easy to aggregate high-frequency time series into lower-frequency ones, such as monthly data from daily data. However, the reverse, disaggregating lower-frequency series into higher-frequency ones, is extremely difficult.
Finally, extreme events occur much more often than the normal distribution indicates. We have already witnessed the impact of extreme
cases (e.g., the “Global Financial Crisis” and natural disasters)—profound
enough to completely derail so-called normal life. The “black swan theory”
indicates that the frequency and impact of totally unexpected events are generally underestimated. Unfortunately, the disproportionate role of high-impact, hard-to-predict, rare events lies beyond the realm of normal expectations in history, science, finance, and technology. The probability of consequential rare events is small [23]. A significant proportion of sustainability analysis aims to predict, prevent, or recover from these kinds of extreme events, such as natural disasters or social unrest.
process over its entire life cycle. Environmental LCA is often thought of
as “cradle-to-grave” and therefore is the most complete accounting of the
environmental costs and benefits of a product or service [25]. For exam-
ple, the CO2 footprint of a paper shopping bag might be the sum of CO2
emitted by logging vehicles, the paper factory, the transport from the fac-
tory to the shop, and the decomposition of the bag after it has been used.
Among the various LCA methods, the Economic Input-Output Life-
Cycle Assessment (EIO-LCA) method uses information about industry
transactions—purchases of materials by one industry from other industries,
and the information about direct environmental emissions of industries—
to estimate the total emissions throughout the supply chain [25]. In the
EIO-LCA method, the input-output table acts as the key engine. The
input-output table is simply a matrix representing the intra-industry flows, the flows between industrial sectors and consumption, and the flows between the value-added section and the industrial sectors. As the economy constantly evolves, the input-output table must be updated at least annually to reflect the new circumstances. A typical input-output
table for the Australian economy is represented in the format of seven
2800-by-2800 matrices.
More than 100 countries worldwide regularly publish input-output
tables according to guidelines governed by the UN Department of
Economic and Social Affairs Statistics Division [26]. Unfortunately, in
most countries, including Australia, the input-output table is released
every three to four years, due to the large amounts of monetary and human
costs involved. The Integrated Sustainability Analysis (ISA) group at the
University of Sydney has developed large-scale computational modeling
methods comprehensively covering the process of estimating and updat-
ing the input-output tables for different levels of economy, and following
reporting phases based on the estimated input-output tables.
Globalization, combined with an increasing human population and
increased consumption, means that the ecological footprint is now fall-
ing more widely and more heavily across the planet. The environmental
impact is increasing, and lengthening supply chains mean that consumers
are often far removed from the impacts they drive. To manage and reduce
our footprint, we must first be able to measure it. The ISA group calculated
the net trade balances of 187 countries in terms of implicated commodi-
ties to construct a high-resolution global trade input-output table. Using a
high-resolution global trade input-output table, they traced the implicated
commodities from the country of their production, often through several
Nathan Eagle uses mobile phone data to gain an in-depth understanding of slum-dweller populations and to derive quantitative measures of slum dynamics [38]. Because slums are informally established,
unplanned, and unrecognized by the government, scientists have a very
limited understanding of the 200,000 slums worldwide and the billion
individuals living in them. Chris Barrett of the ICS has studied the socio-
economic interrelationship between poverty, food security, and environ-
mental stress in Africa, particularly the links between resource dynamics
and the poverty trap in smallholder agrarian systems [39].
Chapter 11 gives an excellent example of how temporal and spatial
data analysis tools are used to gain insight into behavioral data and address important social problems such as crime. Over the past several
years, significant amounts of data have been collected from all aspects of
human society. As Chapter 11 reveals, there were nearly 1 million theft-related crimes from 1991 through 1999, and data on 200,000 reported
crimes within the City of Philadelphia were collected. This amount of data
was unimaginable before the invention of digital data storage systems.
Regional planning is the science of efficient placement of activities and
infrastructures for the sustainable growth of a region [40]. In drawing up a regional plan, policy makers must take into account the impact on the environment, the economy, and society. Chapter 12 presents an application of
Constraint Logic Programming (CLP) to a planning problem, the envi-
ronmental and social impact assessment of the regional energy plan of the
Emilia-Romagna region of Italy.
Chapter 2
Tracing Embodied
CO2 in Trade Using
High-Resolution
Input–Output Tables
Daniel Moran and Arne Geschke
CONTENTS
2.1 Summary
2.2 Structure of IO Tables
2.3 Populating the MRIO
2.3.1 Data Processing Language, Distributed Database Server
2.4 Reconciling Conflicting Constraints: Eora as a Solution of a Constrained Optimization Problem
2.4.1 Reliability Information
2.4.2 The Concept of Constrained Optimization
2.4.3 Mathematical Interpretation of an MRIO
2.4.4 Summarizing the Balancing Constraints in a Single Matrix Equation
2.4.5 Formulating the Constrained Optimization Problem
2.5 The Leontief Inverse
2.6 Applications of the Eora MRIO
References
2.1 SUMMARY
Input-output (IO) tables document all the flows between sectors of an
economy. IO tables are useful in sustainability applications for tracing
connections between consumers and upstream environmental or social
harms in secondary and primary industries. IO tables have historically
been difficult to construct because they are so data intensive. The Eora
multi-region IO (MRIO) table is a new high-resolution table that records
the bilateral flows between 15,000 sectors in 187 countries. Such a com-
prehensive, high-resolution model is advantageous for analysis and imple-
mentation of sustainability policy. This chapter provides an overview of
how the Eora IO table was built. A custom data processing language was
developed to read, aggregate, disaggregate, and translate raw data from a
number of government sources into a harmonized tensor. These raw data
often conflict and do not result in a balanced matrix. A custom optimiza-
tion algorithm was created to reconcile conflicting data and balance the
table. Building and balancing the Eora MRIO is computationally intensive: it requires approximately 20 hours of compute time per data year on a cluster with 66 cores, 600 GB of RAM, and 15 TB of storage. We con-
clude by summarizing some sustainability applications of high-resolution
MRIO tables, such as calculating carbon footprints.
Globalization combined with an increasing human population and
increased consumption means that our ecological footprint is now fall-
ing more widely and more heavily across the planet. Our environmental
impact is increasing. And lengthening supply chains mean that consum-
ers are often far removed from the impacts they drive. In order to manage
and reduce our footprint we must first be able to measure it. This is the
aim of environmentally extended input-output analysis.
We want to identify which countries and sectors are directly caus-
ing, and which are ultimately benefiting from, environmental harms.
We want to link individuals, households, and companies to the upstream
environmental harms they ultimately drive. This principle is called
consumer responsibility: the idea that consumers, not producers, should be
2.2 STRUCTURE OF IO TABLES
This section provides a brief overview of the structure of IO tables for
readers unfamiliar with input-output analysis. To learn more, the authori-
tative text by Miller and Blair (2009) is recommended.
The elements of an IO matrix are the sum of sales from one sector to
another. Each row in the matrix represents the goods and services pro-
duced by one sector, that is, that sector’s output. The columns along that
row are the various sectors that consume those products. Each element in
the matrix thus represents the total value of transactions between sector A
and sector B. In a single-country IO table, the rows and columns represent
sectors within that economy.
Figure 2.1 shows the layout of a single-country IO table and illustrates
the element of the transaction matrix recording the total value of Steel sec-
tor inputs to the Vehicles Production sector. This transactions matrix T is
the main block of an IO table. Appended to it on the right is another block
of columns Y representing final consumption by households, government,
inventories, and so on. Also added below is a block of rows V for so-called
“primary inputs.” The most important of these is value added. By treating
value added as an input to production, a sector can sell its output for more
than the sum of its other inputs.
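To make this layout concrete, the following Python sketch builds a hypothetical two-sector table (all numbers invented) and checks the balancing property that Section 2.4 later enforces as a constraint, namely that each sector's total output equals its total input:

```python
import numpy as np

# Hypothetical two-sector economy (Steel, Vehicles); all values invented.
# T[i, j] = value of sector i's output used as an input by sector j.
T = np.array([[10.0, 40.0],    # Steel    -> Steel, Vehicles
              [ 5.0, 15.0]])   # Vehicles -> Steel, Vehicles

y = np.array([30.0, 80.0])     # final consumption (households, government, ...)
v = np.array([65.0, 45.0])     # value added ("primary inputs")

# Total output of each sector, read along its row: intermediate plus final use.
x_out = T.sum(axis=1) + y      # [ 80., 100.]

# Total input of each sector, read down its column: purchases plus value added.
x_in = T.sum(axis=0) + v       # [ 80., 100.]

# In a balanced table the two agree for every sector.
assert np.allclose(x_out, x_in)
```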
In a multi-region IO table, the sectors represent each country’s sectors.
For example, instead of one Steel sector, Japanese Steel and Australian Steel
are two separate rows and two columns. This is illustrated in Figure 2.2. It
is possible to construct an MRIO where the regions are not countries but
FIGURE 2.1 A basic IO table with annotation showing where in the transactions matrix the total value of outputs from the Steel sector to the Vehicle Production sector is recorded.
FIGURE 2.2 A two-country MRIO. The transactions tjj, tja, taj, and taa record the value of steel used in domestic vehicle industries and exported to another country's vehicle industry.
FIGURE 2.3 Diagram of an MRIO with satellite accounts for threatened species.
FIGURE 2.4 Heatmap of Eora and zoom-in on the Korean IO table. Each pixel represents one element in the transaction matrix, with darker cells representing larger transaction values; blocks along the diagonal correspond to domestic activity and off-diagonal blocks to international trade.
FIGURE 2.6 Rows and columns on the sheets follow a tree structure hierarchy, with industries and commodities nested under each country, alongside final demand categories such as households, government, inventories, and other.
We face two challenges. The first is that the source data we want to use
exists in a variety of formats and aggregation levels. We have to aggregate,
disaggregate, and reclassify these input datasets so that each nation and its
trading partner use the same classification scheme. The second challenge
is more substantial: the raw data inputs often conflict.
The first challenge to building an MRIO is harmonizing all the input
data so they may be combined in one table. A primary goal of the Eora
project is to incorporate as many available economic and environmental
data as possible. These data cover a wide range of aggregation levels, for-
mats, and classification schemes, and each source must be translated into the common classification used in the Eora table.
We created a data processing language to help with this task. This lan-
guage assists with aggregation and disaggregation and facilitates address-
ing and populating areas of the tensor. A disaggregation command in the
processing language could specify a total value of steel sector sales that
should be allocated proportionally, or equally, among a number of metal
manufacturing industries.
Tightly integrated with the language is a large library of correspon-
dence matrices. Correspondence matrices contain a weighted mapping
that reallocates values in a source vector into a different destination vec-
tor. A correspondence matrix maps a source vector of length N to a des-
tination vector of length M using an N × M matrix. Each row contains
a set of weights that sum to 1. The first element of the source vector is
allocated to the destination vector using the weights specified in the first
row, the second element added to the destination vector using the weights
of the second row, and so on. Correspondence matrices are an especially
convenient tool for aggregating and disaggregating data and projecting
data in one classification scheme into another scheme.
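A minimal Python sketch of this mechanism (sector names and weights invented for illustration):

```python
import numpy as np

# Hypothetical reclassification of 3 source sectors into 2 destination
# sectors. C[i, j] is the share of source sector i allocated to
# destination sector j; each row sums to 1.
C = np.array([[1.0, 0.0],    # "Iron & steel"   -> all to destination 0
              [0.7, 0.3],    # "Metal products" -> split 70/30
              [0.0, 1.0]])   # "Machinery"      -> all to destination 1

source = np.array([100.0, 50.0, 80.0])   # values in the source scheme

# Each source value is spread over the destinations by its row of weights.
dest = source @ C                         # [135., 95.]

# Total value is preserved because every row of C sums to 1.
assert np.isclose(source.sum(), dest.sum())
```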
For each data source (each table from all the agencies supplying data,
e.g., UN, national agencies, environmental agencies, and so on), we wrote a
processing language script to populate the tensor with the contents of that
source. The scripts specify which concordance matrix/matrices to use to
translate the data into the classification used in the Eora table. The scripts
also specify the destination address in the Eora tensor where the reclassi-
fied data should be placed.
The data processing language interpreter is called AISHA (An Automated
System for Harmonised Accounts). AISHA is essentially a database server.
It reads data processing scripts in batch and populates the tensor. Our
implementation uses a distributed architecture. A number of workers pro-
cess these scripts autonomously, and then the master process aggregates
the results.
This would be the end of the story if data sources did not overlap and
conflict. AISHA does not actually populate a tensor but rather populates
an initial estimate and constraint matrix, which are run through an opti-
mization package to produce the final tensor.
To build Eora, the problem of generating a table that adheres to all specified constraints (balancing conditions, conflicting data, and others) was tackled in one step. Reconciling the data in order to fulfill any given external condition was achieved by interpreting this problem as a mathematically constrained optimization problem.
We begin the process of building Eora by constructing an initial ver-
sion of the table using raw data from a number of different sources. Let’s
call this version of MRIO the Raw Data Table. If all these different data
sources coincided, the MRIO would be finished. But most likely, the chal-
lenges described in the previous section will hold: the MRIO will violate
our specified constraints.
2.4.1 Reliability Information
In order to approach the problem, we first have to introduce reliability data.
The Maximum Entropy Principle introduced by Jaynes (1957) asserts that
an IO-transaction is a random variable with a best guess and an associ-
ated reliability. IO tables are assembled carefully by statistical bureaus but
still the individual inter-sectoral transaction values are not 100% certain.
Hence, both elements of the MRIO and external conditions are subject
to a certain, individual reliability. Additionally, the elements of an MRIO
can be assumed to be normally distributed and statistically independent
from one another. Therefore, each transaction can be interpreted as the
best guess of the corresponding variable, and the reliability is expressed in
the standard deviation of this random variable. A transaction value with
large reliability corresponds to a very small standard deviation; a less reli-
able transaction value corresponds to a larger standard deviation. Usually,
reliability information is not provided for every transaction of an MRIO.
In this case, the standard deviation values can be estimated. In general, it
can be assumed that larger values are far more reliable than smaller values.
A good approximation is to define the standard deviation of the largest
value in an MRIO to be 3% of its own absolute value, and 200% for the
smallest value of the MRIO. The standard deviation of the remaining val-
ues can then be calculated by a logarithmic function whose coefficients
must be calculated by the user. Lenzen et al. (2010) give a detailed motiva-
tion and explanation of this strategy. Additionally the reliability of some
constraints might be known. For example, the balancing condition must
be exact; otherwise an MRIO cannot represent an economy appropriately
(see Leontief, 1986). This means that the corresponding standard deviation
of the balancing constraints is 0. Constraints with a standard deviation of
0 are called hard constraints. Other hard constraints include upper and
lower bounds, specifying that transactions may never be negative, or the
constraint that subsidies (recorded as negative values) may never be posi-
tive. Therefore, these constraints also have standard deviations of 0. Other
constraints like the previous example of the U.S. GDP might be less reli-
able and have positive standard deviation values. These are soft constraints.
Clearly, if the same piece of information from two different sources states
different facts, at least one of the two must be incorrect. In reality, this will
most likely hold for both. But one of the data sources might be more reli-
able than the other one. In this case, for example, the data for the total U.S.
GDP provided by a national statistical agency could be far more reliable
than what the UN reported. Hence, an external condition that is not 100%
reliable does not have to be 100% fulfilled by the MRIO.
This concept holds for the elements in the MRIO. Each transaction
value in the MRIO is subject to a certain reliability. That means that every
element in the Raw Data Table can be adjusted (within a certain range
determined by its reliability) and still represent the real transaction value
reasonably well.
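One plausible reading of the logarithmic estimation strategy described above is sketched in Python below, interpolating the relative standard deviation linearly in the logarithm of each entry's magnitude; the exact functional form used for Eora is given in Lenzen et al. (2010):

```python
import numpy as np

def estimate_sigma(values, rel_largest=0.03, rel_smallest=2.00):
    """Heuristic standard deviations for nonzero MRIO entries.

    Interpolates the relative standard deviation linearly in the
    logarithm of each entry's magnitude: 3% of the largest absolute
    value, 200% of the smallest. A sketch only; see Lenzen et al.
    (2010) for the formulation actually used.
    """
    mag = np.abs(values)
    lo, hi = np.log(mag.min()), np.log(mag.max())
    t = (hi - np.log(mag)) / (hi - lo)   # 0 for the largest, 1 for the smallest
    rel = rel_largest + t * (rel_smallest - rel_largest)
    return rel * mag                      # absolute standard deviations

sigma = estimate_sigma(np.array([1.0e6, 1.0e3, 10.0, 0.5]))
```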
Reliability data is usually not published by statistical agencies. In these
cases, the reliability information can be estimated. We use a variety of
heuristics to estimate the reliability of various datasets. Large transaction
values are typically well-measured and thus more reliable than small val-
ues (Lenzen et al., 2010). We assign a higher reliability to national statisti-
cal agency data than to UN data, and higher reliability to UN economic
data than to trade data. The published Eora results include the informa-
tion about our reliability estimates of every dataset. Because the final
MRIO table is a composite of these data, we also provide reliability figures
for the final table showing how reliable each value in the table is, based on
the combined reliability of the various input data informing that result.
Or in short: Find a Final MRIO that fulfills all external conditions while
minimally disturbing the Raw Data Table.
2.4.3 Mathematical Interpretation of an MRIO

Consider, for illustration, a small table with two sectors, a final demand column y, and a value-added row v:

$$T = \begin{pmatrix} t_{11} & t_{12} & y_1 \\ t_{21} & t_{22} & y_2 \\ v_1 & v_2 & \end{pmatrix}$$
In mathematical terminology, the Raw Data Table is called the initial esti-
mate. The term initial estimate sometimes causes confusion as the Raw
Data Table was sourced from an officially published dataset; hence it is not
an estimate at all. But from a mathematical point of view, the Final MRIO
is the solution of an optimization problem and the Raw Data Table is the
initial estimate of what the solution will be.
In order to fulfill the balancing condition, the sum over all elements of the first row of the table must equal the sum over all elements of the first column of the table. The same must hold for the second row and second column. The equations for the balancing constraints (or balancing condition) for the table T are given by

$$t_{11} + t_{12} + y_1 = t_{11} + t_{21} + v_1$$

$$t_{21} + t_{22} + y_2 = t_{12} + t_{22} + v_2$$

The diagonal elements t11 and t22 appear on both sides of these equations; hence they cancel each other out. The final equations for the balancing constraints are thus

$$t_{12} + y_1 - t_{21} - v_1 = 0 \qquad t_{21} + y_2 - t_{12} - v_2 = 0 \quad (2.1)$$
y1 + y 2 = a
In this case, we can be almost certain that the value a is not totally reli-
able. Hence, the standard deviation for the value a would not be equal to
zero, that is, σ > 0. This equation can be violated by the Final MRIO that is
to be computed. The acceptable amount of violation is determined by the
standard deviation σ of the value a.
2.4.4 Summarizing the Balancing Constraints in a Single Matrix Equation

To treat the table mathematically, it is vectorized by stacking its elements into a single vector a:

$$T = \begin{pmatrix} t_{11} & t_{12} & y_1 \\ t_{21} & t_{22} & y_2 \\ v_1 & v_2 & \end{pmatrix} \quad\text{becomes}\quad a = \begin{pmatrix} t_{11} \\ t_{12} \\ y_1 \\ t_{21} \\ t_{22} \\ y_2 \\ v_1 \\ v_2 \end{pmatrix}$$
The first balancing constraint,

$$t_{12} + y_1 - t_{21} - v_1 = 0,$$

can now be written as a vector-by-vector equation:

$$\underbrace{\begin{pmatrix} 0 & 1 & 1 & -1 & 0 & 0 & -1 & 0 \end{pmatrix}}_{=:g^T} \begin{pmatrix} t_{11} \\ t_{12} \\ y_1 \\ t_{21} \\ t_{22} \\ y_2 \\ v_1 \\ v_2 \end{pmatrix} = 0,$$

or, in short,

$$g^T a = 0 \quad (2.2)$$
The vector g is the so-called coefficients vector, holding the coefficients applied to the elements of the vector a to represent the balancing equation.
Every constraint can be formulated in the form of (Equation 2.2).
Hence, each constraint provides a constraint vector g. These constraint
vectors can now be summarized in a constraint matrix G. Every row of this constraint matrix represents one constraint. For the three constraint examples previously used in this section, the constraint system reads

$$\underbrace{\begin{pmatrix} 0 & 1 & 1 & -1 & 0 & 0 & -1 & 0 \\ 0 & -1 & 0 & 1 & 0 & 1 & 0 & -1 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \end{pmatrix}}_{=:G} \begin{pmatrix} t_{11} \\ t_{12} \\ y_1 \\ t_{21} \\ t_{22} \\ y_2 \\ v_1 \\ v_2 \end{pmatrix} = \underbrace{\begin{pmatrix} 0 \\ 0 \\ a \end{pmatrix}}_{=:c} \quad (2.3)$$

or, in short,

$$Ga = c$$
The corresponding reliability data for the vectorized MRIO a and the
right-hand side values c can be stored in two separate vectors σa and σc
that have the same sizes as the vectors whose reliability information they
hold, namely a and c.
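As an illustration, the following Python sketch assembles the coefficient matrix G and right-hand side c for the three example constraints and verifies them against a balanced two-sector table (numbers invented, matching the earlier sketch; the reported total of 110 is hypothetical):

```python
import numpy as np

# Vectorised table a = (t11, t12, y1, t21, t22, y2, v1, v2); the numbers
# reuse the balanced two-sector example from Section 2.2.
a = np.array([10.0, 40.0, 30.0, 5.0, 15.0, 80.0, 65.0, 45.0])

# One row of G per constraint: the two balancing constraints of
# Equation (2.1), plus the external condition y1 + y2 = a.
G = np.array([
    [0,  1, 1, -1, 0, 0, -1,  0],   # t12 + y1 - t21 - v1 = 0
    [0, -1, 0,  1, 0, 1,  0, -1],   # t21 + y2 - t12 - v2 = 0
    [0,  0, 1,  0, 0, 1,  0,  0],   # y1 + y2 = reported total
])
c = np.array([0.0, 0.0, 110.0])

assert np.allclose(G @ a, c)   # this table satisfies all three constraints
```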
2.4.5 Formulating the Constrained Optimization Problem

Find a Final MRIO that fulfills all external conditions while minimally disturbing the Raw Data Table.
With the concepts that were developed so far, the mathematical expres-
sion of the statement “MRIO that fulfills all external conditions” is simply
given by
$$Ga = c$$

The requirement of minimally disturbing the Raw Data Table a0 is captured by an objective function f(a0, a) measuring the distance between the initial estimate a0 and the final table a, giving the constrained optimization problem

$$\min_a f(a^0, a) \quad\text{subject to}\quad Ga = c$$
For the Eora model, the objective function was based on Byron's (1978) approach, which uses a quadratic Lagrange function of the form

$$f = (a^0 - a)^T \Sigma_a^{-1} (a^0 - a) + \lambda'(Ga - c)$$
where Σa denotes the diagonal matrix with the σa values of the vector a on its main diagonal. First-order conditions must then be applied in order to
find the Lagrange multiplier and then the Final MRIO.
To calculate the Lagrange multiplier, a matrix inversion must be performed, which would prove calculation intensive and possibly numerically unstable for a very large problem like the Eora problem.
To avoid the explicit calculation of the Lagrange multipliers and matrix
inversion, Van der Ploeg (1982) elegantly reformulates Byron’s approach
using the following ideas:
1. Reorder the rows of matrix G and the right-hand side vector c such
that all rows for hard constraints are at the top, followed by the rows
belonging to soft constraints. Let Ghard and Gsoft denote the block of
constraint lines that belongs to the hard constraints and the soft con-
straints, respectively. Then G and c can take the form
$$G = \begin{pmatrix} G_{\text{hard}} \\ G_{\text{soft}} \end{pmatrix} \quad\text{and}\quad c = \begin{pmatrix} c_{\text{hard}} \\ c_{\text{soft}} \end{pmatrix}$$
Because soft constraints are not completely reliable, they may be vio-
lated to some extent. These violations are taken care of by introducing
a disturbance εi for each soft constraint (note that there are no distur-
bances introduced for hard constraints, as they must be adhered to
exactly). Let ε be the vector of all disturbances; then the system of soft
constraints becomes

$$G_{\text{soft}}\, a + \varepsilon = c_{\text{soft}}$$
2. Defining

$$p = \begin{pmatrix} a \\ \varepsilon \end{pmatrix}$$

the system of equations for the soft constraints can then be rewritten as

$$\begin{pmatrix} G_{\text{soft}} & I \end{pmatrix} p = c_{\text{soft}}$$
The diagonal matrix of standard deviations for the vector p then becomes
∑a 0
∑p =
0 ∑ c
and the optimization problem reads

$$\min_p\; (p^0 - p)^T \Sigma_p^{-1} (p^0 - p) \quad\text{subject to}\quad Cp = c$$

where C denotes the full constraint matrix in the augmented variables (the hard-constraint rows of G padded with zeros, and the soft-constraint rows augmented by an identity block acting on the disturbances), and with the initial estimate
$$p^0 = \begin{pmatrix} a^0 \\ 0 \end{pmatrix}$$
The advantage of Van der Ploeg’s approach is that the reliability infor-
mation of the right-hand side vector c has shifted to the iterate, which is
now called p. The disadvantage is that the iterate (formerly a, now p) grows by as many variables as there are soft constraints; hence, the problem becomes significantly bigger. However, the number of constraints
remains the same. The solution of this problem is the Final MRIO and the
vector of disturbances of the soft constraints. The Final MRIO adheres to
all constraints and considers the reliability of the raw data and the con-
straints during the calculation process.
Often, certain values of the MRIO must stay within certain boundaries.
Transaction values of the basic price table, for example, must be positive
values. Values in the subsidies sheet can only be negative values. Hence,
each element pi can be subject to an upper or lower bound, that is, li ≤ pi ≤ ui. By allowing positive or negative infinity as a feasible value for the upper or lower bound, a bound equation can be formulated for each element pi.
The upper bound and lower bound values can be summarized in vectors
of equal size to the size of p. The boundary conditions for the whole MRIO
can then be summarized as
l≤p≤u
The complete optimization problem, including bounds, therefore reads

$$\min_p\; (p^0 - p)^T \Sigma_p^{-1} (p^0 - p) \quad\text{subject to}\quad Cp = c, \;\; l \le p \le u \quad (2.4)$$
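For small problems, Equation (2.4) can be handed directly to an off-the-shelf solver. The following Python sketch uses SciPy's SLSQP method purely to make the mathematics concrete; Eora itself relies on a custom large-scale algorithm, and all names and numbers here are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def reconcile(p0, sigma_p, G, c, lower, upper):
    """Solve min (p0-p)' diag(sigma_p)^-2 (p0-p) s.t. G p = c, l <= p <= u.

    Small-scale sketch of the reconciliation problem (Equation 2.4);
    not the algorithm used for Eora itself.
    """
    w = 1.0 / sigma_p ** 2                      # reliability weights
    objective = lambda p: np.sum(w * (p0 - p) ** 2)
    gradient = lambda p: -2.0 * w * (p0 - p)
    constraint = {"type": "eq", "fun": lambda p: G @ p - c,
                  "jac": lambda p: G}
    result = minimize(objective, p0, jac=gradient,
                      bounds=list(zip(lower, upper)),
                      constraints=[constraint], method="SLSQP")
    return result.x

# Two variables, one constraint p1 + p2 = 10; the second, less reliable
# entry absorbs most of the adjustment, so the solution is (2.8, 7.2).
p = reconcile(p0=np.array([3.0, 8.0]), sigma_p=np.array([1.0, 2.0]),
              G=np.array([[1.0, 1.0]]), c=np.array([10.0]),
              lower=[0.0, 0.0], upper=[np.inf, np.inf])
```

The weighting by inverse variance is what makes unreliable entries bear most of the adjustment, which is exactly the behavior the reliability framework above is designed to produce.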
Finally, the reliability of the result itself can be quantified. Because the constraints are linear, the standard deviations of the table elements and of the constraint values are related by

$$\sum_j \big(g_{ij}\,\sigma_{p_j}\big)^2 = \sigma_{c_i}^2$$

The standard deviations σp0 of the initial estimate were part of the input to the optimization process and are therefore known. The standard deviations σp of the final table are unknown. However, the shift that each element of the MRIO experiences during the optimization process to obtain p from p0 is known. Using the distance vector p − p0 as the initial guess for the standard deviations σp, the algorithm SDRAS can solve this underdetermined problem for the reliability of the final table.
2.5 THE LEONTIEF INVERSE

The total (direct and indirect) requirements needed to satisfy final demand are given by the power series of the technical coefficients matrix A:

$$L = I + A + A^2 + A^3 + \cdots = \sum_{n=0}^{\infty} A^n$$

which converges to the Leontief inverse

$$L = (I - A)^{-1}$$
Each element of the Leontief inverse L thus reports the total quantity of inputs required to produce one unit of output. The inputs can be weighted by their environmental load, using the satellite indicators, in order to find the environmentally weighted total input required for each unit of product produced.
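A minimal numerical sketch (invented two-sector coefficients and satellite data) of computing the Leontief inverse and an environmentally weighted footprint:

```python
import numpy as np

# Invented two-sector technical coefficients (inputs per unit of output).
A = np.array([[0.10, 0.40],
              [0.05, 0.15]])

L = np.linalg.inv(np.eye(2) - A)       # Leontief inverse, L = (I - A)^-1

# Satellite indicator: e.g., tonnes of CO2 emitted per unit of output.
q = np.array([2.0, 0.5])

# Environmentally weighted total inputs per unit of final product:
# every upstream sector's direct emissions, weighted by total requirements.
footprint_intensity = q @ L

# Total footprint of a final demand bundle y.
y = np.array([30.0, 80.0])
total_footprint = footprint_intensity @ y
```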
Structural Path Analysis (SPA) can be used to selectively perform this
series expansion on individual elements and trace out supply chains. SPA
is commonly used to search the top-ranked (largest flow) supply chain or
chains, starting or ending at sectors of interest. The idea of SPA is not dif-
ficult but implementation is an art. SPA algorithms must essentially search a 15,000 × 15,000 = 225,000,000-branch tree whose leaf node values asymptot-
ically approach zero with depth. In the worst case, a single input could visit
every single sector in the world before being finally consumed. In practice,
evaluating each supply chain to 10 to 15 steps captures 99% of the chain’s
value. Still, intelligent heuristics for pruning and sorting are mandatory. The
Leontief inverse and SPA are used complementarily. Footprints calculated
using the Leontief inverse report the total footprint of products and sectors
and SPA algorithms search for the individual supply chains involved.
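The toy Python sketch below illustrates the recursive idea behind SPA, with a naive contribution threshold as the pruning heuristic; as noted above, a serious implementation needs far more intelligent pruning and sorting, and all numbers here are invented:

```python
import numpy as np

def spa(A, q, demand, path, depth, tol, out):
    """Recursively trace the supply chains feeding a quantity of final demand.

    Toy structural path analysis: the node at the head of `path` contributes
    its direct emission intensity times the demand propagated down to it;
    branches are pruned naively once a node's contribution drops below `tol`.
    """
    j = path[-1]
    contribution = q[j] * demand       # direct emissions at this node
    if contribution < tol or depth == 0:
        return
    out.append((tuple(path), contribution))
    for i in range(len(q)):            # one step upstream: inputs i -> j
        spa(A, q, A[i, j] * demand, path + [i], depth - 1, tol, out)

# Invented two-sector example: chains ending at sector 1's final demand.
A = np.array([[0.10, 0.40],
              [0.05, 0.15]])
q = np.array([2.0, 0.5])               # direct emissions per unit output
paths = []
spa(A, q, demand=80.0, path=[1], depth=8, tol=1e-3, out=paths)
paths.sort(key=lambda item: -item[1])  # largest supply-chain flows first
```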
2.6 APPLICATIONS OF THE EORA MRIO

The Eora MRIO includes satellite accounts for a range of environmental indicators:

• Energy
• GHG emissions
• Air pollutants
• Ecological footprint (hectares)
• Human appropriated net primary productivity (grams of carbon)
• Water use (liters)
• Material flow (kilograms of materials)
The Eora MRIO has also been used for studies on the footprint of biodi-
versity, linking 30,000 species threat records from the International Union
for Conservation of Nature (IUCN) Red List to production industries and
final consumers, and to study conflict mineral (coltan and rare earth met-
als) supply chains originating in Africa and Asia.
The power of IO analysis to distinguish and link producers, supply
chains, and consumers makes it useful for developing sustainability poli-
cies. IO analysis can quantify the links between producers and consumers,
and systematically identify and quantify supply chains of interest. This
data-rich resource can be used to inform sustainability policies for pro-
ducers, traders, and consumers.
Most environmental legislation is currently designed to control the
footprint of production. The footprint of production can be constrained
with regulation requiring cleaner production, protection and conserva-
tion measures, better enforcement of existing legislation, and by buyers
demanding high environmental standards from suppliers.
Trade flows in environmentally deleterious products can be constrained.
For example, the Convention on International Trade in Endangered
Species of Wild Fauna and Flora (CITES) broadly restricts any trade
in endangered species and derived products. Proposed carbon taxes
function similarly, effectively restricting trade of an undesirable good.
REFERENCES
Byron, R. 1978. The estimation of large social account matrices. Journal of the Royal Statistical Society Series A, 141(3): 359–367.
Jaynes, E.T. 1957. Information theory and statistical mechanics. Physical Review, 106: 620–630.
Lenzen, M., R. Wood, and T. Wiedmann. 2010. Uncertainty analysis for Multi-Region Input-Output models—A case study of the UK's carbon footprint. Economic Systems Research, 22: 43–63.
Leontief, W. 1986. Input-Output Economics. New York: Oxford University Press.
Miller, R.E. and P.D. Blair. 2009. Input-Output Analysis: Foundations and Extensions, 2nd ed. Cambridge, UK: Cambridge University Press.
Van der Ploeg, F. 1982. Reliability and the adjustment of sequences of large economic accounting matrices. Journal of the Royal Statistical Society Series A, 145(2): 169–194.
Chapter 3
Aggregation Effects
in Carbon Footprint
Accounting Using
Multi-Region
Input–Output Analysis
Xin Zhou, Hiroaki Shirakawa,
and Manfred Lenzen
CONTENTS
3.1 Introduction
3.2 Test of Aggregation Effect
3.3 Results
3.3.1 Magnitude of Aggregation Effect
3.3.2 Factors Influencing the Aggregation Effect
3.4 Conclusions
Acknowledgments
References
Appendix 3A: Sector Classification in the AIO 2000 and GTAP6 Database
Appendix 3B: Ranking of Top 30 Aggregation Errors
3.1 INTRODUCTION
Hertwich, 2008; Hertwich and Peters, 2009; Zhou et al., 2010). For reviews
in this area, see Wiedmann (2009b) and Minx et al. (2012).
The MRIO model used is Asian Input-Output (AIO) Table 2000 (AIO,
2000), which is published by the Institute of Developing Economies (2006).
The AIO 2000 is a Chenery-Moses type of model (Chenery, 1953; Moses,
1955), including 76 sectors for ten Asian-Pacific economies (Indonesia,
Malaysia, the Philippines, Singapore, Thailand, Mainland China, Taiwan,
the Republic of Korea, Japan, and the United States). By aggregating
sectors randomly, we tested the magnitude of aggregation errors and ana-
lyzed them as random variables.
3.2 TEST OF AGGREGATION EFFECT

In an MRIO framework, the accounting balance for total outputs is

$$x_i^r = \sum_s \sum_j t_{ij}^{rs} + \sum_s f_i^{rs} \quad (3.1)$$

where r and s are regions, i and j are sectors, $x_i^r$ is sector i's output in region r, $t_{ij}^{rs}$ is sector i's output in region r that is used as input to sector j in region s (intermediate demand), and $f_i^{rs}$ is the final demand for product i in region s that is supplied by region r.
By defining the input coefficients $a_{ij}^{rs} = t_{ij}^{rs}/x_j^s$, Equation (3.1) for a k-region model can be expressed in matrix format as follows:

$$\begin{pmatrix} x^1 \\ x^2 \\ \vdots \\ x^k \end{pmatrix} = \begin{pmatrix} A^{11} & A^{12} & \cdots & A^{1k} \\ A^{21} & A^{22} & \cdots & A^{2k} \\ \vdots & \vdots & \ddots & \vdots \\ A^{k1} & A^{k2} & \cdots & A^{kk} \end{pmatrix} \begin{pmatrix} x^1 \\ x^2 \\ \vdots \\ x^k \end{pmatrix} + \begin{pmatrix} \sum_s f^{1s} \\ \sum_s f^{2s} \\ \vdots \\ \sum_s f^{ks} \end{pmatrix} \quad (3.2)$$
TABLE 3.1 Definitions

| Quantity | Unaggregated Model | Aggregated Model |
|---|---|---|
| Intermediate demand | $T = [T^{rs}] = [t_{ij}^{rs}]$ | $\bar{T} = [\bar{T}^{rs}] = [\bar{t}_{ij}^{rs}]$ |
| Final demand | $f = [f^{rs}] = [f_i^{rs}]$ | $\bar{f} = [\bar{f}^{rs}] = [\bar{f}_i^{rs}]$ |
| Total output | $x = \{x^r\} = \{x_i^r\}$ | $\bar{x} = \{\bar{x}^r\} = \{\bar{x}_i^r\}$ |
| Input coefficients | $A = [A^{rs}] = [a_{ij}^{rs}]$ | $\bar{A} = [\bar{A}^{rs}] = [\bar{a}_{ij}^{rs}]$ |
| Leontief inverse | $L = [L^{rs}] = [l_{ij}^{rs}]$ | $\bar{L} = [\bar{L}^{rs}] = [\bar{l}_{ij}^{rs}]$ |
| Carbon intensity (emissions per unit output) | $c = \{c^r\} = \{c_i^r\}$ | $\bar{c} = \{\bar{c}^r\} = \{\bar{c}_i^r\}$ |
| Carbon footprint | $w = \{w^s\} = \{w_j^s\}$ | $\bar{w} = \{\bar{w}^s\} = \{\bar{w}_j^s\}$ |
In compact form, the solution is

$$x = (I - A)^{-1} f \quad (3.3)$$
Define

$$m = \sum_{r=1}^{k} n(r) \le nk$$
as the size of the aggregate model, where n(r) is the number of sectors of region r after aggregation. The intermediate demand after aggregation, $\bar{T}$, is an m × m square matrix. Each block on the diagonal, $\bar{T}^{rr}$, is an n(r) × n(r) square matrix. The off-diagonal blocks are in general rectangular, depending on the sizes of the supply region r and the demand region s after aggregation. For example, the size of $\bar{T}^{rs}$ is n(r) × n(s).
We define the aggregation matrix as follows:

$$Z = \begin{pmatrix} z(1) & 0 & \cdots & 0 \\ 0 & z(2) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & z(k) \end{pmatrix}$$
$$Z T Z' = \bar{T} \quad (3.4)$$

$$Z x = \bar{x} \quad (3.5)$$

$$Z F = \bar{F} \quad (3.6)$$
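As a toy Python illustration (sizes and flows invented), the aggregation matrix Z can be assembled block-diagonally from per-region matrices z(r) and applied as in Equations (3.4) and (3.5):

```python
import numpy as np
from scipy.linalg import block_diag

# Region 1 (3 sectors) merges its sectors {0, 1}; region 2 (3 sectors)
# is left unaggregated. Entries of z(r) are 1 where an original sector
# maps into an aggregate sector.
z1 = np.array([[1.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])       # n(1) = 2
z2 = np.eye(3)                          # n(2) = 3

Z = block_diag(z1, z2)                  # the aggregation matrix, 5 x 6

# Invented two-region flows and total outputs for the 6 original sectors.
T = np.arange(36, dtype=float).reshape(6, 6)
x = np.full(6, 100.0)

T_bar = Z @ T @ Z.T                     # aggregated intermediate demand (3.4)
x_bar = Z @ x                           # aggregated total output        (3.5)
A_bar = T_bar / x_bar                   # input coefficients: column j / x_bar[j]
```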
Based on Equation (3.3), we have the following relations for the unaggregated model:

$$x = (I - A)^{-1} f = Lf \quad (3.8)$$

$$C \otimes x = C \otimes Lf \quad (3.9)$$
$$w_j^s = \sum_q \Big( \sum_r \sum_i c_i^r\, l_{ij}^{rq} \Big) f_j^{qs} \quad (3.10)$$
For the aggregate model, there are two alternative ways for the calcula-
tion of CFs:
Procedure 1:
$$\bar{w} = Z w \quad (3.11)$$
where w is the vector of CFs calculated from the unaggregated model. Because all the information of the unaggregated system is known and Procedure 1 is only a summation of the relevant sectoral CFs in the original model, there is no aggregation bias. We define $\bar{w}$ as the "true" value.
Procedure 2:
In another procedure, we first aggregate the MRIO model based on
Equations (3.4) through (3.7). From the aggregated model, we then calcu-
late the CFs:
$$\bar{x} = (I - \bar{A})^{-1} \bar{f} = \bar{L}\bar{f} \quad (3.12)$$

$$\bar{C} \otimes \bar{x} = \bar{C} \otimes \bar{L}\bar{f} \quad (3.13)$$
$$\hat{w}_j^s = \sum_q \Big( \sum_r \sum_i \bar{c}_i^r\, \bar{l}_{ij}^{rq} \Big) \bar{f}_j^{qs} \quad (3.14)$$
where ŵ is the vector of CFs calculated from the aggregate model. Procedure 2 is the question at issue, in which aggregation bias will occur in the calculations.
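The following single-region Python sketch makes the two procedures and the resulting error rate concrete (all numbers invented; the chapter's actual test uses the ten-economy AIO 2000 with randomized aggregation schemes, and the output-weighted aggregation of carbon intensities below is one natural choice rather than necessarily the authors' exact formula):

```python
import numpy as np

def carbon_footprints(A, f, c):
    """Sectoral CFs for a single-region Leontief model: w_j = (c' L)_j f_j."""
    L = np.linalg.inv(np.eye(len(c)) - A)
    return (c @ L) * f

# Unaggregated toy model (invented numbers; the table is balanced so that
# x = A x + f holds exactly).
T = np.array([[2., 8., 1., 0.],
              [1., 3., 6., 2.],
              [0., 4., 2., 5.],
              [3., 1., 2., 2.]])
f = np.array([9., 8., 9., 12.])
x = T.sum(axis=1) + f                    # total outputs
A = T / x                                # input coefficients, column-wise
c = np.array([0.9, 0.1, 0.4, 0.2])       # emissions per unit output

z = np.array([[1., 1., 0., 0.],          # aggregate sectors {0,1} and {2,3}
              [0., 0., 1., 1.]])

w_true = z @ carbon_footprints(A, f, c)  # Procedure 1, Equation (3.11)

# Procedure 2: aggregate first, then compute footprints (3.12)-(3.14).
T_bar, x_bar, f_bar = z @ T @ z.T, z @ x, z @ f
A_bar = T_bar / x_bar
c_bar = (z @ (c * x)) / x_bar            # output-weighted carbon intensity
w_hat = carbon_footprints(A_bar, f_bar, c_bar)

error_rate = (w_hat - w_true) / w_true   # aggregation bias per aggregate sector
```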
3.3 RESULTS
To examine aggregation bias, we set aggregation schemes, determined
by Z, randomly using Monte Carlo simulation. The procedure includes
the following:
We first match sectors in the GTAP Database with those in the AIO 2000
(see Appendix 3A), based on which we recalculate the carbon intensity for
75 sectors.
[Figure: density distributions of the error rates (from −0.5 to 0.5) of aggregated versus unaggregated sectors, one panel per economy: Indonesia, China, Malaysia, Taiwan, the Philippines, Korea, Singapore, Japan, Thailand, and the USA.]

3.3.1 Magnitude of Aggregation Effect
For aggregated sectors, the error rates of samples showed a wide range,
from −479 (aggregation of three sectors in China, i.e., “crude petroleum
and natural gas,” “iron and steel,” and “finance and insurance”) to 166
(aggregation of two sectors, “paddy” and “synthetic resins and fiber” in
Thailand) (Table 3.2). The mean of error rates ranged from 0.029 (for
Korea) to 0.167 (for China). This indicates that, in general, aggregation of
the AIO 2000 will have moderate effects on CF accounting; however, the aggregation of some sectors can cause considerable bias. For unaggregated sectors, the error rates ranged from −0.876 (for Singapore) to more than 16 (for Indonesia) (Table 3.3), with mean levels between −2% and −5%. The confidence intervals showed that, for example, for Indonesia, 95% of the error rates fell in the interval between −0.055 and 0.465 for aggregated sectors.
3.4 CONCLUSIONS
The construction of MRIO tables is both time-consuming and expen-
sive. There are still very few MRIO tables available for practical applica-
tions. MRIO tables are generally constructed based on national IO tables
and bilateral trade data. Different countries have different key sectors and
national priorities that can influence sector classification and the level
of aggregation in their national IO tables. In constructing MRIO tables,
reclassification and aggregation are usually necessary to adjust the dif-
ferences among national IO tables. In the application of MRIO models
ACKNOWLEDGMENTS
Xin Zhou would like to thank the Ministry of the Environment, Japan, for
its financial support of the project on the Assessment of the Environmental,
Economic, and Social Impacts of Resource Circulation Systems in Asia,
under the Policy Study on Environmental Economics (PSEE). Hiroaki
Shirakawa would like to thank the Ministry of the Environment, Japan,
for supporting the project on Development and Practice of Advanced Basin
Model in Asia: Towards Adaptation of Climate Change (FY2011–FY2013)
under its Environment Research & Technology Development Fund (E-1104).
REFERENCES
Andrew, R., Peters, G.P., and Lennox, J., 2009. Approximation and regional aggregation in multi-regional input-output analysis for national carbon footprint accounting. Economic Systems Research, 21(3), 311–335.
Ara, K., 1959. The aggregation problem in input-output analysis. Econometrica, 27, 257–262.
Blair, P., and Miller, R.E., 1983. Spatial aggregation in multiregional input-output models. Environment and Planning A, 15, 187–206.
Chenery, H.B., 1953. Regional analysis. In Chenery, H.B., Clark, P.G., and Pinne, V.C., Eds., The Structure and Growth of the Italian Economy: 97–129. U.S. Mutual Security Agency, Rome.
Dimaranan, B.V., 2006. Global Trade, Assistance, and Production: GTAP 6 Data Base. Center for Global Trade Analysis, Purdue University, West Lafayette, IN.
Doeksen, G.A., and Little, C.H., 1968. Effect of size of the input-output model on the results of an impact analysis. Agricultural Economics Research, 20(4), 134–138.
Fisher, W.D., 1958. Criteria for aggregation in input-output analysis. The Review of Economics and Statistics, 40, 250–260.
Fisher, W.D., 1962. Optimal aggregation in multi-equation prediction models. Econometrica, 30, 744–769.
Fisher, W.D., 1966. Simplification in economic models. Econometrica, 34, 563–584.
Gibbons, J.C., Wolsky, A.M., and Tolley, G., 1982. Approximate aggregation and error in input-output models. Resources and Energy, 4, 203–230.
Hertwich, E.G., and Peters, G.P., 2009. Carbon footprint of nations: A global, trade-linked analysis. Environmental Science & Technology, 43, 6416–6420.
Hewings, G.J.D., 1974. The effect of aggregation on the empirical identification of key sectors in a regional economy: A partial evaluation of alternative techniques. Environment and Planning A, 6, 439–453.
[Table (continued): sector pairs with the largest aggregation-induced error rates; surviving entries cover the Philippines (Forestry, Crude petroleum and natural gas, Electronic computing equipment, Knitting, Non-ferrous metal, Glass and glass products), Japan (Crude petroleum and natural gas, Non-metallic ore and quarrying), and China (Fish products, Slaughtering, Other rubber products, Iron and steel).]
II
Computational Intelligent Data
Analysis for Climate Change
Chapter 4
Climate Informatics
Claire Monteleoni, Gavin A. Schmidt,
Francis Alexander, Alexandru Niculescu-Mizil,
Karsten Steinhaeuser, Michael Tippett,
Arindam Banerjee, M. Benno Blumenthal,
Auroop R. Ganguly, Jason E. Smerdon,
and Marco Tedesco
CONTENTS
4.1 Introduction
4.2 Machine Learning
4.3 Understanding and Using Climate Data
4.3.1 In-Situ Observations
4.3.2 Gridded/Processed Observations
4.3.3 Satellite Retrievals
4.3.4 Paleoclimate Proxies
4.3.5 Reanalysis Products
4.3.6 Global Climate Model (GCM) Output
4.3.7 Regional Climate Model (RCM) Output
4.4 Scientific Problems in Climate Informatics
4.4.1 Parameterization Development
4.4.2 Using Multimodel Ensembles of Climate Projections
4.4.3 Paleoreconstructions
4.4.4 Data Assimilation and Initialized Decadal Predictions
4.4.5 Developing and Understanding Perturbed Physics Ensembles (PPEs)
4.1 INTRODUCTION
4.2 MACHINE LEARNING
Over the past few decades, the field of machine learning has matured sig-
nificantly, drawing on ideas from several disciplines, including optimiza-
tion, statistics, and artificial intelligence [4, 34]. Application of machine
learning has led to important advances in a wide variety of domains rang-
ing from Internet applications to scientific problems. Machine learning
methods have been developed for a wide variety of predictive modeling
as well as exploratory data analysis problems. In the context of predictive
modeling, important advances have been made in linear classification and
regression, hierarchical linear models, nonlinear models based on kernels,
as well as ensemble methods that combine outputs from different predic-
tors. In the context of exploratory data analysis, advances have been made
in clustering and dimensionality reduction, including nonlinear methods
to detect low-dimensional manifold structures in the data. Some of the
important themes driving research in modern machine learning are moti-
vated by properties of modern datasets from scientific, societal, and com-
mercial applications. In particular, the datasets are extremely large scale,
running into millions or billions of data points; are high-dimensional,
going up to tens of thousands or more dimensions; and have intricate
statistical dependencies that violate the “independent and identically dis-
tributed” assumption made in traditional approaches. Such properties
are readily observed in climate datasets, including observations, reanaly-
sis, as well as climate model outputs. These aspects have led to increased
emphasis on scalable optimization methods [94], online learning methods
[11], and graphical models [47], which can handle large-scale data in high
dimensions with statistical dependencies.
4.3 UNDERSTANDING AND USING CLIMATE DATA
In this section, we describe the main classes of climate data and provide some suggestions on how they can be used. The discussion opens up some interesting problems. There are multiple sources of climate data,
ranging from single-site observations scattered in an unstructured way
across the globe to climate model output that is global and uniformly
gridded. Each class of data has particular characteristics that should be
appreciated before it can be successfully used or compared. We provide
here a brief introduction to each, with a few examples and references for
further information. Common issues that arise in cross-class syntheses
are also addressed.
4.3.2 Gridded/Processed Observations
Given a network of raw in-situ data, the next step is synthesizing those
networks into quality-controlled regularly gridded datasets. These have
a number of advantages over the raw data in that they are easier to work
with, and are more comparable to model output (discussed below).
4.3.3 Satellite Retrievals
Since 1979, global and near-global observations of the Earth’s climate have
been made from low-earth orbit and geostationary satellites. These obser-
vations are based either on passive radiances (emitted directly from the Earth, or via reflected solar radiation) or on active scanning via lasers or radars. These satellites are mainly operated by U.S. agencies (NOAA, NASA), the European Space Agency, and the Japanese program (JAXA), and data are generally available in near-real-time. There are a number of levels of data, ranging from raw radiances (Level 1), through processed data as a function of time (Level 2), to gridded averaged data at the global scale (Level 3).
Satellite products do have specific and particular views of the climate
system, which requires that knowledge of the “satellite-eye” view be incor-
porated into any comparison of satellite data with other products. Many
satellite products are available for specific instruments on specific plat-
forms; synthesis products across multiple instruments and multiple
platforms are possible, but remain rare.
4.3.4 Paleoclimate Proxies
In-situ instrumental data only extends on a global basis to the mid-19th
century, although individual records can extend to the 17th or 18th century.
For a longer term perspective, climate information must be extracted from
so-called “proxy” archives, such as ice cores, ocean mud, lake sediments,
tree rings, pollen records, caves, or corals, which retain information that is
sometimes highly correlated to specific climate variables or events [41].
As with satellite data, appropriate comparisons often require a for-
ward model of the process by which climate information is stored and that
incorporates the multiple variables that influence any particular proxy [75].
However, the often dramatically larger signals that can be found in past cli-
mates can overcome the increase in uncertainty due to spatial sparseness
and nonclimatic noise, especially when combined in a multi-proxy approach
[58]. Problems in paleoclimate are discussed in further detail in Section 4.8.
4.3.5 Reanalysis Products
Weather forecast models use as much observational data (in-situ, remote
sensing, etc.) as can be assimilated in producing 6-hour forecasts (the
“analyses”), which are excellent estimates of the state of the climate at any
one time. However, as models have improved over time, the time series of
weather forecasts can contain trends related only to the change in model
rather than changes in the real world. Thus, many of the weather forecast-
ing groups have undertaken “reanalyses” that use a fixed model to reprocess
data from the past in order to have a consistent view of the real world
(see reanalyses.org for more details). This is somewhat equivalent to a physics-
based interpolation of existing datasets and often provides the best estimate
of the climate state over the instrumental period (e.g., ERA-Interim [16]).
However, not all variables in the reanalyses are equally constrained
by observational data. Thus, sea-level pressure and winds are well char-
acterized, but precipitation, cloud fields, and surface fluxes are far more
model dependent and thus are not as reliable. Additionally, there remain
unphysical trends in the output as a function of changes in the observing
network over time. In particular, the onset of large-scale remote sensing in
1979 imparts jumps in many fields that can be confused with real climate
trends [105].
4.4.1 Parameterization Development
Climate models need to deal with the physics that occurs at scales smaller
than any finite model can resolve. This can involve cloud formation, tur-
bulence in the ocean, land surface heterogeneity, ice floe interactions,
chemistry on dust particle surfaces, etc. This is dealt with by using
parameterizations that attempt to capture the phenomenology of a spe-
cific process and its sensitivity in terms of the (resolved) large scales. This
is an ongoing task, and is currently driven mainly by scientists’ physical
intuition and relatively limited calibration data. As observational data
become more available, and direct numerical simulation of key pro-
cesses becomes more tractable, there is an increase in the potential for
machine learning and data mining techniques to help define new param-
eterizations and frameworks. For example, neural network frameworks
have been used to develop radiation models [50].
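As a rough illustration of this possibility, the following sketch in Python fits a small neural network to emulate a toy parameterization; the input variables, the synthetic "radiative heating" function, and all settings are illustrative assumptions, not the scheme of reference [50]:

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Toy resolved-scale inputs: temperature (K), specific humidity (kg/kg),
# and cloud fraction; the "true" sub-grid tendency below is synthetic.
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([rng.uniform(220, 310, n),
                     rng.uniform(0.0, 0.02, n),
                     rng.uniform(0.0, 1.0, n)])
y = 1e-4 * X[:, 0] ** 2 - 50.0 * X[:, 1] * X[:, 2] + rng.normal(0, 0.1, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
emulator = make_pipeline(StandardScaler(),
                         MLPRegressor(hidden_layer_sizes=(32, 32),
                                      max_iter=2000, random_state=0))
emulator.fit(X_tr, y_tr)
print("held-out R^2:", emulator.score(X_te, y_te))

Once trained against data from observations or high-resolution simulation, such an emulator can be called inside a model time step far more cheaply than the process it replaces.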
4.4.3 Paleoreconstructions
Understanding how climate varied in the past before the onset of wide-
spread instrumentation is of great interest—not least because the climate
changes seen in the paleo-record dwarf those seen in the 20th century and
hence may provide much insight into the significant changes expected this
century. Paleo-data is, however, even sparser than instrumental data and,
moreover, is not usually directly commensurate with the instrumental
record. As mentioned in Section 4.3, paleo-proxies (such as water isotopes,
tree rings, pollen counts, etc.) are indicators of climate change but often
have nonclimatic influences on their behavior, and their relation to what would be considered more standard variables (such as temperature or precipitation) may be nonstationary or convolved. There is an enormous challenge in bringing together disparate, multi-proxy evidence to produce large-scale patterns of climate change [59], or, from the other direction, in building enough "forward modeling" capability into the models to use the
proxies directly as modeling targets [76]. This topic is discussed in further
detail in Section 4.8.
4.4.5 Developing and Understanding Perturbed Physics Ensembles (PPEs)
Climate models contain many tunable parameters, and ensembles can be built from model variants that nonetheless sample a good deal of the intrinsic uncertainty that arises in choosing any specific set of parameter values. These "Perturbed Physics Ensembles" (PPEs) have been used successfully in the climateprediction.net and Quantifying Uncertainty in Model Predictions (QUMP) projects to generate controlled model ensembles that can be compared systematically to observed data to make inferences [46, 64]. Designing such experiments and efficiently analyzing the sometimes thousands of resulting simulations remains a challenge, but one that will increasingly be attempted.
4.5.1 Abrupt Changes
Earth system processes form a nonlinear dynamical system and, as a result,
changes in climate patterns can be abrupt at times [74]. Moreover, there
is some evidence, particularly in glacial conditions, that climate tends to
remain in relatively stable states for some period of time, interrupted by
sporadic transitions (perhaps associated with so-called tipping points) that
delineate different climate regimes. Understanding the causes behind sig-
nificant abrupt changes in climate patterns can provide a deeper under-
standing of the complex interactions between Earth system processes. The
first step toward realizing this goal is to have the ability to detect and iden-
tify abrupt changes from climate data.
Machine learning methods for detecting abrupt changes, such as
extensive droughts that last for multiple years over a large region, should
have the ability to detect changes with spatial and temporal persistence,
and should be scalable to large datasets. Such methods should be able to
detect well-known droughts such as the Sahel drought in Africa, the 1930s
Dust Bowl in the United States, and droughts with similar characteristics
where the climatic conditions were radically changed for a period of time
over an extended region [23, 37, 78, 113]. A simple approach for detecting
droughts is to apply a suitable threshold to a pertinent climate variable,
such as precipitation or soil moisture content, and label low-precipitation
regions as droughts. While such an approach will detect major events
like the Sahel drought and dust bowls, it will also detect isolated events,
such as low precipitation in one month for a single location that is clearly
not an abrupt change event. Thus, the number of “false positives” from
such a simple approach would be high, making subsequent study of each
detected event difficult.
To identify drought regions that are spatially and temporally persistent,
one can consider a discrete graphical model that ensures spatiotemporal
smoothness of identified regions. Consider a discrete Markov Random
Field (MRF) with a node corresponding to each location at each time
step and a meaningful neighborhood structure that determines the edges
in the underlying graph G = (V,E) [111]. Each node can be in one of two
states: “normal” or “drought.” The maximum a posteriori (MAP) infer-
ence problem in the MRF can be posed as
x ∈{0 ,1} N
∑
x * = arg max θu ( xu ) +
x ∈V
∑
(u ,v ) ∈E
θuv ( xu , x v )
where $\theta_u, \theta_{uv}$ are node-wise and edge-wise potential functions that, respectively, encourage agreement with actual observations and agreement among neighbors; and $x_u$ is the state (i.e., "normal" or "drought") at node
u ∈ V. The MAP inference problem is an integer programming problem
often solved using a suitable linear programming (LP) relaxation [70, 111].
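As a concrete (if deliberately simplified) illustration, the following sketch in Python labels "drought" cells on a synthetic precipitation-anomaly grid using a binary grid MRF; iterated conditional modes (ICM) is used as a cheap stand-in for the LP relaxation cited above, and all potentials, constants, and the implanted dry region are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(1)
H, W = 40, 60
anom = rng.normal(0.0, 1.0, (H, W))
anom[10:20, 15:35] -= 2.0           # implant a coherent dry region

beta = 0.8                          # edge weight rewarding neighbor agreement
x = (anom < -1.0).astype(int)       # initialize by simple thresholding

def neighbor_states(x, i, j):
    vals = []
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < H and 0 <= nj < W:
            vals.append(x[ni, nj])
    return vals

for sweep in range(20):             # ICM sweeps until no label changes
    changed = 0
    for i in range(H):
        for j in range(W):
            nb = neighbor_states(x, i, j)
            # node potential: negative anomalies favor the "drought" state
            score_drought = -anom[i, j] + beta * sum(v == 1 for v in nb)
            score_normal = anom[i, j] + beta * sum(v == 0 for v in nb)
            new = int(score_drought > score_normal)
            changed += int(new != x[i, j])
            x[i, j] = new
    if changed == 0:
        break

print("cells labeled drought:", int(x.sum()))

Unlike per-cell thresholding, the edge terms suppress isolated detections, which is precisely the false-positive problem discussed above.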
Figure 4.1 shows results on drought detection over the past century
based on the MAP inference method. For the analysis, the Climatic Research
Unit (CRU) precipitation dataset was used at 0.5° × 0.5° latitude-longitude spatial resolution from 1901 to 2006. The LP involved approximately 7 million variables and was solved using efficient optimization techniques. The method detected almost all well-known droughts over the past century. More generally, such a method can be used to detect and study abrupt changes in a variety of settings, including heat waves and droughts.

FIGURE 4.1 (See color insert.) The drought regions detected by our algorithm. Each panel shows the drought starting from a particular decade: 1905–1920 (top left), 1921–1930 (top right), 1941–1950 (bottom left), and 1961–1970 (bottom right). The regions in black rectangles indicate the common droughts found by [63].
4.5.2 Climate Networks
Identifying dependencies between various climate variables and climate
processes forms a key part of understanding the global climate system. Such
dependencies can be represented as climate networks [19, 20, 106, 107], where
relevant variables or processes are represented as nodes and dependencies
are captured as edges between them. Climate networks are a rich represen-
tation for the complex processes underlying the global climate system, and
can be used to understand and explain observed phenomena [95, 108].
A key challenge in the context of climate networks is to construct such
networks from observed climate variables. From a statistical machine
learning perspective, the climate network should reflect suitable dependen-
cies captured by the joint distribution of the variables involved. Existing
methods usually focus on a suitable measure derived from the joint distri-
bution, such as the covariance or the mutual information. From a sample-
based estimate of the pairwise covariance or mutual information matrix,
one obtains the climate network by suitably thresholding the estimated
matrix. Such approaches have already shown great promise, often identify-
ing some key dependencies in the global climate system [43] (Figure 4.2).
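A minimal sketch of this construction, on synthetic series and with an arbitrary threshold, follows; real applications would use observed or reanalysis time series at grid points:

import numpy as np

rng = np.random.default_rng(2)
n_points, n_months = 200, 600
common = rng.normal(size=n_months)              # one shared mode of variability
series = 0.4 * common + rng.normal(size=(n_points, n_months))

corr = np.corrcoef(series)                      # pairwise correlation matrix
np.fill_diagonal(corr, 0.0)
adj = np.abs(corr) > 0.35                       # thresholding defines the edges

degree = adj.sum(axis=1)
print("mean node degree:", degree.mean())

The threshold choice is the key modeling decision: too low and spatial autocorrelation dominates the network; too high and genuine teleconnections are lost.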
Going forward, there are a number of other computational and algorith-
mic challenges that must be addressed to achieve more accurate representa-
tions of the global climate system. For instance, current network construction
methods do not account for the possibility of time-lagged correlations, yet
we know that such relationships exist. Similarly, temporal autocorrelations
and signals with varying amplitudes and phases are not explicitly handled.
There is also a need for better balancing of the dominating signal of spatial
autocorrelation with that of possible teleconnections (long-range dependen-
cies across regions), which are often of high interest. In addition, there are
many other processes that are well known and documented in the climate literature.
FIGURE 4.2 (See color insert.) Climate dipoles discovered from sea-level pres-
sure (reanalysis) data using graph-based analysis methods (see [42] for details).
A sparse group lasso regression, with one group of coefficients per spatial location, takes the form

$$\min_{\theta \in \mathbb{R}^{Nm}} \; \| y - X\theta \|_2^2 + \lambda_1 \|\theta\|_1 + \lambda_2 \sum_{g=1}^{N} \|\theta_g\|_2$$
where $y$ is the predictand vector, $X$ the matrix of candidate predictors, and $\theta_g$ the coefficient sub-vector for location (group) $g$. This combination of penalties ensures that only a few locations get non-zero weights and, even among these locations, only a few variables are selected. Figure 4.3 shows the locations and features that were consistently selected for the task of temperature prediction in Brazil.

[Figure 4.3 legend: t = air temperature, p = precipitation, r = relative humidity, h = horizontal wind speed, v = vertical wind speed, s = sea-level pressure.]
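The objective above can be minimized by proximal gradient descent, since the proximal operator of the combined penalty is an entrywise soft-threshold followed by a groupwise shrinkage. The sketch below, on synthetic data with illustrative λ values, is a minimal version of that idea and not the solver used for the Brazil study:

import numpy as np

rng = np.random.default_rng(3)
n_samples, n_groups, group_size = 200, 30, 6   # groups play the role of locations
p = n_groups * group_size
X = rng.normal(size=(n_samples, p))
theta_true = np.zeros(p)
theta_true[:group_size] = 2.0                  # only the first "location" matters
y = X @ theta_true + rng.normal(0.0, 0.5, n_samples)

lam1, lam2 = 0.5, 1.0
t = 1.0 / np.linalg.norm(X, 2) ** 2            # step size 1/L
theta = np.zeros(p)
for it in range(500):
    z = theta - t * (X.T @ (X @ theta - y))                  # gradient step
    z = np.sign(z) * np.maximum(np.abs(z) - t * lam1, 0.0)   # l1 prox
    z = z.reshape(n_groups, group_size)
    norms = np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1e-12)
    theta = (z * np.maximum(0.0, 1.0 - t * lam2 / norms)).ravel()  # group prox

selected = np.unique(np.nonzero(theta)[0] // group_size)
print("locations with non-zero weights:", selected)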
4.6.2 Data Challenges
Here we introduce some challenges posed by the available data. Data chal-
lenges are further discussed in Section 4.11. Serious constraints come from
the dimensions of the available data. Reliable climate observations often
do not extend more than 40 or 50 years into the past. This means that, for
example, there may be only 40 or 50 available observations of January–
March average precipitation. Moreover, the quality and completeness of
that data may vary in time and space. Climate forecasts from GCMs often
do not even cover this limited period. Many seasonal climate forecast
systems started hindcasts in the early 1980s when satellite observations,
particularly of SST, became available. In contrast to the sample size, the
dimension of the GCM state-space may be of the order of $10^6$, depending
on spatial grid resolution. Dimension reduction (principal component
analysis [PCA] is commonly used) is necessary before applying classical
methods like canonical correlation analysis to find associated features in
predictions and observations [5]. There has been some use of more sophis-
ticated dimensionality reduction methods in seasonal climate prediction
problems [53]. Methods that can handle large state-spaces and small sam-
ple size are needed. An intriguing recent approach that avoids the problem
of small sample size is to estimate statistical models using long climate
simulations unconstrained by observations and test the resulting model
on observations [18, 115]. This approach has the challenge of selecting
GCMs whose climate variability is “realistic,” which is a remarkably dif-
ficult problem given the observational record.
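The following sketch mimics that classical pipeline on synthetic data: a short record (~40 seasons) of high-dimensional "forecast" and "observed" fields is reduced with PCA before CCA extracts the coupled mode. Field sizes, component counts, and the shared signal are all illustrative assumptions:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(4)
n_years = 40
signal = rng.normal(size=n_years)              # one predictable mode
gcm = np.outer(signal, rng.normal(size=500)) + rng.normal(size=(n_years, 500))
obs = np.outer(signal, rng.normal(size=300)) + rng.normal(size=(n_years, 300))

gcm_pc = PCA(n_components=5).fit_transform(gcm)   # reduce before CCA
obs_pc = PCA(n_components=5).fit_transform(obs)

cca = CCA(n_components=1).fit(gcm_pc, obs_pc)
u, v = cca.transform(gcm_pc, obs_pc)
print("leading canonical correlation:", np.corrcoef(u[:, 0], v[:, 0])[0, 1])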
Seasonal prediction may also benefit from methods that incorporate optimal time and space filtering and that optimize more general measures of predictability.
While predicting the weather of an individual day is not possible in a
seasonal forecast, it may be possible to forecast statistics of weather such as
the frequency of dry days or the frequency of consecutive dry days. These
quantities are often more important to agriculture than seasonal totals.
Drought has a complex time-space structure that depends on multiple
meteorological variables. Data mining and machine learning (DM/ML)
methods can be applied to observations and forecasts to identify drought,
as discussed in Section 4.5.
Identification of previously unknown predictable climate features may
benefit from the use of DM/ML methods. Cluster analysis of tropical
cyclone tracks has been used to identify features that are associated with
ENSO and MJO variability [9]. Graphical models, the nonhomogeneous
Hidden Markov Model in particular, have been used to obtain stochastic
daily sequences of rainfall conditioned on GCM seasonal forecasts [32].
The time and space resolution of GCM forecasts limits the physi-
cal phenomena they can resolve. However, they may be able to predict
proxies or large-scale antecedents of relevant phenomena. For instance,
GCMs that do not fully resolve tropical cyclones (TCs) nonetheless form TC-like structures that can be used to make TC seasonal forecasts [8, 110]. Identifying and associating GCM "proxies" with observed phenomena is
also a DM/ML problem.
Regression methods are used to connect climate quantities to associ-
ated variables that are either unresolved by GCMs or not even climate
variables. For instance, Poisson regression is used to relate large-scale cli-
mate quantities with hurricanes [104], and generalized additive models
are used to relate heat waves with increased mortality [68]. Again, the
length of the observational record makes this challenging.
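A minimal sketch of such a regression follows, relating synthetic annual tropical-cyclone counts to two made-up large-scale covariates; the coefficients and the log-linear rate are assumptions for illustration only:

import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(5)
n_years = 60
enso = rng.normal(size=n_years)                # ENSO-like index
vort = rng.normal(size=n_years)                # vorticity-like index
rate = np.exp(1.5 + 0.4 * vort - 0.3 * enso)   # true conditional mean count
counts = rng.poisson(rate)

X = np.column_stack([enso, vort])
model = PoissonRegressor(alpha=0.0).fit(X, counts)
print("intercept:", model.intercept_, "coefficients:", model.coef_)

With only a few decades of counts, the uncertainty on such coefficients is substantial, echoing the record-length caveat above.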
The Common Era (CE) spans the rise and fall of many human civilizations, making paleoclimatic information during this time period important for understanding the complicated relationships between climate and organized societies [7, 15].
Given the broad utility and vast number of proxy systems that are
involved, the study of CE climate is a wide-ranging and diverse enterprise.
The purpose of the following discussion is not to survey this field as a whole, but to focus on a relatively recent pursuit in CE paleo-
climatology that seeks to reconstruct global or hemispheric temperatures
using syntheses of globally distributed multi-proxy networks. This par-
ticular problem is one that may lend itself well to new and emerging data
analysis techniques, including machine learning and data mining meth-
ods. The motivation of the following discussion therefore is to outline the
basic reconstruction problem and describe how methods are tested in syn-
thetic experiments.
FIGURE 4.4 (a) Representation of the global distribution of the most up-to-date
global multi-proxy network used by Mann et al. [58]. Grey squares indicate the 5°
grid cells that contain at least one proxy in the unscreened network from ref. [58].
(b) Schematic of the data matrix for temperature field reconstructions spanning
all or part of the CE. Grey regions in the data matrix are schematic representa-
tions of data availability in the instrumental temperature field and the multi-
proxy matrix. White regions indicate missing data in the various sections of the
data matrix.
4.8.2 Pseudoproxy Experiments
The literature is replete with discussions of the variously applied CFR
methods and their performance (see [29] for a cogent summary of many
employed methods). Given this large number of proposed approaches, it
has become important to establish means of comparing methods using
common datasets. An emerging tool for such comparisons is millennium-
length, forced transient simulations from Coupled General Circulation
Models (CGCMs) [1, 30]. These model simulations have been used as syn-
thetic climates in which to evaluate the performance of reconstruction
methods in tests that have been termed pseudoproxy experiments (see
[85] for a review). The motivation for pseudoproxy experiments is to adopt
a common framework that can be systematically altered and evaluated.
They also provide a much longer, albeit synthetic, validation period than
can be achieved with real-world data, and thus methodological evalua-
tions can extend to lower frequencies and longer timescales. Although one
must always be mindful of how the results translate into real-world impli-
cations, these design attributes allow researchers to test reconstruction
techniques beyond what was previously possible and to compare multiple
methods on common datasets.
The basic approach of a pseudoproxy experiment is to extract a por-
tion of a spatiotemporally complete CGCM field in a way that mimics the
available proxy and instrumental data used in real-world reconstructions.
The principal experimental steps proceed as follows: (1) pseudoinstru-
mental and pseudoproxy data are subsampled from the complete CGCM
field from locations and over temporal periods that approximate their
real-world data availability; (2) the time series that represent proxy infor-
mation are added to noise series to simulate the temporal (and in some
cases spatial) noise characteristics that are present in real-world proxy
networks; and (3) reconstruction algorithms are applied to the model-
sampled pseudo-instrumental data and pseudoproxy network to produce
a reconstruction of the climate simulated by the CGCM. The culminating
fourth step is to compare the derived reconstruction to the known model
target as a means of evaluating the skill of the applied method and the
uncertainties expected to accompany a real-world reconstruction product.
Multi-method comparisons can also be undertaken from this point.
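The four steps translate directly into code. The sketch below runs a deliberately small pseudoproxy experiment on a synthetic "model" field, with ridge regression standing in for a CFR method; the proxy count, noise level, and calibration window are illustrative assumptions, not those of any published experiment:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(6)
n_years, n_grid, n_proxies = 1000, 100, 15
field = rng.normal(size=(n_years, n_grid)).cumsum(axis=0) * 0.01  # slow variability
target = field.mean(axis=1)                       # "global mean" to reconstruct

# Step 1: subsample pseudoproxy locations; Step 2: add proxy noise.
proxy_idx = rng.choice(n_grid, n_proxies, replace=False)
signal = field[:, proxy_idx]
snr = 0.5
pseudoproxies = signal + rng.normal(0.0, signal.std(axis=0) / snr, signal.shape)

# Step 3: calibrate over the "instrumental" period, then reconstruct.
calib = slice(850, 1000)                          # last 150 "years"
cfr = Ridge(alpha=1.0).fit(pseudoproxies[calib], target[calib])
recon = cfr.predict(pseudoproxies)

# Step 4: score against the known model target outside calibration.
valid = slice(0, 850)
rmse = np.sqrt(np.mean((recon[valid] - target[valid]) ** 2))
print("validation RMSE:", rmse)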
Multiple datasets are publicly available for pseudoproxy experiments
through supplemental Websites of published papers [57, 87, 89, 103].
The Paleoclimate Reconstruction Challenge is also a newly established
online portal through the Paleoclimatology Division of the National
Oceanic and Atmospheric Administration that provides additional
pseudoproxy datasets.* This collection of common datasets is an important
resource for researchers wishing to propose new methodological applica-
tions for CFRs, and is an excellent starting point for these investigations.
* http://www.ncdc.noaa.gov/paleo/pubs/pr-challenge/pr-challenge.html.
Results of such studies are often quite sensitive to the predictors used, and not all predictors appear to be useful. Finally, Gifford [27] presents a detailed study of team learning, collaboration, and decision making applied to ice-penetrating radar data collected in Greenland in May 1999 and September 2007 as part of a model-creation effort for subglacial water presence classification.
The above-mentioned examples represent a few cases where machine
learning tools have been applied to problems focusing on studying the
polar regions. Although the number of studies appears to be increasing,
likely because of both the increased research focusing on climate change
and the poles and the increased computational power allowing machine
learning tools to expand in their usage, they are still relatively rare com-
pared to simpler but often less efficient techniques.
Machine learning and data mining can be used to enhance the value
of the data by exposing information that would not be apparent from
single-dataset analyses. For example, identifying the link between dimin-
ishing sea ice extent and increasing melting in Greenland can be done
through physical models attempting to model the connections between
the two through the exchange of atmospheric fluxes. However, large-scale
connections (or others at different temporal and spatial scales) might be
revealed through the use of data-driven models or, in a more sophisticated
fashion, through the combination of both physical and data-driven mod-
els. Such an approach would, among other things, overcome the limitation
of the physical models that, even if they represent the state-of-the-art in
the corresponding fields, are limited by our knowledge and understanding
of the physical processes. ANNs can be used not only to understand the connections among multiple parameters (through the analysis of the neuron connections), but also to understand potential temporal shifts in the importance of parameters on the overall process (e.g., increased
importance of albedo due to the exposure of bare ice and reduced solid
precipitation in Greenland over the past few years). Applications are not
limited to a pure scientific analysis but also include the management of
information, error analysis, missing linkages between databases, and
improving data acquisition procedures.
In synthesis, there are many areas in which machine learning can
support studies of the poles within the context of climate and climate
change. These include climate model parameterizations and multimodel
ensembles of projections for variables such as sea ice extent, melting in
Greenland, and sea-level rise contribution, in addition to those discussed
in previous sections.
The observing process itself can impart a bias or skewness to the observation relative to what the real world may nominally be doing. Examples in satellite remote sensing are
common—for example, a low cloud record from a satellite will only be
able to see low clouds when there are no high clouds. Similarly, a satellite
record of “mid-tropospheric” temperatures might actually be a weighted
integral of temperatures from the surface to the stratosphere. A paleo
climate record may be of a quantity that while related to temperature or
precipitation, may be a complex function of both, weighted towards a spe-
cific season. In all these cases, it is often advisable to create a ‘forward
model’ of the observational process itself to post-process the raw simula-
tion output to create more commensurate diagnostics.
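For the mid-tropospheric example, a forward model can be as simple as a vertical weighting function applied to the simulated temperature profile. The sketch below uses an assumed Gaussian weighting peaking near 500 hPa and a crude linear profile, both purely illustrative:

import numpy as np

pressure = np.linspace(1000.0, 100.0, 50)      # hPa, surface to stratosphere
temp = 288.0 - 0.065 * (1000.0 - pressure)     # crude temperature profile (K)

w = np.exp(-0.5 * ((pressure - 500.0) / 150.0) ** 2)   # assumed weighting function
w /= w.sum()

t_satellite_view = np.sum(w * temp)            # what the "satellite" would report
print("weighted mid-tropospheric temperature:", round(t_satellite_view, 2), "K")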
4.11.3.1 Data Scale
The size of datasets is rapidly outstripping the ability to store and serve the
data. We have difficulty storing even a single copy of the complete archive
of the CMIP3 model results; making complete copies of those results and distributing them for analysis is a large undertaking and limits the analysis to the few places that have data storage facilities of
that scale. Analysis done by the host prior to distribution, such as averag-
ing, reduces the size to something more manageable, but currently those
reductions are chosen far in advance, and there are many other useful
analyses that are not currently being done.
A cloud-based analysis framework would allow such reductions to be
chosen and still executed on machines with fast access to the data.
A full provenance network, linking raw data, intermediate products, and final results, would provide the basis of both reproducibility and communication of results. Provenance graphs provide the information neces-
sary to rerun a calculation and get the same results; they also provide the
basis of the full documentation of the results. This full network would
need to have layers of abstraction so that the user could start with an over-
all picture and then proceed to more detailed versions as needed.
4.12 CONCLUSION
The goal of this chapter is to inspire future work in the nascent field of
climate informatics. We hope to encourage work not only on some of the
challenge problems proposed here, but also on new problems. A profuse
amount of climate data of various types is available, providing a rich and
fertile playground for future machine learning and data mining research.
Even exploratory data analysis could prove useful for accelerating discov-
ery. To that end, we have prepared a climate informatics wiki as a result of
the First International Workshop on Climate Informatics, which includes
climate data links with descriptions, challenge problems, and tutorials on
machine learning techniques [14]. We are confident that there are myriad
collaborations possible at the intersection of climate science and machine
learning, data mining, and statistics. We hope our work will encourage
progress on a range of emerging problems in climate informatics.
ACKNOWLEDGMENTS
The First International Workshop on Climate Informatics (2011) served
as an inspiration for this chapter, and some of these topics were dis-
cussed there. The workshop sponsors included Lamont-Doherty Earth
Observatory (LDEO)/Goddard Institute for Space Studies (GISS) Climate
Center, Columbia University; Information Science and Technology
Center, Los Alamos National Laboratory; NEC Laboratories America,
Department of Statistics, Columbia University; Yahoo! Labs; and The New
York Academy of Sciences.
KS was supported in part by National Science Foundation (NSF) Grant
1029711. MKT and MBB are supported by a grant/cooperative agreement
from the National Oceanic and Atmospheric Administration (NOAA NA05OAR4311004). The views expressed herein are those of the authors
and do not necessarily reflect the views of NOAA or any of its subagencies.
AB was supported, in part, by NSF grants IIS-1029711, IIS-0916750, and
IIS-0812183, and NSF CAREER award IIS-0953274. ARG’s research
reported here has been financially supported by the Oak Ridge National
Laboratory and Northeastern University grants, as well as the National
Science Foundation award 1029166, in addition to funding from the U.S.
Department of Energy and the Department of Science and Technology
of the Government of India. The work of JES was supported in part by
NSF grant ATM0902436 and by NOAA grants NA07OAR4310060 and
NA10OAR4320137. MT would like to acknowledge NSF grant ARC
0909388. GAS is supported by the NASA Modeling and Analysis Program.
REFERENCES
1. C.M. Ammann, F. Joos, D.S. Schimel, B.L. Otto-Bliesner, and R.A. Tomas.
Solar influence on climate during the past millennium: Results from tran-
sient simulations with the NCAR Climate System Model. Proc. U.S. Natl.
Acad. Sci., 104(10): 3713–3718, 2007.
2. K.J. Anchukaitis, M.N. Evans, A. Kaplan, E.A. Vaganov, M.K. Hughes, and
H.D. Grissino-Mayer. Forward modeling of regional scale tree-ring pat-
terns in the southeastern United States and the recent influence of summer
drought. Geophys. Res. Lett., 33, L04705, DOI:10.1029/2005GL025050.
3. A.G. Barnston and T.M. Smith. Specification and prediction of global sur-
face temperature and precipitation from global SST using CCA. J. Climate, 9:
2660–2697, 1996.
4. C.M. Bishop. Pattern Recognition and Machine Learning. New York: Springer, 2006.
5. C.S. Bretherton, C. Smith, and J.M. Wallace. An intercomparison of methods
for finding coupled patterns in climate data. J. Climate, 5: 541–560, 1992.
6. P. Brohan, J.J. Kennedy, I. Harris, S.F.B. Tett, and P.D. Jones. Uncertainty esti-
mates in regional and global observed temperature changes: A new dataset
from 1850. J. Geophys. Res., 111, D12106, 2006.
7. B.M. Buckley, K.J. Anchukaitis, D. Penny et al. Climate as a contributing
factor in the demise of Angkor, Cambodia. Proc. Nat. Acad. Sci. USA, 107:
6748–6752, 2010.
8. S.J. Camargo and A.G. Barnston. Experimental seasonal dynamical forecasts
of tropical cyclone activity at IRI. Wea. Forecasting, 24: 472–491, 2009.
9. S.J. Camargo, A.W. Robertson, A.G. Barnston, and M. Ghil. Clustering of east-
ern North Pacific tropical cyclone tracks: ENSO and MJO effects. Geochem.
Geophys. and Geosys., 9:Q06V05, 2008. doi:10.1029/2007GC001861.
10. M.A. Cane, S.E. Zebiak, and S.C. Dolan. Experimental forecasts of El Niño.
Nature, 321: 827–832, 1986.
11. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge
(UK) and New York: Cambridge University Press, 2006.
12. V. Chandrasekaran, S. Sanghavi, P. Parrilo, and A. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2): 572–596, 2011.
61. G.A. Meehl, T.F. Stocker, W.D. Collins et al. Global climate projections.
Climate Change 2007: The Physical Science Basis. Contribution of Working
Group I to the Fourth Assessment Report of the Intergovernmental Panel
on Climate Change, S. Solomon et al. (Eds), Cambridge University Press,
Cambridge (UK) and New York, 2007.
62. M.J. Menne, C.N. Williams Jr., and M.A. Palecki. On the reliability of the
U.S. surface temperature record. J. Geophys. Res., 115: D11108, 2010.
63. C. Monteleoni, G.A. Schmidt, S. Saroha, and E. Asplund. Tracking climate
models. Statistical Analysis and Data Mining, 4: 372–392, 2011.
64. J.M. Murphy, B.B. Booth, M. Collins et al. A methodology for probabilistic
predictions of regional climate change from perturbed physics ensembles.
Phil. Trans. Roy. Soc. A, 365: 2053–2075, 2007.
65. G.T. Narisma, J.A. Foley, R. Licker, and N. Ramankutty. Abrupt changes in rain-
fall during the twentieth century. Geophys. Res. Lett., 34: L06710, March 2007.
66. S. Negahban, P. Ravikumar, M.J. Wainwright, and B. Yu. A unified frame-
work for high-dimensional analysis of m-estimators with decomposable
regularizers. Arxiv, 2010. http://arxiv.org/abs/1010.2731v1.
67. H. Owhadi, J.C. Scovel, T. Sullivan, M. McKerns, and M. Ortiz. Optimal uncertainty quantification. SIAM Review, 2011 (submitted).
68. R.D. Peng, J.F. Bobb, C. Tebaldi, L. McDaniel, M.L. Bell, and F. Dominici.
Toward a quantitative estimate of future heat wave mortality under global
climate change. Environ. Health Perspect., 119: 701–706, 2010.
69. W.B. Powell and P. Frazier. Optimal learning. In Tutorials in Operations
Research: State-of-the-Art Decision Making Tools in the Information Age.
Hanover, MD, 2008.
70. P. Ravikumar, A. Agarwal, and M.J. Wainwright. Message-passing for graph-
structured linear programs: Proximal projections, convergence and round-
ing schemes. J. Machine Learning Res., 11: 1043–1080, 2010.
71. D.B. Reusch. Ice-core reconstructions of West Antarctic Sea-Ice variability: A
neural network perspective. Fall Meeting of the American Geophysical Union,
2010.
72. C.F. Ropelewski and M.S. Halpert. Global and regional scale precipitation
patterns associated with the El Niño/Southern Oscillation. Mon. Wea. Rev.,
115: 1606–1626, 1987.
73. R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In
Advances in Neural Information Processing Systems 20, 2008.
74. M. Scheffer, S. Carpenter, J.A. Foley, C. Folke, and B. Walker. Catastrophic
shifts in ecosystems. Nature, 413(6856): 591–596, October 2001.
75. G.A. Schmidt. Error analysis of paleosalinity calculations. Paleoceanography,
14: 422–429, 1999.
76. G.A. Schmidt, A. LeGrande, and G. Hoffmann. Water isotope expressions
of intrinsic and forced variability in a coupled ocean-atmosphere model.
J. Geophys. Res., 112, D10103, 2007.
77. T. Schneider. Analysis of incomplete climate data: Estimation of mean values
and covariance matrices and imputation of missing values. J. Climate, 14:
853–871, 2001.
78. S.D. Schubert, M.J. Suarez, P.J. Pegion, R.D. Koster, and J.T. Bacmeister. On
the cause of the 1930s dust bowl. Science, 303: 1855–1859, 2004.
79. C. Scovel and I. Steinwart. Hypothesis testing for validation and certification.
J. Complexity, 2010 (submitted).
80. G. Shafer and V. Vovk. A tutorial on conformal prediction. J. Mach. Learn.
Res., 9: 371–421, 2008.
81. J. Shukla. Dynamical predictability of monthly means. J. Atmos. Sci., 38: 2547–2572, 1981.
82. J. Shukla. Predictability in the midst of chaos: A scientific basis for climate
forecasting. Science, 282: 728–731, 1998.
83. F.H. Sinz. A Priori Knowledge from Non-Examples. Diplomarbeit (thesis),
Universität Tübingen, Germany, 2007.
84. F.H. Sinz, O. Chapelle, A. Agrawal, and B. Schölkopf. An analysis of inference
with the universum. In Advances in Neural Information Processing Systems,
20, 2008.
85. J.E. Smerdon. Climate models as a test bed for climate reconstruction meth-
ods: Pseudoproxy experiments. Wiley Interdisciplinary Reviews Climate
Change, in revision, 2011.
86. J.E. Smerdon and A. Kaplan. Comment on “Testing the Fidelity of Methods
Used in Proxy-Based Reconstructions of Past Climate”: The Role of the
Standardization Interval. J. Climate, 20(22): 5666–5670, 2007.
87. J.E. Smerdon, A. Kaplan, and D.E. Amrhein. Erroneous model field repre-
sentations in multiple pseudoproxy studies: Corrections and implications.
J. Climate, 23: 5548–5554, 2010.
88. J.E. Smerdon, A. Kaplan, D. Chang, and M.N. Evans. A pseudoproxy evalua-
tion of the CCA and RegEM methods for reconstructing climate fields of the
last millennium. J. Climate, 24: 1284–1309, 2011.
89. J.E. Smerdon, A. Kaplan, E. Zorita, J.F. González-Rouco, and M.N. Evans.
Spatial performance of four climate field reconstruction methods targeting
the Common Era. Geophys. Res. Lett., 38, 2011.
90. D.M. Smith, S. Cusack, A.W. Colman et al. Improved surface temperature
prediction for the coming decade from a global climate model. Science, 317:
796–799, 2007.
91. L.-K. Soh and C. Tsatsoulis. Unsupervised segmentation of ERS and Radarsat
sea ice images using multiresolution peak detection and aggregated popula-
tion equalization. Int. J. Remote S., 20: 3087–3109, 1999.
92. L.-K. Soh, C. Tsatsoulis, D. Gineris, and C. Bertoia. ARKTOS: An intelligent
system for SAR sea ice image classification. IEEE T. Geosci. Remote S., 42:
229–248, 2004.
93. A. Solomon, L. Goddard, A. Kumar, J. Carton, C. Deser, I. Fukumori,
A. Greene, G. Hegerl, B. Kirtman, Y. Kushnir, M. Newman, D. Smith,
D. Vimont, T. Delworth, J. Meehl, and T. Stockdale. Distinguishing the
roles of natural and anthropogenically forced decadal climate variability:
Implications for prediction. Bull. Amer. Meteor. Soc., 92: 141–156, 2010.
94. S. Sra, S. Nowozin, and S. Wright. Optimization for Machine Learning. MIT
Press, 2011.
95. K. Steinhaeuser, A.R. Ganguly, and N.V. Chawla. Multivariate and multiscale
dependence in the global climate system revealed through complex net-
works. Climate Dynamics, doi:10.1007/s00382-011-1135-9, in press, 2011.
96. K.E. Taylor, R. Stouffer, and G. Meehl. The CMIP5 experimental design. Bull. Amer. Meteorol. Soc., 2011 (submitted).
97. C. Tebaldi and R. Knutti. The use of the multi-model ensemble in probabilis-
tic climate projections. Phil. Trans. Roy. Soc. A, 365: 2053–2075, 2007.
98. M. Tedesco and E.J. Kim. A study on the retrieval of dry snow parameters
from radiometric data using a dense medium model and genetic algorithms.
IEEE T. Geosci. Remote S., 44: 2143–2151, 2006.
99. M. Tedesco, J. Pulliainen, P. Pampaloni, and M. Hallikainen. Artificial neural
network based techniques for the retrieval of SWE and snow depth from
SSM/I data. Remote Sens. Environ., 90: 76–85, 2004.
100. D.M. Thompson, T.R. Ault, M.N. Evans, J.E. Cole, and J. Emile-Geay.
Comparison of observed and simulated tropical climate trends using a for-
ward model of coral δ18O. Geophys. Res. Lett., in review, 2011.
101. R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of
Royal Statistical Society B, 58: 267–288, 1996.
102. M.P. Tingley, P.F. Craigmile, M. Haran, B. Li, E. Mannshardt-Shamseldin,
and B. Rajaratnam. Piecing Together the Past: Statistical Insights into
Paleoclimatic Reconstructions. Technical report 2010-09, Department of
Statistics, Stanford University, 2010.
103. M.P. Tingley and P. Huybers. A Bayesian algorithm for reconstructing cli-
mate anomalies in space and time. Part I: Development and applications to
paleoclimate reconstruction problems. J. Climate, 23(10): 2759–2781, 2010.
104. M.K. Tippett, S.J. Camargo, and A.H. Sobel. A Poisson regression index
for tropical cyclone genesis and the role of large-scale vorticity in genesis.
J. Climate, 24: 2335–2357, 2011.
105. K.E. Trenberth, P.D. Jones, P. Ambenje et al. Observations: Surface and
atmospheric climate change. Climate Change 2007: The Physical Science
Basis. Contribution of Working Group I to the Fourth Assessment Report
of the Intergovernmental Panel on Climate Change, S. Solomon et al. (Eds.),
Cambridge (UK) and New York: Cambridge University Press, 2007.
106. A.A. Tsonis, K.L. Swanson, and P.J. Roebber. What do networks have to do with
climate? Bulletin of the American Meteorological Society, 87(5): 585–595, 2006.
107. A.A. Tsonis and P.J. Roebber. The architecture of the climate network.
Physica A, 333: 497–504, 2004.
108. A.A. Tsonis and K.L. Swanson. Topology and predictability of El Niño and
La Niña networks. Physical Review Letters, 100(22): 228502, 2008.
109. K.Y. Vinnikov, N.C. Grody, A. Robock et al. Temperature trends at the surface
and in the troposphere. J. Geophys. Res., 111: D03106, 2006.
110. F.D. Vitart and T.N. Stockdale. Seasonal forecasting of tropical storms using
coupled GCM integrations. Mon. Wea. Rev., 129: 2521–2537, 2001.
111. M.J. Wainwright and M.I. Jordan. Graphical models, exponential families,
and variational inference. Foundations and Trends in Machine Learning,
1(1–2): 1–305, 2008.
Chapter 5
Computational Data Sciences for Actionable Insights on Climate Extremes and Uncertainty
Auroop R. Ganguly, Evan Kodra, Snigdhansu
Chatterjee, Arindam Banerjee, and Habib N. Najm
CONTENTS
5.1 Overview and Motivation
5.1.1 Climate Extremes: Definitions and Concepts
5.1.2 Societal and Stakeholder Priorities
5.1.3 Computational Data Sciences: Challenges and Opportunities
5.1.3.1 Overview of Research Areas: 1. Extremes Characterization
5.1.3.2 Overview of Research Areas: 2. Uncertainty Assessments
5.1.3.3 Overview of Research Areas: 3. Enhanced Predictions
5.2 Extremes Characterization
5.3 Uncertainty Assessments
5.3.1 Statistical Modeling of Uncertainty in Multimodel Ensembles
5.3.2 Parametric Uncertainties in Individual Climate Models
5.4 Enhanced Predictions
FIGURE 5.1 Uncertainty quantification for climate extremes, which are broadly
construed in this context, represents one of the largest knowledge gaps in terms
of translating the physical science basis of climate to information relevant for
impacts assessments and adaptation decisions, and eventually to mitigation
policy. However, the cascade of uncertainties is difficult to quantify. The soci-
etal costs of action and inaction are both potentially large for climate adaptation
and mitigation policies; hence, uncertainties in climate are important to effec-
tively characterize and communicate. Climate extremes may broadly include
large shifts in regional climate patterns or severe weather or hydrological events
caused or exacerbated by natural climate variability or climate change. This
chapter primarily focuses on the statistical attributes of severe events, or, changes
in tail behavior.
FIGURE 5.2 Remote or in-situ sensor observations and climate model simu-
lations can be investigated through computational data science methods for
multimodel evaluations, enhanced projections, and multiscale assessments
to inform decisions and policy. The volume of climate data from models and observations is expected to grow exponentially over the next several decades (Overpeck et al., 2011), providing a vast set of challenges and opportunities for
data science communities.
Recent event-attribution studies rely on large ensembles of numerical simulations (Pall et al., 2011); these methods can benefit from rigorous uncertainty quantification approaches. New mathematical methods for uncertainty are critically needed in these areas.
5.2 EXTREMES CHARACTERIZATION
There are several challenges in characterizing and analyzing data related
to climate extremes. One of the first challenges is the nature of the data:
observational data are of relatively short duration and typically do not
allow for many important extreme conditions to be manifest, they are
unevenly spread spatially, and data quality is also uneven. Climate model
outputs and reanalysis data do not have several of these problems, but
Mannshardt-Shamseldin et al. (2011) demonstrate that the nature of
extremes from gridded data differs considerably from that of observed data.
Moreover, as Wehner (2010) observes:
…to the extent that climate models can be tuned to reproduce the
recent past, model developers focus on the mean values of climatic
observations, not the tails of their distributions.
Depending on the application, the spatial pattern of rainfall and several other variables can be of interest. Accordingly, the definition and the notion of an extreme may be different. The trifecta of intensity, duration, and frequency (IDF), which is often characterized using extreme value theory (Kao and Ganguly, 2011), is useful in many cases, but not all. Another example is that of cold tem-
peratures, which are important for crop production and food security. The
variables of interest in this example could be the number of days of a cer-
tain level of frost, consecutive frost days, and time spent below a tempera-
ture threshold (Kodra et al., 2011a). Not all such definitions of “extremes”
lend themselves to common, theoretically satisfying statistical analysis
(Coles, 2001).
Another potential problem is that of identification of extreme events
versus rare events, which are not always the same; in other words, an event
might be extreme in impact but not rare, and vice versa. The definition of
an extreme event may often be determined by its impact, and this defini-
tion will, in turn, often determine its rarity. The rarity of the defined events,
along with other data properties, will dictate which statistical inference
approaches may be appropriate. In some cases, summary measures have
been used to obtain conclusions about extreme events (Goswami et al.,
2006), although subsequent uses of the extreme-value model have pro-
vided different conclusions on similar data (Ghosh et al., 2011). Also, as
Ferro and Segers (2003) observe, extremes can be clustered, which may
present additional challenges related to the independence assumed by
some extreme value statistical approaches.
From a purely statistical perspective, there is a gap between finite
sample data-based extreme events and the general asymptotic theory
that is used for extreme event analysis. Classic extreme-value statisti-
cal approaches attempt to extrapolate the extreme behavior of variables
by fitting distributions to tail observations, such as annual maxima or
exceedances above or below some predetermined (quantile) threshold
(Coles, 2001; Kharin et al., 2007). Note that the generalized extreme value
distribution or the generalized Pareto distribution, which have been used
in the climate extremes literature (Kharin and Zwiers, 2000; Perkins
et al., 2009; Kao and Ganguly, 2011), are asymptotic limits of probabili-
ties relating to finite-sample size extreme events, and need not be exact
characterizations. Also, most real data are temporally and spatially cor-
related, a fact that is often ignored in computing return-level characteris-
tics, quantifying uncertainty, or making inference. There is no consensus
about the best parameter estimation and inference technique for extreme-
value distributions (Hosking and Wallis, 1997; Kharin and Zwiers, 2000;
Coles and Dixon, 1999; Kharin and Zwiers, 2005), and approaches for
including information from covariables are still in development (Hall and
Tajvidi, 2000).
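As a small worked example of the block-maxima approach, the sketch below fits a generalized extreme value (GEV) distribution to synthetic annual maxima and reads off a 100-year return level; note that scipy's shape parameter c equals minus the ξ commonly used in the climate literature:

import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(7)
# 60 synthetic "annual maxima" drawn from a known GEV.
annual_max = genextreme.rvs(c=-0.1, loc=30.0, scale=5.0, size=60,
                            random_state=rng)

c_hat, loc_hat, scale_hat = genextreme.fit(annual_max)
# 100-year return level: the quantile exceeded with probability 1/100 per year.
rl_100 = genextreme.ppf(1.0 - 1.0 / 100.0, c_hat, loc=loc_hat, scale=scale_hat)
print("estimated 100-year return level:", round(rl_100, 2))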
The bootstrap, broadly speaking, is a class of resampling techniques that
can be employed to quantify sampling variability (uncertainty) in param-
eter estimation, among other uses (Efron, 1979). Parametric bootstrap and
the traditional nonparametric bootstrap approaches of Efron (1979) were
used in conjunction with the L-moments method and the maximum like-
lihood method for studying climate extremes in Kharin and Zwiers (2000;
2005; 2007), who also compared various estimation techniques and listed
several caveats. Inference for climate extremes may benefit from a bet-
ter understanding of the limits and appropriateness of popular statistical
inference procedures (such as extreme value theory), as well as the appli-
cation and/or creation of other approaches that relax assumptions or are
robust to limitations of available extreme data.
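Continuing the GEV example above, the sketch below wraps the fit in a nonparametric bootstrap to attach a rough confidence interval to the 100-year return level; resampling the maxima independently is itself an assumption that, as noted, real spatiotemporal dependence can violate:

import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(8)
annual_max = genextreme.rvs(c=-0.1, loc=30.0, scale=5.0, size=60,
                            random_state=rng)

levels = []
for b in range(500):                       # 500 bootstrap replicates
    resample = rng.choice(annual_max, size=annual_max.size, replace=True)
    c_b, loc_b, scale_b = genextreme.fit(resample)
    levels.append(genextreme.ppf(0.99, c_b, loc=loc_b, scale=scale_b))

lo, hi = np.percentile(levels, [2.5, 97.5])
print("100-year return level, 95%% bootstrap CI: [%.1f, %.1f]" % (lo, hi))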
5.3 UNCERTAINTY ASSESSMENTS
5.3.1 Statistical Modeling of Uncertainty in Multimodel Ensembles
Here we discuss the state-of-the-art in uncertainty quantification (UQ)
for situations where ensembles of global climate models or Earth system
models (GCMs/ESMs) are used to assess regional climate change. While
statistical and dynamical (regional climate models) downscalings are often
used for regional assessments, they are in turn driven by ESMs, and hence
UQ in ESMs remains an important challenge. UQ efforts are often imbued with
a sense of urgency, and ensembles of ESMs are tools from which practical
and timely uncertainty assessments can be readily formed.
Structural uncertainty, or that which arises from variations in the
mathematical mechanics of climate models, is the principal focus of
UQ in approaches discussed in this section; it has been studied in sev-
eral forms with multimodel ensembles where weights are assigned to
individual models as a measure of their reliability. We distinguish the
ensemble approaches discussed here from other UQ methodologies—for
example, physics perturbed ensembles—that have been used to explore
parameter uncertainty within single climate models (Stainforth et al.,
2005), and approaches based on or similar to polynomial chaos expan-
sion (see Section 5.3.2). It is important to be aware of all approaches for quantifying uncertainty.
In the Reliability Ensemble Averaging (REA) approach (Giorgi and Mearns, 2002), each GCM $j$, with simulated current climate $X_j$ and simulated future change $Y_j$, receives a model-bias weight and a model-convergence weight:

$$\lambda_{B,j} = \min\!\left(1, \frac{1}{|X_j - \mu|}\right), \qquad (5.1)$$

$$\lambda_{C,j} = \min\!\left(1, \frac{1}{|Y_j - \tilde{Y}|}\right), \qquad (5.2)$$
If $1/|X_j - \mu|$ or $1/|Y_j - \tilde{Y}|$ is more than 1, then 1 is chosen as $\lambda_{B,j}$ or $\lambda_{C,j}$, with the notion that $|X_j - \mu|$ or $|Y_j - \tilde{Y}|$ could have been small just by chance. $\tilde{Y}$ is unknown and thus must be estimated using the weights:
$$\lambda_j = \left(\lambda_{B,j}^{m} \, \lambda_{C,j}^{n}\right)^{1/mn}, \qquad (5.3)$$

$$\tilde{Y} = \frac{\sum_j \lambda_j Y_j}{\sum_j \lambda_j}. \qquad (5.4)$$
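Because $\tilde{Y}$ depends on the weights and the weights on $\tilde{Y}$, the estimate is naturally computed by iteration. The sketch below does so on synthetic model values; the numbers, and the implicit unit scaling inside the min(1, ·) terms, are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(9)
J = 12                                    # number of GCMs in the ensemble
X = rng.normal(1.0, 0.7, J)               # each model's current-climate value
Y = rng.normal(3.0, 1.0, J)               # each model's projected change
mu = 1.0                                  # observed current climate
m, n = 1, 1                               # REA exponents

y_tilde = Y.mean()                        # start from the unweighted mean
for _ in range(100):
    lam_b = np.minimum(1.0, 1.0 / np.maximum(np.abs(X - mu), 1e-12))       # Eq. (5.1)
    lam_c = np.minimum(1.0, 1.0 / np.maximum(np.abs(Y - y_tilde), 1e-12))  # Eq. (5.2)
    lam = (lam_b ** m * lam_c ** n) ** (1.0 / (m * n))                     # Eq. (5.3)
    new = np.sum(lam * Y) / np.sum(lam)                                    # Eq. (5.4)
    if abs(new - y_tilde) < 1e-10:
        break
    y_tilde = new

print("REA-weighted projection:", round(y_tilde, 3))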
Additionally, the criterion of GCM skill (bias in most recent work)
is difficult to define and evaluate; in most cases, it is difficult to determine
whether metrics measuring past GCM skill will translate to the future
(Tebaldi and Knutti, 2007; Knutti et al., 2010).
The REA was admittedly ad hoc; however, its two notions of GCM skill
and consensus (Giorgi and Mearns, 2002) have formed the foundation for
a prominent line of work, beginning with Tebaldi et al., (2004), that for-
malized them in a Bayesian model. One of the most recent versions of this
statistical model can be found in Smith et al. (2009), which also allows for
the joint consideration of multiple regions. Specifically, using this model,
a posterior distribution for past or current temperature μ and future tem-
perature υ can be simulated from a Markov Chain Monte Carlo (MCMC)
sampler using a weight λj for each GCM j. Next, each λj can be simulated
from a posterior by considering the bias and consensus of GCM j. The
weights λj then inform a new estimate of μ and υ, which informs a new
estimate of each λj , and so on. Specifically, λj follows a Gamma posterior
distribution with the following expectation:
$$E\left[\lambda_j \mid \cdot\,\right] = \frac{a+1}{b + 0.5\left(X_j - \mu\right)^2 + 0.5\,\theta\left(Y_j - \upsilon - \beta\left(X_j - \mu\right)\right)^2} \qquad (5.5)$$
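In code, the conditional expectation in Equation (5.5) is a one-liner; the sketch below is meant only to make the roles of the bias term (X_j − μ)² and the consensus term (Y_j − υ − β(X_j − μ))² explicit.

```python
import numpy as np

def lambda_posterior_mean(X, Y, mu, upsilon, beta, theta, a, b):
    # Equation (5.5): expectation of the Gamma posterior for model weight j;
    # a larger bias or larger disagreement with the consensus shrinks the weight
    bias = 0.5 * (X - mu) ** 2
    consensus = 0.5 * theta * (Y - upsilon - beta * (X - mu)) ** 2
    return (a + 1) / (b + bias + consensus)
```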
[FIGURE 5.3: PDFs plotted over a horizontal axis of 0 to 10 Deg. C; legend values for the fixed-θ cases: 67%, 50%, 25%, 20%, and 9%.]
FIGURE 5.3 The univariate (one region) Bayesian model from Smith et al. (2009)
illustrates the importance of the parameter θ in dictating the spread of the prob-
ability density function (PDF) for change in regional mean temperature. This
particular set of PDFs is obtained for spatiotemporally averaged Greenland temperature change from 1970–1999 to 2070–2099. The horizontal axis measures change in degrees Celsius, while the vertical axis measures frequency. The legend
indicates the condition of θ and from top to bottom corresponds to PDFs from
narrow to wide: “random” is the narrowest density, and treats θ as a random
unknown quantity as in Smith et al. (2009). For the remaining PDFs, values of θ
are fixed at different quantities that come to represent the relative importance of
convergence versus bias (where the importance of bias is 100% minus that of θ).
Notably, treating θ as a random quantity yields a result where convergence is
favored much more than bias.
make sense that that uncertainty should actually increase. The degree of
shrinkage toward the mean is dictated by the parameter θ. Its importance is stressed in Figure 5.3 and clear from Equation 5.5: if θ ≫ 1, then, holding all else constant, consensus is favored by the statistical model.
Indeed, an earlier work by Tebaldi et al. (2004) featured a slight variant
of their statistical model with a provision that, through a prior distribu-
tion, restricted the influence of θ; this provision was not included later by
Smith et al. (2009). While the above represents the most prominent line
of work on combining multimodel ensembles for quantifying uncertainty
in regional climate, a few other initial approaches have been developed.
These approaches have extended the broad multimodel UQ line of work by
integrating methodology for quantifying inter-model covariance (Greene
et al., 2006) and spatial variability structure (Furrer et al., 2007; Sain et al.,
$$\chi \approx \sum_{k=0}^{P} \alpha_k \Psi_k\left(\xi_1, \xi_2, \ldots, \xi_n\right) \qquad (5.6)$$
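As a small, self-contained illustration of such an expansion (a sketch only: one dimension, with probabilists' Hermite polynomials as the basis Ψ_k and ξ ~ N(0, 1)), the coefficients α_k can be computed by Gauss–Hermite quadrature, using E[He_k²] = k!:

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import hermegauss, hermeval

def pce_coefficients(f, order=5, nquad=40):
    """Coefficients alpha_k of a 1-D probabilists' Hermite chaos expansion
    of chi = f(xi), xi ~ N(0,1): alpha_k = E[f(xi) He_k(xi)] / k!."""
    nodes, weights = hermegauss(nquad)       # quadrature for weight e^{-x^2/2}
    coeffs = np.empty(order + 1)
    for k in range(order + 1):
        he_k = hermeval(nodes, np.eye(order + 1)[k])   # He_k at the nodes
        mean = np.sum(weights * f(nodes) * he_k) / np.sqrt(2 * np.pi)
        coeffs[k] = mean / math.factorial(k)           # E[He_k^2] = k!
    return coeffs

alpha = pce_coefficients(np.exp)             # chaos expansion of chi = exp(xi)
print(hermeval(0.3, alpha), np.exp(0.3))     # truncated surrogate vs. exact
```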
5.4 ENHANCED PREDICTIONS
The key desiderata from predictive models in the context of extremes
include accurate and uncertainty-quantified projections of crucial vari-
ables related to extremes as well as succinct characterizations of covariates
and climate processes collectively influencing extremes. Such character-
izations must be cognizant of complex and possibly nonlinear dependency
patterns while staying physically interpretable, thereby yielding scien-
tific understanding of extremes and the processes driving them. While
uncertainty quantification methods were discussed in Section 5.3, we now
briefly introduce some methodology that could be useful in enhancing
predictions and perhaps reducing uncertainty in crucial climate variables
that are not captured well by current-generation physical models.
Standard predictive models, such as least squares or logistic linear regression, fall short of such desiderata in multiple ways. The number p of covariates and fine-scale climate processes potentially influencing key variables such as extreme precipitation far surpasses the number of examples n of such extreme events. In the n ≪ p regime for regression, consistency guarantees from standard theory break down, implying that the model inferred is not even statistically meaningful (Girko, 1995; Candès and Tao,
2007). Moreover, such standard models will assign nonzero regression
ACKNOWLEDGMENTS
This research was primarily supported by the National Science Foundation through its Expeditions in Computing program (grant number 1029166).
ARG and EK were supported in part by a Northeastern University grant,
as well as by the Nuclear Regulatory Commission. HNN acknowledges
support by the U.S. Department of Energy (DOE), Office of Basic Energy
Sciences (BES) Division of Chemical Sciences, Geosciences, and Biosciences.
Sandia National Laboratories is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.
REFERENCES
Agovic, A., Banerjee, A., and Chatterjee, S. 2011. Probabilistic Matrix Addition.
International Conference on Machine Learning (ICML).
Álvarez, M.A., and Lawrence, N.D. 2011. Computationally efficient convolved
multiple output Gaussian processes. Journal of Machine Learning Research,
12: 1425–1466.
Annan, J.D., and Hargreaves, J.C. 2006. Using multiple observationally-based
constraints to estimate climate sensitivity. Geophysical Research Letters, 33: L06704, doi: 10.1029/2005GL025259.
Babacan, S., Molina, R., and Katsaggelos, A. 2010. Bayesian compressive sensing using Laplace priors. IEEE Transactions on Image Processing, 19(1): 53–63.
Banerjee, S., Gelfand, A.E., Finley, A.O., and Sang, H. 2008. Gaussian predictive
process models for large spatial data sets. Journal of the Royal Statistical
Society B, 70(4): 825–848.
Buser, C.M., Künsch, H.R., Lüthi, D., Wild, M., and Schär, C. 2009. Bayesian multi-
model projection of climate: Bias assumptions and interannual variability.
Climate Dynamics, 33: 849–868, doi:10.1007/s00382-009-0588-6.
Caflisch, R. 1998. Monte Carlo and quasi-Monte Carlo methods. Acta Numerica,
7: 1–49.
Candès, E., and Tao, T. 2007. The Dantzig selector: Statistical estimation when p is
much larger than n. Annals of Statistics, 35(6): 2313–2351.
Candès, E.J., and Wakin, M.B., 2008. An introduction to compressive sampling.
IEEE Signal Processing Magazine, 25(2), 21–30.
Chatterjee, S., Banerjee, A., Chatterjee, S., Steinhaeuser, K., and Ganguly, A.R.
2012. Sparse group lasso: Consistency and climate applications. Proceedings
of the SIAM International Conference on Data Mining (in press).
Christensen, C., Aina, T., and Stainforth, D. 2005. The challenge of volunteer com-
puting with lengthy climate model simulations. First International Conference
on e-Science and Grid Computing. doi: 10.1109/E-SCIENCE.2005.76.
Coles, S.G. 2001. An Introduction to Statistical Modeling of Extreme Values. New
York: Springer, 208 pp.
Coles, S.G., and Dixon, M.J. 1999. Likelihood-based inference for extreme value
models. Extremes, 2(1): 5–23.
Cressie, N., and Johannesson, G. 2008. Fixed rank kriging for very large spatial
data sets. Journal of the Royal Statistical Society B, 70(1): 209–226.
Debusschere, B., Najm, H., Pebay, P., Knio, O., Ghanem, R., and Le Maitre, O.
2004. Numerical challenges in the use of polynomial chaos representations
for stochastic processes. SIAM Journal of Scientific Computing, 26: 698–719.
Donoho, D.L. 2006. Compressed Sensing. IEEE Transactions on Information
Theory, 52(4): 1289–1306.
Drignei, D., Forest, C.E., and Nychka, D. 2009. Parameter estimation for computa-
tionally intensive nonlinear regression with an application to climate model-
ing. Annals of Applied Statistics, 2(4): 1217–1230.
Efron, B. 1979. Bootstrap methods: Another look at the jackknife. Annals of
Statistics, 7: 101–118.
Emanuel, K., Sundararajan, R., and Williams, J. 2008. Hurricanes and global
warming: Results from downscaling IPCC AR4 simulations. Bulletin of the
American Meteorological Society, 89(3): 347–368.
Embrechts, P., Klüppelberg, C., and Mikosch, T. 2011. Modelling Extremal Events
for Insurance and Finance. New York: Springer.
Engelhaupt, E. 2008. Climate change: A matter of national security. Environmental
Science & Technology, 42(20): 7458–7459.
Falk, M., Hüsler, J., and Reiss, R. 2010. Laws of Small Numbers: Extremes and Rare
Events. New York and Berlin: Springer.
Fern, A., and Givan, R. 2003. Online ensemble learning: An empirical study.
Machine Learning, 53: 71–109.
Ferro, C.A.T., and Segers, J. 2003. Inference for clusters of extreme values. Journal
of the Royal Statistical Society, Series B, 65(2): 545–556.
Furrer, R., and Sain, S.R. 2009. Spatial model fitting for large datasets with appli-
cations to climate and microarray problems. Statistics and Computing, 19:
113–128, doi: 10.1007/s11222-008-9075-x.
Furrer, R., Sain, S.R., Nychka, D., and Tebaldi, C. 2007. Multivariate Bayesian
analysis of atmosphere-ocean general circulation models. Environmental
and Ecological Statistics, 14: 249–266.
Ganguly, A.R., Steinhaeuser, K., Erickson, D.J., Branstetter, M., Parish, E.S.,
Singh, N., Drake, J.B., and Buja, L. 2009b. Higher trends but larger uncer-
tainty and geographic variability in 21st century temperature and heat
waves. Proceedings of the National Academy of Sciences of the United States of
America, 106(37): 15555–15559.
Ganguly, A.R., Steinhaeuser, K., Sorokine, A., Parish, E.S., Kao, S.-C., and
Branstetter, M. 2009a. Geographic analysis and visualization of climate
extremes for the Quadrennial Defense Review. Proceedings of the 17th
ACM SIGSPATIAL International Conference on Advances in Geographic
Information Systems, p. 542–543.
Gelfand, A.E., and Banerjee, S. 2010. Multivariate spatial process models. In
A.E. Gelfand, P. Diggle, P. Guttorp, and M. Fuentes (Eds.), Handbook of
Spatial Statistics. Boca Raton, FL: CRC Press.
Ghanem, R., and Spanos, P. 1991. Stochastic Finite Elements: A Spectral Approach.
New York and Berlin: Springer.
Ghosh, S., Das, D., Kao, S.-C., and Ganguly, A.R. 2011. Lack of uniform trends but increasing spatial variability in observed Indian rainfall extremes. Nature Climate Change, 2: 86–91, doi: 10.1038/nclimate1377.
Giorgi, F., and Mearns, L.O. 2002. Calculation of average, uncertainty range, and
reliability of regional climate changes from AOGCM simulations via the
“Reliability Ensemble Averaging” (REA) method. Journal of Climate, 15:
1141–1158.
Girko, V.L. 1995. Statistical Analysis of Observations of Increasing Dimension. New
York: Kluwer Academic.
Goswami, B.N., Venugopal, V., Sengupta, D., Madhusoodanan, M.S., and Xavier,
P.K. 2006. Increasing trend of extreme rain events over India in a warming
environment. Science, 314(5804): 1442–1445.
Govaerts, Y.M., and Lattanzio, A. 2007. Retrieval error estimation of surface albedo
derived from geostationary large band satellite observations: Application
to Meteosat-2 and Meteosat-7 data. Journal of Geophysical Research, 112: D05102, doi: 10.1029/2006JD007313.
Greene, A.M., Goddard, L., and Lall, U. 2006. Probabilistic multimodel regional
temperature projections. Journal of Climate, 19: 4326–4343.
Hagedorn, R., Doblas-Reyes, F.J., and Palmer, T.N. 2005. The rationale behind the
success of multi-model ensembles in seasonal forecasting. I. Basic concept.
Tellus A, 57: 219–233.
Hall, P., and Tajvidi, N. 2000. Nonparametric analysis of temporal trend when
fitting parametric models to extreme value data. Statistical Science, 15(2):
153–167.
Higdon, D.M. 2002. Space and space-time modelling using process convolutions.
In Quantitative Methods for Current Environmental Issues, p. 37–56. Berlin:
Springer-Verlag.
Hosking, J.R.M., and Wallis, J.R. 1997. Regional Frequency Analysis: An Approach
Based on L-Moments. Cambridge: Cambridge University Press.
IPCC. 2007. Fourth Assessment Report (AR4), Working Group I, Chapter 8.
IPCC. 2011. Summary for Policymakers. In Intergovernmental Panel on Climate
Change Special Report on Managing the Risks of Extreme Events and
Disasters to Advance Climate Change Adaptation, Field, C.B., Barros, V.,
Stocker, T.F., Qin, D., Dokken, D., Ebi, K.L., Mastrandrea, M.D., Mach, K.J.,
Plattner, G.-K., Allen, S., Tignor, M., and Midgley, P.M. (Eds.). Cambridge
(UK) and New York: Cambridge University Press.
Jackson, C., Sen, M., and Stoffa, P. 2004. An efficient stochastic Bayesian approach
to optimal parameter and uncertainty estimation for climate model predic-
tions. Journal of Climate, 17(14): 2828–2841.
Jackson, C., Xia, Y., Sen, M., and Stoffa, P. 2003. Optimal parameter and uncertainty estimation of a land surface model: A case example using data from Cabauw, Netherlands. Journal of Geophysical Research, 108(D18): 4583, doi: 10.1029/2002JD002991.
Jackson, C.S., Sen, M.K., Huerta, G., Deng, Y., and Bowman, K.P. 2008. Error
reduction and convergence in climate prediction. Journal of Climate, 21:
6698–6709, doi: 10.1175/2008JCLI2112.1.
Kao, S.-C., and Ganguly, A.R. 2011. Intensity, duration, and frequency of precipita-
tion extremes under 21st-century warming scenarios. Journal of Geophysical
Research, 116(D16119): 14 pages. DOI: 10.1029/2010JD015529.
Karl, T.R., Melillo, J.M., and Peterson, T.C. Eds. 2009. Global Climate Change
Impacts in the United States. Cambridge (UK) and New York: Cambridge
University Press, 196 pp.
Kawale, J., Liess, S., Kumar, A., Steinbach, M., Ganguly, A., Samatova, N.F., Semazzi,
F., Snyder, P., and Kumar, V. 2011. Data guided discovery of dynamic cli-
mate dipoles. Proceedings of the 2011 NASA Conference on Intelligent Data
Understanding (CIDU).
Khan, S., Bandyopadhyay, S., Ganguly, A.R., Saigal, S., Erickson, D.J., Protopopescu,
V., and Ostrouchov, G. 2007. Relative performance of mutual information
estimation methods for quantifying the dependence among short and noisy
data. Physical Review E, 76: 026209.
Khan, S., Ganguly, A.R., Bandyopadhyay, S., Saigal, S., Erickson, D.J., Protopopescu,
V., and Ostrouchov, G. 2006. Nonlinear statistics reveals stronger ties between
ENSO and the tropical hydrological cycle. Geophysical Research Letters, 33:
L24402.
Kharin, V.V., and Zwiers, F.W. 2000. Changes in the extremes in an ensemble
of transient climate simulations with a coupled atmosphere-ocean GCM.
Journal of Climate, 13: 3760–3788.
Kharin, V.V., and Zwiers, F.W. 2005. Estimating extremes in transient climate
change simulations. Journal of Climate, 18: 1156–1173.
Kharin, V.V., Zwiers, F.W., Zhang, X., and Hegerl, G.C. 2007. Changes in tem-
perature and precipitation extremes in the IPCC ensemble of global coupled
model simulations. Journal of Climate, 20: 1419–1444.
Knutti, R. 2010. The end of model democracy? Climatic Change, 102(3-4): 395–404.
Knutti, R., Furrer, R., Tebaldi, C., Cermak, J., and Meehl, G.A. 2010. Challenges
in combining projections from multiple climate models. Journal of Climate,
23(10): 2739–2758.
Kodra, E., Chatterjee, S., and Ganguly, A.R. 2011b. Challenges and opportuni-
ties toward improved data-guided handling of global climate model ensem-
bles for regional climate change assessments. ICML Workshop on Machine
Learning for Global Challenges.
Kodra, E., Ghosh, S., and Ganguly, A.R. 2012. Evaluation of global climate models
for Indian monsoon climatology. Environmental Research Letters, 7, 014012,
7 pp., doi: 10.1088/1748-9326/7/1/014012.
Kodra, E., Steinhaeuser, K., and Ganguly, A.R. 2011a. Persisting cold extremes under 21st-century warming scenarios. Geophysical Research Letters, 38(L08705): 5 pages, doi: 10.1029/2011GL047103.
Koltchinskii, V., and Yuan, M. 2008. Sparse recovery in large ensembles of kernel
machines. Conference on Learning Theory (COLT).
Krishnamurti, T.N. et al. 1999. Improved weather and seasonal climate forecasts
from multimodel superensemble. Science, 285(5433): 1548–1550.
Le Maitre, O., Najm, H., Ghanem, R., and Knio, O. 2004. Multi-resolution analysis
of Wiener-type uncertainty propagation schemes. Journal of Computational
Physics, 197: 502–531.
Liu, S.C., Fu, C., Shiu, C.-J., Chen, J.-P., and Wu, F. 2009. Temperature dependence
of global precipitation extremes. Geophysical Research Letters, 36(L17702):
4 pages. DOI: 10.1029/2009GL040218.
Lozano, A.C., Abe, N., Liu, Y., and Rosset, S. 2009b. Grouped graphical Granger
modeling methods for temporal causal modeling. ACM SIGKDD Conference
on Knowledge Discovery and Data Mining (KDD) 2009, p. 577–586.
Lozano, A.C., Li, H., Niculescu-Mizil, A., Liu, Y., Perlich, C., Hosking, J., and
Abe, N. 2009a. Spatial-temporal causal modeling for climate change attri-
bution. Proceedings of the 15th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, p. 587–596.
Malik, N., Bookhagen, B., Marwan, N., and Kurths, J. 2011. Analysis of spatial and
temporal extreme monsoonal rainfall over South Asia using complex networks.
Climate Dynamics, 39(3–4): 971–987, doi: 10.1007/s00382-011-1156-4.
Mannshardt-Shamseldin, E.C., Smith, R.L., Sain, S.R., Mearns, L.O., and Cooley,
D. 2011. Downscaling extremes: A comparison of extreme value distribu-
tions in point-source and gridded precipitation data. Annals of Applied
Statistics, 4(1), 486–502.
O’Gorman, P.A., and Schneider, T. 2009. The physical basis for increases in precip-
itation extremes in simulations of 21st-century climate change. Proceedings
of the National Academy of Sciences of the United States of America, 106(35):
14773–14777.
Overpeck, J.T., Meehl, G.A., Bony, S., and Easterling, D.R. 2011. Climate
data challenges in the 21st century. Science, 331: 700-702. doi: 10.1126/
science.1197869.
Pall, P., Aina, T., Stone, D.A., Stott, P.A., Nozawa, T., Hilberts, A.G.J., Lohmann, D.,
and Allen, M.R. 2011. Anthropogenic greenhouse gas contribution to flood
risk in England and Wales in autumn 2000. Nature, 470: 382–385.
Perkins, S.E., Pitman, A.J., and Sisson, S.A. 2009. Smaller projected increases in
20-year temperature returns over Australia in skill-selected climate models.
Geophysical Research Letters, 36: L06710, doi:10.1029/2009GL037293.
Pierce, D.W., Barnett, T.P., Santer, B.D., and Gleckler, P.J. 2009. Selecting global cli-
mate models for regional climate change studies. Proceedings of the National
Academy of Sciences USA, 106(21): 8441–8446.
Raskutti, G., Wainwright, M.J., and Yu, B. 2010. Minimax-Optimal Rates for
Sparse Additive Models over Kernel Classes via Convex Programming.
Technical report, http://arxiv.org/abs/1008.3654, UC Berkeley, Department
of Statistics, August.
Ravikumar, P., Wainwright, M.J., and Lafferty, J. 2010. High-dimensional Ising
model selection using L1 regularized logistic regression. Annals of Statistics,
38(3): 1287–1319.
Sain, S., Furrer, R., and Cressie, N. 2008. Combining Ensembles of Regional
Climate Model Output via a Multivariate Markov Random Field Model.
Technical report, Department of Statistics, The Ohio State University.
Saltelli, A., Chan, K., and Scott, E.M. 2000. Sensitivity Analysis. Wiley Series in Probability and Statistics. New York: Wiley.
Sanso, B., Forest, C.E., and Zantedeschi, D. 2008. Inferring climate system proper-
ties using a computer model. Bayesian Analysis, 3(1): 1–38.
Santer, B.D. et al. 2009. Incorporating model quality information in climate
change detection and attribution studies. Proc. Natl. Acad. Sci. USA, 106(35):
14778–14783.
Schiermeier, Q. 2010. The real holes in climate science. Nature, 463: 284–287.
Seni, G., and Elder, J.F. 2010. Ensemble methods in data mining: Improving accu-
racy through combining predictions. Synthesis Lectures on Data Mining and
Knowledge Discovery.
Smith, R.L., Tebaldi, C., Nychka, D., and Mearns, L.O. 2009. Bayesian model-
ing of uncertainty in ensembles of climate models. Journal of the American
Statistical Association, 104(485): 97–116. doi:10.1198/jasa.2009.0007.
Soize, C., and Ghanem, R. 2004. Physical systems with random uncertainties:
Chaos representations with arbitrary probability measure. SIAM Journal of
Scientific Computing, 26: 395–410.
Stainforth, D.A. et al. 2005. Uncertainty in predictions of the climate response to
rising levels of greenhouse gases. Nature, 433: 403–406.
Steinhaeuser, K., Chawla, N.V., and Ganguly, A.R. 2011a. Complex networks as a
unified framework for descriptive analysis and predictive modeling in cli-
mate science. Statistical Analysis and Data Mining, 4(5): 497–511.
Steinhaeuser, K., Ganguly, A.R., and Chawla, N. 2011b. Multivariate and multiscale dependence in the global climate system revealed through complex networks. Climate Dynamics, 39(3–4): 889–895, doi: 10.1007/s00382-011-1135-9.
Sugiyama, M., Shiogama, H., and Emori, S. 2010. Precipitation extreme changes
exceeding moisture content increases in MIROC and IPCC climate mod-
els. Proceedings of the National Academy of Sciences of the United States of
America, 107(2): 571–575.
Tebaldi, C., and Knutti, R. 2007. The use of the multimodel ensemble in probabi-
listic climate projections. Philosophical Transactions of the Royal Society A,
365, 2053–2075.
Tebaldi, C., Mearns, L.O., Nychka, D., and Smith, R.L. 2004. Regional probabili-
ties of precipitation change: A Bayesian analysis of multimodel simulations.
Geophysical Research Letters, 31, L24213, doi:10.1029/2004GL021276.
Tebaldi, C., and Sanso, B. 2009. Joint projections of temperature and precipitation
change from multiple climate models: A hierarchical Bayesian approach.
Journal of the Royal Statistical Society A, 172(1): 83–106.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society, Series B, 58(1): 267–288.
Tollefson, J. 2008a. Climate war games. Nature, 454(7205): 673. August 7.
Tollefson, J. 2008b. Climate war games: The “Angry Red Chart.” Nature Blogs.
30 July: http://blogs.nature.com/news/blog/events/climate_war_game/.
Tollefson, J. 2011. Climate change: Cold spells in a warm world. Nature, 472: 139. Research Highlights, 14 April, doi: 10.1038/472139d.
Tsonis, A.A., Swanson, K., and Kravtsov, S. 2007. A new dynamical mechanism
for major climate shifts. Geophysical Research Letters, 34: L13705, 5 pp.,
doi:10.1029/2007GL030288.
Villagran, A., Huerta, G., Jackson, C.S., and Sen. M.K. 2008. Computational meth-
ods for parameter estimation in climate models. Bayesian Analysis, 3: 823–850, doi: 10.1214/08-BA331.
Wackernagel, H. 2003. Geostatistics: An Introduction with Applications. Berlin:
Springer-Verlag.
Wehner, M. 2010. Source of uncertainty in the extreme value statistics of climate
data. Extremes, 13(2): 205–217.
Weigel, A.P., Knutti, R., Liniger, M.A., and Appenzeller, C. 2010. Risks of model
weighting in multimodel climate projections. Journal of Climate, 23:
4175–4191.
Xiu, D., and Karniadakis, G. 2002. The Wiener-Askey polynomial chaos for sto-
chastic differential equations. SIAM Journal on Scientific Computing, 24:
619–644.
Zhao, P., and Yu, B. 2006. On model selection consistency of lasso. Journal of
Machine Learning Research, 7: 2541–2567.
III
Computational Intelligent Data Analysis for Biodiversity and Species Conservation
Chapter 6
Mathematical Programming Applications to Land Conservation and Environmental Quality
Jacob R. Fooks and Kent D. Messer
CONTENTS
6.1 Introduction 160
6.1.1 Background of Optimization in Conservation Planning 160
6.1.2 An Overview of the Pennsylvania Dirt and Gravel
Roads Program 163
6.2 Theoretical Models and Results 168
6.2.1 Issue 1: How Best to Account for In-Kind Funding 170
6.2.2 Issue 2: Sensitivity of Results Given the Uncertainty in
Project Cost and In-Kind Contribution 172
6.2.3 Issue 3: Sensitivity of Results to State Fund Distribution
Rules 174
6.3 Conclusions 175
References 176
6.1 INTRODUCTION
BLP or CEA, but also that these managers did not consider that being
cost-effective was a major priority, and also reported lacking incentives
to adopt alternative selection approaches (Messer, Allen, and Chen, 2011).
A common concern identified in these studies was a “black-box” per-
ception of BLP. Program managers’ duties can include more than merely
maximizing benefit scores. They also must defend the “value” achieved
from donor, funding agency, and taxpayer money; ensure that partici-
pants get a fair deal from a transparent decision mechanism; and distrib-
ute funds in a manner that is perceived as equitable. BLP, as classically
implemented, has been seen as lacking transparency and the flexibility
necessary to address many of these duties, which can be thought of as
secondary or operational objectives. These objectives may not immedi-
ately impact the primary goal of protecting high-quality land but still may
be important factors in the decision-making process and thus significant
barriers to the adoption of new approaches. Operational objectives can be
incorporated as constraints, as was done by Önal et al. (1998), who used an
optimization model to address both environmental and equity concerns
in a watershed management scenario. Their model maximized total profit
across a watershed with a chance constraint on chemical runoff levels to
account for the stochastic nature of rainfall and a constraint on the equity
of the program’s impact that was measured by an index of deviation from
a uniform loss-sharing level. The study varied these constraints to exam-
ine trade-offs among income, pollution, and equity losses. This approach,
however, still offered little ability to consider the sensitivity of the single
solution provided or alternatives to it.
A second option is to format the process as a Multiple-Objective Linear
Programming (MOLP) problem (also referred to as Goal Programming)
with the secondary objectives included as weighted goals. MOLP has been
applied to several conservation programs: a balancing of economic and
biological objectives over short-term and long-term time frames in fish-
ery management (Drynan and Sandiford, 1985; Mardle and Pascoe, 1999;
Mardle et al., 2000), an optimization of environmental, social, and eco-
nomic goals in energy production (Silva and Nakata, 2009); and manage-
ment of public water resources (Neely et al., 1977; Ballestero et al., 2002).
Önal (1997) considered an approach similar to MOLP in a forest man-
agement setting; he employed a model that, instead of minimizing devia-
tions from a goal as is done in MOLP, used constrained deviations from a
goal to maximize the discounted future harvest value while maintaining
a minimum value for a species diversity index.
* The authors would like to acknowledge the generous assistance of Wayne Kober, Robb Piper, and
Barry Scheetz associated with the state of Pennsylvania’s Dirt and Gravel Roads program.
effect of uncertainty about the actual cost of each project and in-kind contributions on the optimal solution, and investigate how the distribution of funds among counties influences the overall statewide benefit.
Several characteristics of the DGRP are important. First, the program
focuses on local administration, with all allocation and auditing adminis-
tration done at the county level. The state is involved only in training county
program managers, reporting overall program results, and establishing
basic project standards. The state also sets environmentally sustainable
method (ESM) standards that dictate the methods and materials that may
be used to ensure that each project represents an environmentally sustain-
able fix over the long term. All grant recipients must have an employee
who has received ESM training within the previous 5 years to be eligible.
Sites are ranked in terms of potential environmental impact based on a
set of 12 criteria relating to road topology, proximity to a stream, stabil-
ity of the drainage infrastructure, and the amount of canopy cover, among
other considerations. The individual scores from each criterion are totaled
to generate an overall environmental rank that ranges from 0 to 100, and
the ranking is used to assign funding priorities. Additionally, in-kind contributions from grant applicants play a significant role in the program. Those contributions, typically in the form of in-kind equipment and labor, are not required by the state, but counties may enact minimum contribution levels or use the level of these in-kind contributions to adjust the ranking scores. Over the past several years, in-kind support for the program has been in the range of 40% to 50% of program money spent. Also,
counties often enact regulations that limit the total number of contracts
or projects that can be awarded to a township simultaneously, or stipulate
that all applicants must have received at least one funded project before
any applicant can be considered for multiple projects. Further details on
state-level administration of the program are available from Penn State
University Center for Dirt and Gravel Road Studies (CDGRS), which was
created to provide technical services for the program (CDGRS, 2009).
The inclusion of in-kind cost sharing and matching funds is a common
practice in conservation programs and is designed to leverage additional
resources from partner agencies and individuals to achieve their conser-
vation objectives (Kotani, Messer, and Schulze, 2010). Implementation
schemes vary, but most require the participating organization to cover
some percentage of the project’s cost. The numerous other federal, state,
* For an in-depth discussion of matching grants and their implementation, see Boadway and Shah
(2007).
The project selection literature generally does not address issues associated with the incorporation of in-kind requirements into the selection process. In-kind contributions are implicitly accounted for by the reduction in the project cost to the government, but that measure fails to take full advantage of the additional information potentially provided by the size of the in-kind cost share, such as the degree of commitment of the partner organization, and the political benefits of using program funds to leverage resources from other organizations, agencies, and individuals. In this chapter, we incorporate that additional information into an optimization approach by developing a MOLP model that seeks to optimize both conservation outcomes and partner in-kind cost-sharing contributions.
Results from the models show that MOLP offers results that are superior
to approaches currently used by the DGRP while yielding cost-effective
outcomes that are likely to be more practical than solutions generated by
the standard BLP approach.
For the DGRP case study, datasets were available for three Pennsylvania
counties: Bradford, Cambria, and Carbon (displayed in Figure 6.1).
Bradford County is the largest and most rural of the three. It has the sec-
ond largest dirt road network in the state and has been granted an average
of $246,000 per year in funding over the course of the program. Carbon
and Cambria Counties both have a smaller area and are more densely
populated. Carbon County has received, on average, $24,000 per year over
[FIGURE 6.1: Map of Bradford, Carbon, and Cambria Counties, Pennsylvania. FIGURE 6.2: Map of potential and funded DGRP projects in Bradford County.]
the course of the program, while Cambria County has received $17,000
annually. These three counties were recommended by state program per-
sonnel for analysis because the local officials were most willing to cooper-
ate by sharing their data, and these counties were the most representative
of the range of counties participating in the program. Data on the com-
pleted projects and fund distributions were provided by the Center for
Dirt and Gravel Road Studies (CDGRS, 2009). Data on the submitted proj-
ects and procedures for Bradford County by year were obtained from the
Bradford County Conservation District (BCCD) (M. Lovegreen, BCCD
Director, personal communication, November 2, 2009). A map of poten-
tial and funded projects is displayed in Figure 6.2.
Because Cambria and Carbon Counties receive limited funds each year
(less than $30,000, while many projects cost more than $50,000), their
project choices are limited and they use less-complex decision criteria.
Cambria County issued a call for projects when the program was first
initiated and has since worked its way through the initial applicants and
funded all of the projects that met the state’s criteria. Carbon County does
not officially advertise the program, and it funds any appropriate project
that the conservation district manager comes across during the course of
his duties. These processes are difficult to approach with a binary program
as the counties have maintained no data on projects that were not funded.
That is not to say that the selection processes they use are in any way opti-
mal; but because there is little data on alternative projects, it is impossible
to say how much opportunity was missed by not soliciting new projects
on a regular basis.
$$\max\; Z = c^{T}x$$
$$\text{subject to:}\quad Ax \le b, \qquad x_i \in \{0,1\}$$
* Projects are occasionally specially designated as “Trout Unlimited” based on being located in
sensitive watersheds. Those projects always get first priority for all funds. Recall that the Trout
Unlimited organization originally helped organize the DGRP. However, Trout Unlimited projects
are sufficiently rare (there was only one for 2002 through 2008) so they are left out of this analysis.
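A minimal sketch of solving a BLP of this form with an off-the-shelf mixed-integer solver; the benefit scores, costs, and budget below are hypothetical.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# hypothetical project data: environmental ranks e, costs p, and budget B
e = np.array([80.0, 65.0, 90.0, 40.0, 55.0])
p = np.array([50_000.0, 30_000.0, 70_000.0, 15_000.0, 25_000.0])
B = 100_000.0

# maximize e'x  <=>  minimize -e'x, subject to p'x <= B with binary x
res = milp(c=-e,
           constraints=LinearConstraint(p[np.newaxis, :], ub=B),
           integrality=np.ones(len(e)),
           bounds=Bounds(0, 1))
selected = np.flatnonzero(res.x > 0.5)
print("selected projects:", selected, "total benefit:", e[selected].sum())
```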
$$\max\; Z = \sum_{i=1}^{n} x_i e_i$$
$$\text{subject to:}\quad \sum_{i=1}^{n} x_i p_i \le B, \qquad \sum_{i=1}^{n} x_i^{\,j} \le 1, \qquad x_i \in \{0,1\}$$
$$\text{subject to:}\quad \sum_{i=1}^{n} x_i e_i + d_e^{-} - d_e^{+} = E,$$
$$\sum_{i=1}^{n} x_i c_i + d_c^{-} - d_c^{+} = C,$$
$$\sum_{i=1}^{n} x_i p_i \le B, \qquad \sum_{i=1}^{n} x_i^{\,j} \le 1$$
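The deviation variables make this model easy to hand to the same mixed-integer solver. The goal-programming objective itself is not reproduced in this excerpt; the sketch below assumes it minimizes weighted under-achievement of the two goals with weights λ and (1 − λ), and all data and the normalization by the goal levels are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# hypothetical data: benefits e_i, in-kind contributions c_i, costs p_i
e = np.array([80.0, 65.0, 90.0, 40.0, 55.0])
c_ik = np.array([20_000.0, 5_000.0, 35_000.0, 10_000.0, 15_000.0])
p = np.array([50_000.0, 30_000.0, 70_000.0, 15_000.0, 25_000.0])
B, E, C = 100_000.0, 250.0, 60_000.0    # budget and the two goal levels
n, lam = len(e), 0.5                     # assumed weight between the goals

# decision vector z = [x_1..x_n, d_e-, d_e+, d_c-, d_c+]
obj = np.concatenate([np.zeros(n), [lam / E, 0.0, (1 - lam) / C, 0.0]])
A = np.zeros((2, n + 4))
A[0, :n], A[0, n:n + 2] = e, (1.0, -1.0)      # e'x + d_e- - d_e+ = E
A[1, :n], A[1, n + 2:] = c_ik, (1.0, -1.0)    # c'x + d_c- - d_c+ = C
budget_row = np.concatenate([p, np.zeros(4)])[np.newaxis, :]

res = milp(obj,
           constraints=[LinearConstraint(A, lb=[E, C], ub=[E, C]),
                        LinearConstraint(budget_row, ub=B)],
           integrality=np.r_[np.ones(n), np.zeros(4)],
           bounds=Bounds(np.zeros(n + 4), np.r_[np.ones(n), np.full(4, np.inf)]))
print("selected projects:", np.flatnonzero(res.x[:n] > 0.5))
```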
[Figure: Trade-off between total in-kind cost share (horizontal axis, $600,000 to $1,000,000) and total environmental benefit (vertical axis, 4,000 to 7,000) as the goal weight λ varies from 0.0 to 1.0.]
below the original quote, with an average of 129% above. From a planning
perspective, the potential for budget overruns is great.
Traditional sensitivity analysis is not helpful in this case because binary programming is used. Ideally, costs and in-kind contributions could be systematically varied independently over a range of percentages of over- and under-achievement and the results compared. This is not feasible, however, because 30 projects with only three such distortions would require more than 4 × 10¹⁴ separate optimizations.
As an alternative sensitivity analysis, we analyzed this situation using a bootstrap type of approach that evaluates the persistence of projects in optimal solutions over a series of random variations in price and in-kind contributions. All costs and in-kind contributions were independently varied by a factor within the 10th to 90th percentile range: −28% to +140% for costs and −57% to +307% for in-kind funding. The opti-
mal solution was recorded for several random samples, and the “per-
sistence” or percentage of times a particular project was recommended
was calculated. This persistence score can be used to identify projects
for which the expense is a “sure thing” versus those that are particularly
sensitive to cost. This analysis was performed for 2002 using 55 observa-
tions. Table 6.2 offers a selection of results for projects that were actu-
ally funded. Projects that have persistence scores at or near 1.0 are very
likely to be efficient in granting environmental benefits even under a fair
degree of cost uncertainty. Projects with a large variance, such as A606
TABLE 6.2 Selection of Persistence Scores

Project No.   Rank   Multiple-Objective Linear Programming Value   Persistence
Z010            1                        1                            1.00
Z006            2                        0                            0.51
X313            3                        1                            1.00
A398            4                        0                            1.00
A438            5                        1                            0.98
A300            6                        1                            1.00
A872            7                        0                            0.65
A518            8                        1                            0.96
A241            9                        1                            0.15
X353           10                        1                            1.00
A606           11                        1                            0.07
A949           12                        1                            1.00
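A sketch of this persistence calculation; `solve_blp` is a hypothetical function returning the optimal 0/1 selection vector for given costs and in-kind values, and drawing the distortion factors uniformly from the stated percentile range is our simplifying assumption.

```python
import numpy as np

def persistence_scores(costs, inkind, solve_blp, draws=55, seed=1):
    """Perturb costs and in-kind contributions independently, re-solve the
    selection problem for each draw, and report the fraction of draws in
    which each project appears in the optimal solution."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(costs))
    for _ in range(draws):
        c = costs * rng.uniform(1 - 0.28, 1 + 1.40, size=len(costs))
        k = inkind * rng.uniform(1 - 0.57, 1 + 3.07, size=len(inkind))
        counts += solve_blp(c, k)   # hypothetical solver: returns 0/1 vector
    return counts / draws
```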
[Figure: Total benefits achieved (vertical axis, 1,750–2,100) versus amount of budget transfer (horizontal axis, $0–$160,000).] The x-axis represents the amount of budget transferred from Bradford County to Cambria County, while the y-axis represents
the total benefit score for the two counties. The maximum occurs at an
environmental benefit score of 2,070, with a $30,000 redistribution from
Bradford County to Cambria County.
At the maximum benefit level, Bradford County still receives $270,000
(90%) while Cambria’s budget grows significantly to $45,000 (300%). The
total environmental benefit, however, begins to decline steadily after that
point. So while a minor redistribution to counties receiving fewer funds
could have a significant advantage, the bulk of the DGRP funds should
continue to go to the counties with larger dirt road systems.
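A sketch of how such a transfer curve can be traced; `county_benefit` is a hypothetical function returning a county's optimal total benefit for a given budget (e.g., via the BLP above), and the base budgets are inferred from the percentages quoted in the text.

```python
import numpy as np

def total_benefit(transfer, county_benefit, bradford_projects, cambria_projects):
    # base budgets implied by the text: $300,000 (Bradford), $15,000 (Cambria)
    return (county_benefit(bradford_projects, 300_000 - transfer)
            + county_benefit(cambria_projects, 15_000 + transfer))

# sweep the transfer amount to trace the curve shown in the figure, e.g.:
# transfers = np.arange(0, 160_001, 10_000)
# curve = [total_benefit(t, county_benefit, bradford, cambria) for t in transfers]
```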
6.3 CONCLUSIONS
This chapter has applied optimization through mathematical programming techniques to the problem of water quality conservation in the state of Pennsylvania. It demonstrated how Binary Linear Programming can be extended to deal with a variety of challenges encountered in an on-the-ground conservation situation. The Pennsylvania Dirt and Gravel Roads
REFERENCES
Aldrich, R., and Wyerman, J. (2006). 2005 National Land Trust Census Report.
Washington, DC: The Land Trust Alliance.
American Farmland Trust (2010). Status of State PACE Programs. Online at www.
farmland.org (accessed on March 17, 2010).
Babcock, B.A., Lakshminarayan, P.G., Wu, J., and Zilberman, D. (1997). Targeting
tools for the purchase of environmental amenities. Land Economics, 73,
325–339.
Baker, M., Payne, S., and Smart, M. (1999). An empirical study of matching grants:
The “cap on the CAP.” Journal of Public Economics, 72, 269–288.
Ballestero, E., Alarcon, S., and Garcia-Bernabeu, A. (2002). Establishing politically
feasible water markets: A multi-criteria approach. Journal of Environmental
Management, 65, 411–429.
Boadway, R., and Shah, A. (2007). Intergovernmental Fiscal Transfers. Washington,
DC: The World Bank.
Borge, L.-E., and Rattsø, J. (2008). Local Adjustment to Temporary Matching
Grant Programs: A Dynamic Analysis. Mimeo, Department of Economics,
Norwegian University of Science and Technology.
Bucovetsky, S., Marchand, M., and Pestieau, P. (1998). Tax competition and revela-
tion of preferences for public expenditure. Journal of Public Economics, 44,
367–390.
Center for Dirt and Gravel Road Studies, Penn State University (2009).
Center for Dirt and Gravel Road Studies: Better Roads, Cleaner Streams.
Website at www.dirtandgravel.psu.edu.
Chernick, H. (1995). Fiscal effects of block grants for the needy: A review of the
evidence. In Proceedings of the National Tax Association Annual Conference
on Taxation, pp. 24–33.
Claassen, R. (2009). Conservation Policy Briefing Room, Economic Research
Service, U.S. Department of Agriculture. Online at www.ers.usda.gov/
briefing/conservationpolicy/background.htm (accessed February 2010).
Dantzig, G.B. (1957). Discrete-variable extremum problems. Operations Research,
5(2), 266–288.
Drynan, R.G., and Sandiford, F. (1985). Incorporating economic objectives in
goal programs for fishery management. Marine Resource Economics, 2(2),
175–195.
European Union Directorate-General for Agriculture and Rural Development
(2009). Rural Development in the European Union: Statistical and Economic
Information Report 2009. Brussels: The European Union.
Fooks, J., and Messer, K.D. (2012). Maximizing conservation and in-kind cost share: Applying goal programming to forest protection. Journal of Forest Economics, 18: 207–217.
Kaiser, H.M., and Messer, K.D. (2011). Mathematical Programming for Agricultural,
Environmental, and Resource Economics. Hoboken, NJ: Wiley.
Kotani, K., Messer, K.D., and Schulze, W.D. (2010). Matching grants and chari-
table giving: Why some people sometimes provide a helping hand to fund
environmental goods. Agricultural and Resource Economics Review, 39(2),
324–343.
Mardle, S., and Pascoe, S. (1999). A review of applications of multiple criteria
decision-making techniques to fisheries. Marine Resource Economics, 14,
41–63.
Mardle, S., Pascoe, S., Tamiz, M., and Jones, D. (2000). Resource allocation in
the North Sea demersal fisheries: A goal programming approach. Annals of
Operations Research, 94, 321–342.
Chapter 7
Data Analysis Challenges in the Future Energy Domain
CONTENTS
7.1 Introduction 182
7.2 The Energy Market Today 188
7.2.1 Actors in the Energy Market 188
7.2.2 Energy Trading 190
7.2.2.1 Long-Term Trading 191
7.2.2.2 Short-Term Trading 191
7.2.3 Energy Balancing 192
7.3 Future Energy Scenarios 193
7.4 Data Analysis Challenges 200
7.4.1 Data Management 201
7.4.2 Data Preprocessing 204
7.4.3 Predictions, Forecasts, and Classifications 206
7.4.3.1 Time-Series Forecasting 206
7.4.3.2 Predicting and Classifying User Behavior 210
7.4.3.3 Predicting and Classifying Consumption Events 212
7.4.4 Detection on Unknown Patterns 214
7.4.4.1 Clustering 216
7.4.4.2 Outlier Mining 218
7.1 INTRODUCTION
1. Fossil and nuclear resources are limited, and their exploitation will
become more expensive (not economically sustainable).
2. The combustion of fossil sources leads to CO2 emissions, which drive
the greenhouse effect (not environmentally sustainable).
3. Nuclear power plants bear certain risks in their operation and produce
nuclear waste, which needs to be protected from unauthorized access.
same entity (e.g., a generator might also act as a retailer who sells energy
to consumers). The number of (new) actors in the energy market is also
interesting from a data-analysis point of view: Most of these actors have
access to potentially interesting data or could profit from data provided by
different actors. Investigating the actors and their data leads to interesting
opportunities for data analysis and maybe even to new roles, such as ana-
lytic service providers. In the following, we introduce the most common
actors and roles that are relevant for the remainder of this chapter [2]:
help of the TSOs. The BRP is financially responsible for any imbalances that arise.
• Retailer. A company that buys electrical energy from generators and
sells it to consumers. The retailer also has to interact with DSOs and
possibly metering operators to provide grid access to the consumers.
• Metering operator. Provides, installs, and maintains metering equip-
ment and measures the consumption and/or generation of electrical
energy. The readings are then made accessible (possibly in an aggre-
gated manner) to the retailer, to the consumer/prosumer, and/or to
other actors. Frequently, the role of the metering operator is taken
on by DSOs.
• Energy market operator. An energy market may be operated for dif-
ferent kinds of energy to facilitate the efficient exchange of energy or
related products such as load-shifting volumes (demand response).
Typical markets may involve generators selling energy on the whole-
sale market and retailers buying energy. Energy market operators
may employ different market mechanisms (e.g., auctions, reverse
auctions) to support the trade of energy in a given legal framework.
• Value-added service providers. Such providers can offer various ser-
vices to different actors. One example could be to provide analytic
services to final customers, based on the data from a metering operator.
7.2.2 Energy Trading
In order to supply their customers (consumers) with electrical energy,
retailers must buy energy from generators. In the following, we do not
consider how prosumers sell their energy, as this varies in different
countries. While consumers traditionally pay a fixed rate per consumed
kilowatt-hour (kWh) of energy to the retailers (typically in addition to a
fixed monthly fee), the retailers can negotiate prices with the generators
(directly or at an energy exchange). A procurement strategy for a retailer
may be to procure the larger, predictable share of its energy needs in advance on a long-term basis. This requires analytic services and sufficiently large collections of consumption data. The remaining energy demand, which is difficult to predict in the long run, is procured on a short-term basis at typically higher prices. Similarly, generators of elec-
tricity need to predict in advance what amounts of energy they can sell.
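As one simple illustration of such demand prediction (a sketch on synthetic hourly data; Holt–Winters exponential smoothing here is just one stand-in for the forecasting models actually used in procurement):

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# synthetic hourly load with a daily cycle, standing in for real meter data
idx = pd.date_range("2012-01-01", periods=24 * 28, freq="h")
rng = np.random.default_rng(7)
load = pd.Series(100 + 30 * np.sin(2 * np.pi * idx.hour / 24)
                 + rng.normal(0, 3, len(idx)), index=idx)

# Holt-Winters smoothing with a 24-hour season; forecast the next day
fit = ExponentialSmoothing(load, seasonal="add", seasonal_periods=24).fit()
print(fit.forecast(24).round(1))
```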
7.2.2.1 Long-Term Trading
While electric energy has traditionally been traded by means of bilat-
eral over-the-counter (OTC) contracts, an increasing amount of energy is
nowadays traded at power exchanges. Such exchanges trade standardized
products, which makes trading easier and increases the comparability of
prices. While there are different ways of trading energy, double auctions
as known from game theory [54] are the dominant means for finding the
price [137].
As one example, the European Energy Exchange AG (EEX) in Leipzig,
Germany, trades different kinds of standardized energy futures and options.
These products describe the potential delivery of a certain amount of energy at
certain times in the future. The delivery must be within one of the transporta-
tion grids. In addition to the traded products, they also provide clearinghouse
services for OTC contracts. The volume traded at the EEX derivatives market
for Germany amounted to 313 terawatt-hours (TWh) in 2010, 1,208 TWh
including OTC transactions [6]. The latter number roughly corresponds to
two times the energy consumed in Germany in the same time frame.
7.2.2.2 Short-Term Trading
Short-term trading becomes necessary as both consumption and produc-
tion cannot be predicted with 100% accuracy. Therefore, not all needs for
energy can be covered by long-term trading. In particular, fluctuating
renewable energies make correct long-term predictions of production vir-
tually impossible. As one example, wind energy production can only be
predicted with sufficient accuracy for a few hours in advance. Therefore,
energy exchanges are used for short-term trading, again making use of
different kinds of auctions. At such exchanges, retailers can buy electrical
energy for physical delivery in case their demand is not covered by futures.
Again, the delivery must be within one of the transportation grids.
The EPEX Spot SE (European Power Exchange) in Paris, France, trades
energy for several European markets. There, trading is divided as follows:
7.2.3 Energy Balancing
To ensure a reliable and secure provision of electrical energy without any
power outages, the energy grids must be stable at any point in time. In par-
ticular, there must be assurance that the production always equals the con-
sumption of energy. In practice, avoiding imbalances between generation
and demand is challenging due to stochastic consumption behavior, unpre-
dictable power-plant outages, and fluctuating renewable production [137].
On a very coarse temporal granularity, a balance is achieved by means
of energy trading (see Section 7.2.2) and data-analysis mechanisms, in
particular prediction and forecast: Retailers buy the predicted demand of
their customers, and generators sell their predicted generation. As men-
tioned in Section 7.2.1, the BRPs make sure that the scheduled supply of
energy corresponds to the expected consumption of energy. This expected
consumption is also derived using data-analysis techniques. The TSOs
are responsible for the stability of the grids. In the following, we describe
how they do so.
From a technical point of view, a decrease in demand leads to an
increase in frequency, while a decrease in production leads to a decrease
in frequency (and vice versa for increases). Deviations from the fixed frequency of 50 Hz in electricity grids should be corrected in real time, as they might damage the devices attached to the grid.
Typically, frequency control is realized in a three-stage process: primary,
secondary, and tertiary control. The primary control is responsible for very
short deviations (15 to 30 seconds), the secondary control for short devia-
tions (max. 5 minutes), and the tertiary control for longer deviations (max.
15 minutes) [137]. The control process can be realized by various means,
metering is not only an enabler for other scenarios. Giving users access
to their energy consumption profiles can make them more aware of their
consumption and improve energy efficiency. This is important as many
consumers have little knowledge about their energy consumption.
For purposes of billing, smart-meter data is typically generated in
15-minute intervals. That is, the meter transfers the accumulated con-
sumption of the consumer every 15 minutes to the metering operator.
Technically, smart meters can increase the temporal resolution of con-
sumption data, for example, measure the consumption within every sec-
ond or minute. This allows one to obtain a detailed picture of the energy
consumption—down to the identification of individual devices (e.g., cof-
fee machines), as each device has its typical load curve. Such fine-grained
data could also be transferred to a metering operator. In addition, meter-
ing data at any granularity can be made available within a home network,
for example to be accessed via visualization tools on a tablet computer.
Access to consumption profiles for energy consumers can be more
than pure numbers or simple plots. Respective visualization tools can
easily show comparisons to previous days, weeks, etc. In the case of ser-
vice providers, they can also provide comparisons to peer groups of con-
sumers having similar households (in terms of size and appliances used).
Furthermore, devices can be identified by their load profile [113], which
can be visualized additionally. This leads to an increased awareness of the
energy consumption of each single device.
A number of studies have investigated the effects of giving users access
to smart-metering data. Schleich et al. [126] have carried out controlled
field experiments with 2,000 consumers and came to the conclusion that
giving access to detailed consumption data may lower the total energy
consumption moderately, by about 4%. Other (meta) studies suggest that
savings can be even a little higher [39, 45, 94].
the request. If the retailer or grid operator accepts the assembly of offers
(the retailer alternatively might prefer to buy energy at the exchange if this
is cheaper), the respective demand-side management companies are then
responsible for conducting the demand-response request. The companies
then send priority signals to the smart-home control boxes of their con-
tracted consumers. The control boxes send the signal to intelligent devices
at the consumer’s premises as well as the charging infrastructure of elec-
tric vehicles.
the temperature reaches −16°C and then start over again. An intelligent system would be able to interrupt the cooling at −18°C or start cooling
already at this temperature without waiting for a further rise, thus shift-
ing the power demand of the cooling device. The same scheme could
be applied to the room temperature, as the comfort range normally lies
around 21°C. Extended knowledge of user preferences could expand this
potential even further. If the resident wants the temperature to be 21°C
upon return in the evening, the system could heat the house up to 25°C in
the afternoon and let it cool down slowly to the desired 21°C. This would
require more energy in total, but could still be feasible in a future energy
system where a lot of solar energy is available (and is thus cheap) during
the day [73].
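The cooling example can be captured by a price-aware hysteresis rule; the following toy sketch (thresholds, signal names, and the price input are our assumptions) illustrates the idea:

```python
def fridge_action(temp_c, price, cheap_price, t_min=-18.0, t_max=-16.0):
    """Toy price-aware hysteresis control for a freezer: always cool when the
    upper bound is reached; pre-cool toward the lower bound when energy is
    cheap, thereby shifting demand without leaving the comfort band."""
    if temp_c >= t_max:
        return "cool"          # mandatory: upper temperature bound reached
    if price <= cheap_price and temp_c > t_min:
        return "cool"          # opportunistic pre-cooling in cheap periods
    return "off"
```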
Profiles of typical user behavior could improve the demand shifting
capabilities of a smart home even further if they were combined with an
electric vehicle (EV). Not only could the charging profile of the vehicle be
matched to the user’s and energy system’s demands, but the battery of the
EV could also be used as temporary energy storage when the vehicle is not
needed. This concept is known as vehicle to grid (see Scenario 7.5) [118].
power. The drawback of battery systems is their relatively high price. With
the anticipated rise in the market share of EVs, this could change soon.
Batteries lose capacity constantly during their lifespan. At a certain point
in time, their power density is not high enough to use them as batteries for
EVs. However, power density is negligible in the context of immobile stor-
age. Thus, battery storage facilities could benefit from a relevant market
share of EVs by reusing their old batteries.
A further storage option would be the V2G concept [118]. As vehicles
spend only about 5% of their lifetime on the road, an EV could potentially
be used 95% of the time as local energy storage. Assuming 1 million EVs
in one country (the German government, for instance, aims at reaching this
number by 2020 [3]), this could sum up to about 15 gigawatt-hours (GWh)
of storage capacity with a peak power of 3 to 20 gigawatts (GW), resem-
bling two to five typical pumped-storage water-power plants.
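A back-of-envelope check of these numbers (assuming roughly 15 kWh of usable storage and 3 to 20 kW of charging power per vehicle, consistent with the figures quoted in the text):

```python
evs = 1_000_000                 # assumed national EV fleet
usable_kwh_per_ev = 15          # assumed usable battery capacity per EV
power_kw_per_ev = (3, 20)       # assumed per-vehicle (dis)charging power

print(evs * usable_kwh_per_ev / 1e6, "GWh of storage")        # 15.0 GWh
print([evs * p / 1e6 for p in power_kw_per_ev], "GW range")   # [3.0, 20.0] GW
```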
The profitability of a storage system is, depending on the business
model, correlated with its usage, which nowadays is proportional to the
number of store-drain cycles. As battery-quality factors decrease not only
with their lifetime but also with each use cycle, the utilization of such sys-
tems must be considered with reservations.
the future energy scenarios (see Section 7.3) call for more advanced data management and data analysis than has been used in the traditional energy system (see Section 7.2). This section describes the data-analysis
challenges in the energy area and presents first solutions. In particular, we
look at data management (Section 7.4.1); data preparation (Section 7.4.2);
the wide field of predictions, forecasts, and classifications (Section 7.4.3);
pattern detection (Section 7.4.4); disaggregation (Section 7.4.5); and inter-
active exploration (Section 7.4.6). Finally, we comment on optimiza-
tion problems (Section 7.4.7) and the emerging and challenging field of
privacy-preserving data mining (Section 7.4.8).
7.4.1 Data Management
Before addressing the actual data-analysis challenges, we present some
considerations regarding data management. As motivated before, the
rise of the smart grid leads to many large and new data sources. The
most prominent sources of such data are smart meters (see Scenario 7.1).
However, there are many more data sources, ranging from dynamic prices
to data describing demand-response measures, to the use of energy stor-
ages and events in smart homes. In the following, we focus on smart-meter
data. In Section 7.4.6.1 we deal with further data-management aspects in
the context of the exploration and comparison of energy datasets.
As described previously, smart meters are able to measure energy con-
sumption and/or generation at high resolution, for example, using inter-
vals of 1 second. Figure 7.1 provides an example of such measurements.
[FIGURE 7.1: Consumed electricity over one day (time axis from 0:00 to 24:00).]
TABLE 7.1 Storage Needs for Smart-Meter Data (pure meter readings only)*

Metering       No. Measurements          Storage Need,          Storage Need,
Granularity                              1 Smart Meter          40 Million Smart Meters
               1 day       1 year        1 day      1 year      1 day       1 year
1 second       86,400      31,536,000    338 kB     120 MB      13 TB       4 PB
1 minute       1,440       525,600       6 kB       2 MB        215 GB      76 TB
15 minutes     96          35,040        384 B      137 kB      14 GB       5 TB
1 hour         24          8,760         96 B       34 kB       4 GB        1 TB
1 day          1           365           4 B        1 kB        153 MB      54 GB
1 month        —           12            —          48 B        —           2 GB
1 year         —           1             —          4 B         —           153 MB

* This does not include metadata such as date, time, and location; Schapranow et al. [125] report that the size including such data could be much larger, that is, by a factor of 12.
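The entries in Table 7.1 follow from simple arithmetic, assuming 4 bytes per raw meter reading; for example, at 1-second granularity:

```python
readings_per_day = 24 * 60 * 60      # 86,400 readings at 1-second granularity
bytes_per_reading = 4                # assumed size of one raw meter value
meters = 40_000_000

per_meter = readings_per_day * bytes_per_reading    # 345,600 B ~ 338 kB
fleet = per_meter * meters                          # ~1.4e13 B ~ 13 TB
print(per_meter / 1024, "kB per meter per day")
print(fleet / 1024**4, "TiB per day for the fleet")
```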
Data from smart meters belongs to the group of time-series data [86].
In addition to compression via regression techniques and the actual stor-
age of such data, many other data-management aspects are of importance.
This includes indices and similarity-based retrieval of time series (surveys
of these techniques can be found in Fink and Pratt [52], Hetland [64], and Vlachos et al. [134]). Such techniques are of importance for many analytical applications that are based on such data. For example, indices and similarity searches can be used to retrieve consumers with a similar elec-
tricity demand, which is important in classification and clustering (see
Sections 7.4.3.2 and 7.4.4.1, respectively). Investigating the use of the mentioned time-series techniques in the context of energy data seems promising, as they are rarely mentioned in the literature.
7.4.2 Data Preprocessing
Data preprocessing is an essential step in data-analysis projects in any
domain [30]. It deals with preparing data to be stored, processed, or ana-
lyzed and with cleaning it from unnecessary and problematic artifacts.
It has been stated that preprocessing takes 50% to 70% of the total time
of analytical projects [30]. This certainly applies to the energy domain as
well, but the exact value is obviously highly dependent on the project and
the data. In the following, we highlight some preprocessing challenges
that are specific to the energy domain. Many further data-quality issues
are present in many other domains and might be important here as well
(see, e.g., textbooks by Berthold et al. [24], Han et al. [61], and Witten et al.
[140] for further issues and techniques).
Data from smart meters frequently contains outliers. Certain outliers
refer to measurement errors rather than to real consumption, as can be
seen in the raw data visualized in Figure 7.1: the single-second peaks at roughly 04:30 and 10:00 are caused by malfunctioning measurement equipment. The meter failed for some seconds, and the accumulated consumption was reported at the next measurement point. Such outliers must be eliminated if certain functions
need to be applied afterward. For example, calculating the maximum con-
sumption of uncleaned data in Figure 7.1 would not be meaningful. Other
outliers might refer to atypical events or days: Consumption patterns of
energy might differ significantly when there is, for example, a World Cup
final on TV or if a weekday becomes a public holiday. (Figure 7.2(b) illus-
trates that load profiles at weekdays and weekends are quite different.)
Such exceptional consumption patterns should not be used as a basis for
FIGURE 7.2 Typical aggregated demand curves: (a) winter and summer electricity demand over one day; (b) winter and summer demand over seven consecutive days. (Data from Metered Half-Hourly Electricity Demands [8].)
granularity does not take its measurements exactly on the quarter hour. Both cases might be negligible in certain situations but must be tackled in others. While one missing second might seem insignificant, ignoring it could be problematic in light of laws on weights and measures. Billing in the presence of dynamic energy prices (see Scenario 7.2) might require measurements at exact points in time. If measurements are, say, 5 minutes delayed, this could make a significant difference (e.g., when energy-intensive processes are scheduled to start at the beginning of a cheap time span). A possible solution for the first problem would be to add the missing measurements to, or subtract the additional measurements from, the neighboring ones. The second problem might be solved using regression techniques that enable estimations of measurements at arbitrary points in time.
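As an illustration of the second solution, the following sketch (with hypothetical readings; a real system would use a fitted regression model rather than plain interpolation) estimates cumulative meter readings at the exact quarter-hour marks:

```python
import numpy as np

# Hypothetical cumulative readings whose timestamps drift around the
# quarter hour. Because the cumulative reading is monotone, simple
# interpolation (a stand-in for a fitted regression model) gives an
# estimate at the exact quarter-hour marks.
t_actual = np.array([0.0, 905.0, 1798.0, 2703.0])   # seconds since start
kwh_cum = np.array([0.00, 0.52, 1.01, 1.55])        # cumulative kWh

t_exact = np.array([0.0, 900.0, 1800.0, 2700.0])    # exact quarter hours
kwh_at_exact = np.interp(t_exact, t_actual, kwh_cum)
interval_kwh = np.diff(kwh_at_exact)                # consumption per interval
print(interval_kwh)
```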
7.4.3.1 Time-Series Forecasting
As mentioned previously, numerical predictions of time- dependent
data—also called time-series forecasting—are crucial in today’s and future
energy scenarios. In the following, we list a number of scenarios where
this is the case:
the respective volumes. This requires knowing how much load can
potentially be shifted.
• Predicting energy storage capacities is helpful in storage scenarios (see
Scenario 7.5). As storage operators typically aim to maximize profit
by means of energy trading (see Section 7.2.2), they need to know the
future capacities. This can be an input for optimization algorithms that
determine the scheduling of filling-up and emptying an energy storage.
• Predicting energy prices is certainly not easy, but there might be some regularity in energy prices that facilitates forecasting. Concretely,
the following two directions are of interest: (1) If one knows the
future energy prices with a certain probability in energy trading (see
Section 7.2.2), then one can obviously reap large benefits. For exam-
ple, in the presence of demand response (see Scenarios 7.2 and 7.3),
one can shift the loads of customers to cheaper points in time. (2) In
the presence of dynamic prices (see Scenario 7.2) that are not known
long in advance, one can make one’s own predictions of the energy
price and speculatively adjust the consumption. This could be done,
in particular, in highly automated smart homes (see Scenario 7.4).
All these scenarios are different, but they deal with the same problem
from a technical point of view: time-series forecasting. However, the differ-
ent scenarios require different data. Historical generation and consump-
tion data from smart meters—possibly aggregated from many of them—is
the basis for most scenarios. Other scenarios rely on historical storage
capacities, data on demand-response measures conducted in the past, or
energy prices, or they require external data such as weather forecasts for
predicting renewable generation. In the following, we focus on predic-
tions of consumption. The other predictions mentioned previously can be
treated in a similar way with their own specific data.
Predictions and forecasts can generally be done by learning from
historical data. In the case of energy consumption, this is a promising
approach, as there are certain regularities in the data: (1) The consumption
within 1 day is typically similar. People get up in the morning and switch
the light on, they cook at noon and watch TV in the evening. Figure 7.2(a)
illustrates two typical demand curves during two different days, aggre-
gated for all consumers in the United Kingdom (UK). (2) The consump-
tion on weekdays is typically similar, while it is different on weekends
and national holidays where, for example, factories are not running.
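A minimal sketch of how such regularities can be exploited is a "similar-day" forecast (the data and the function below are hypothetical illustrations, not a method from the literature cited in this chapter):

```python
import numpy as np

# Hypothetical hourly load for 28 days with a stylized daily shape.
rng = np.random.default_rng(0)
n_days = 28
profile = 1.0 + np.sin(np.linspace(0.0, 2.0 * np.pi, 24))
load = np.tile(profile, n_days) + 0.1 * rng.standard_normal(24 * n_days)
is_weekend = np.array([(d % 7) >= 5 for d in range(n_days)])

def similar_day_forecast(load_hourly, day_is_weekend, target_is_weekend, k=4):
    """Forecast each hour of the next day as the mean of the same hour
    on the k most recent days of the same type (weekday/weekend)."""
    days = load_hourly.reshape(-1, 24)
    same_type = days[day_is_weekend == target_is_weekend]
    return same_type[-k:].mean(axis=0)

forecast = similar_day_forecast(load, is_weekend, target_is_weekend=False)
```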
of consumption of energy, many of them can also be used for other pre-
dictions. In addition to more general reviews [16, 36, 66], Aggarwal et al.
[14] reviews price-forecasting techniques in particular. Another direction of
work is the forecast of wind-power production [20, 84, 89]. The application
of some of the above-mentioned time-series forecast techniques has been
investigated in this context for both short- and long-term predictions, based
on data from wind-energy production and meteorological observations.
Dannecker et al. [37] is a study on hierarchical distributed forecasting of
time-series of energy demand and supply. Such approaches tackle explic-
itly the huge amounts of data that might need to be considered when mak-
ing forecasts at higher levels, such as a whole country (see Section 7.4.1). In
addition to distributed forecasting, the authors also deal with the important problem of forecast-model maintenance and the reuse of previous models and their parameter combinations [38].
Time-series forecasting seems to be quite a mature field, but it is still
a challenge for the future energy domain. It has been applied to forecast-
ing demand, generation, and prices, but there is little literature available
regarding the other future energy scenarios listed above. Particularly in
light of dynamic pricing (see Scenario 7.2), other demand-response mea-
sures (e.g., Scenario 7.3), energy storage (see Scenario 7.5) and distributed
and volatile small-scale generation (see Section 7.1), predictions of con-
sumer demand, grid usage, etc., have become much more challenging. This
is because many more factors than pure historical time series are needed
to make accurate predictions. Section 7.4.3.2 sheds some light on the
human factor, but many further factors must be integrated in an appropri-
ate way to achieve the high-quality forecasts that are needed in the smart
grid. (Many future energy scenarios require extremely high accuracies of
predictions; that is, even small deviations from the optimal result may
cause huge costs.) This calls for more research on the question of which
factors are useful in which situation and which forecast model (or ensem-
ble thereof) to use for which task when certain data are available. These
questions can certainly not be answered in general and must be addressed
individually. However, some guidance and experience would be of high
practical relevance for new smart-grid scenarios.
Massively distributed generation and new loads (see Section 7.1) can
lead to problematic grid situations. Detecting and predicting such events
is a major topic in smart grids. Chapter 8 in this book elaborates this in a
comprehensive way.
From a technical point of view, the mentioned challenges can be divided
into two parts: (1) prediction of events and (2) classification of consump-
tion patterns. Abundant research has been conducted in the field of pattern
detection from smart-meter data. This has been published partly in the pri-
vacy domain [96, 113] (see Section 7.4.8). Pattern detection is also a basic
block for disaggregation techniques, which we describe in more detail along
with the techniques in Section 7.4.5. Early works have already shown that
the electricity consumption of a whole house can be disaggregated, with
high accuracy, into the major appliances [49]. Nizar et al. [108] is a sur-
vey of load profiling methods. Event prediction has received less attention
in the context of energy. While traditional techniques such as sequence
mining [62] can be used in principle to predict discrete events [46], fur-
ther techniques from machine learning have been adapted recently. For
instance, Savio et al. [124] performs event prediction in the field of electric-
ity consumption with neural networks and support vector machines.
To summarize, there is a huge need for the prediction of events and
for the classification and prediction of consumption patterns. On the
one side, quite a bit of research has been conducted in pattern detection
(classification of patterns), partly in the context of disaggregation (see
Section 7.4.5). On the other side, techniques for predicting consumption
patterns and events of user behavior can still be improved for applica-
tion in the field of future energy. Given the clear demand for accurate techniques, such research would be an opportunity to support the development of the smart grid significantly.
7.4.4.1 Clustering
Let us now abstract from these individual scenarios and discuss some
well-known techniques in pattern detection. Clustering is an unsuper-
vised data-mining task for the grouping of objects based on their mutual
similarity [61]. Algorithms can detect groups of similar objects. Thus, they
separate pairs of dissimilar objects into different groups. A large variety of
approaches have been proposed for convex clusters [43, 92], density-based
clusters [48, 65], and spectral clustering [106, 107]. Further extensions
have been proposed for specific data types such as time series [76, 114]. All
these approaches differ in their underlying cluster definitions. However,
they have one major property in common: they all output a single set of
clusters, that is, one partitioning of the data that assigns each object to a
single cluster [102].
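As a concrete illustration of such a single-clustering solution, the following sketch groups synthetic normalized daily load profiles with k-means (scikit-learn is assumed to be available; the data and the choice of k are hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic archetypes (consumption peaking at hours 8, 12, and 18).
rng = np.random.default_rng(1)
archetypes = np.stack([np.roll(np.eye(24)[8], s) for s in (0, 4, 10)])
profiles = archetypes[rng.integers(0, 3, 300)] + 0.05 * rng.random((300, 24))
profiles /= profiles.sum(axis=1, keepdims=True)   # normalize total energy

# One partitioning of the customers, each assigned to a single cluster.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(profiles)
```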
We now discuss this single clustering solution for customer segmenta-
tion based on smart-meter data. One has a given database of customers
(objects) that are described by a number of properties (attributes). These
attributes can be various types of information derived from smart-meter
measurements (see Scenario 7.1). For example, each customer has a certain
set of devices. For each device, one might detect its individual energy con-
sumption and additional information about the time points when these
devices are used in the household [78] (see Sections 7.4.3.3 and 7.4.5 for
further details about the identification of devices). Obviously, one can
detect groups of customers owning different types of devices. This group-
ing can be used to separate customers into different advertisement cam-
paigns (expensive devices, low-budget devices, energy-efficient devices,
and many more). However, in contrast to this simple partitioning, one
might be interested in several other groupings: Each customer is part of
groups with respect to the daily profile (early leaving, home office, part-
time working), or with respect to a current living situation (single house-
hold; family without children, with children, elderly people). This example
7.4.4.2 Outlier Mining
In contrast to clusters (groups of similar objects), outliers are highly devi-
ating objects. Outliers can be rare, unexpected, and suspicious data objects
in a database. They can be detected for data cleaning, but in many cases
they provide additional and useful knowledge about the database. Thus,
pattern detection considers outliers as very valuable patterns hidden in
today’s data. In our previous example, suspicious customers might be
detected that deviate from the residual customers. Considering the neigh-
boring households, one might observe very high energy consumption for
heating devices. While all other households in this neighborhood use oil
or gas for heating, the outlier is using electric heating. There have been
different outlier detection paradigms proposed in the literature to detect
such outliers. Techniques range from deviation-based methods [119], to
distance-based methods [80], to density-based methods [27]. For exam-
ple, density-based methods compute a score for each object by measuring
its degree of deviation with respect to a local neighborhood. Thus, one is
able to detect local density variations between low-density outliers and their
high-density (clustered) neighborhood. Note that in our example, the neigh-
borhood has been literally the geographic neighborhood of the household.
However, it can be an arbitrary neighborhood considering other attributes
(e.g., similarity to other customers with respect to the set of devices used).
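A minimal sketch of density-based outlier scoring in this spirit uses the local outlier factor as implemented in scikit-learn (the two features and the injected electric-heating household are hypothetical):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical features per household: average daily consumption (kWh)
# and peak load (kW); one injected electric-heating household.
rng = np.random.default_rng(2)
X = rng.normal(loc=[10.0, 2.0], scale=0.8, size=(200, 2))
X[0] = [28.0, 7.5]

lof = LocalOutlierFactor(n_neighbors=20)
flags = lof.fit_predict(X)                 # -1 marks detected outliers
scores = -lof.negative_outlier_factor_     # larger score = more outlying
```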
Similar to the clustering task, we observe open challenges in online
stream analysis for outlier detection [9], the detection of local outliers
in subspace projections [12, 103], and the scalability to large and high-
dimensional databases [40]. An additional challenge is the description
of such outlier patterns. Most approaches focus only on the detection of
highly deviating objects. Only a few consider their description [18, 79].
Similar to subspace clustering, it seems very promising to select relevant
7.4.5 Disaggregation
For achieving energy efficiency, deep knowledge concerning the dis-
tribution of the consumed power among the devices within a facility is
important (see Scenarios 7.1 and 7.6). In practice, this is often achieved by
installing metering devices directly on single devices, which is expensive,
time-consuming, and usually not exhaustive. It would be easier to derive
the power distribution from the metered data at the interface to the grid
(see also Section 7.4.3.3).
Smart metering, that is, high-resolution metering and remote trans-
mission of metered data, promises to provide that deep look into the
infrastructure at all metering points. Techniques for achieving this are
commonly called nonintrusive (appliance) load monitoring (NILM, some-
times also NALM) or disaggregation of power data. This has potential
applications in achieving better energy efficiency (see Scenarios 7.1 and 7.6)
and in facilitating demand response (see Scenarios 7.2 and 7.3) and load
management (e.g., in a smart home, see Scenario 7.4). Thus, the topic has
recently attracted increased interest [31, 53, 55, 78, 82, 91, 142], building on research (including, e.g., [49, 87]) conducted since the first paper was published in 1992 [63].
Common smart meters in residential and industrial environments are
placed at the interface to the distribution grid. They measure the active and
reactive energy used by all the devices connected to the electric circuit that
originates at the meter. Additional values can be measured, such as peak
loads. Multiple meters can be installed at a single facility, which is usually
the case if separated billing of the consumed energy is required. For billing purposes, such meters typically record the consumed energy in intervals of 15 minutes. However, an interface with a higher temporal resolution
is usually provided at the meter itself, which can be accessed locally.
As these meters are increasingly available, it is tempting to also use the
metered data for analytical purposes. In a residential setting, transparency
of energy consumption may lead to energy conservation (see Scenario 7.1).
NILM has also been proposed as a tool for verifying the effectiveness
of demand-response measures. In industrial or commercial settings, an
energy audit is a valuable tool for identifying potentials for energy effi-
ciency (see Scenario 7.6). Such audits can be executed more thoroughly
the more detailed information is available. The (temporary) installation
of sub-meters is therefore commonly practiced and could be, at least par-
tially, substituted by NILM.
Given only the resulting sum value over time, we are looking for a state
matrix that contains the state of each device at any discrete point in time.
The state spaces of the devices are independent of each other. For most
practical devices, there exist several constraints on the possible states and
the state transitions that are caused by their internal electrical structure
and their usage modes. For example, all practical devices are operating
between a minimum load and a maximum load, and they have finitely
many operating states.
There are two fundamental steps to be made for load disaggregation.
The first step is feature recognition, which extracts features from the
observed meter data. The second step is the application of an optimization
algorithm that assigns values to the state matrix.
Pattern recognition is being applied to the observed values (see
Section 7.4.3.3), in its simplest form to a change in the real power load.
The objective of this step is to identify a set of devices that may exhibit
the observed pattern. A naïve algorithm could map a load change to the
device that exhibits the closest step size of the observed change. An ideal
algorithm would perfectly identify the cause of an observed event as either
a fluctuation not caused by a state change, or the very device and its state
change that caused the change. However, no such perfect algorithm exists today, and false positives and false negatives are unavoidable.
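The naïve algorithm just described can be sketched in a few lines (device step sizes and the tolerance threshold are hypothetical; real NILM systems must cope with noise, overlapping events, and variable loads):

```python
import numpy as np

# Hypothetical device step sizes in watts.
device_steps = {"fridge": 120.0, "kettle": 2000.0, "washer_heating": 2400.0}

def match_event(delta_watts, tolerance=0.15):
    """Map a load change to the device with the closest known step size;
    reject the match if the deviation exceeds the tolerance."""
    name, step = min(device_steps.items(),
                     key=lambda kv: abs(abs(delta_watts) - kv[1]))
    if abs(abs(delta_watts) - step) > tolerance * step:
        return None   # treat as a fluctuation, not a state change
    return (name, "on" if delta_watts > 0 else "off")

power = np.array([300.0, 300.0, 2310.0, 2295.0, 310.0, 305.0])
events = [match_event(d) for d in np.diff(power) if abs(d) > 50]
print(events)   # kettle on, then kettle off
```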
There are a variety of features that can be used to find a valid disaggre-
gation. The most basic feature of a device is its load variance, which was
used by Hart [63]. Based on this feature, four classes of devices can be
identified: permanent, on-off, multi-state, and variable. Permanent devices
are single-state and are consuming the same load at all times (e.g., alarm
systems that are never switched off). On-off devices have an additional off-
state, where consumption is (near-)zero. Multi-state devices have multiple
operating modes that are usually executed in certain patterns; for example,
a washing machine has certain modes, such as heating water, pumping, or
spinning. Variable load devices expose arbitrary, irregular load patterns that
may depend on their actual usage mode. It is important to note that most
practical devices cannot be fully characterized by one of these classes alone.
Usually, a device exhibits behavior that is a complex mixture of these classes.
The challenge of disaggregating such loads is complicated by the fact that, of
course, the complex load profiles of devices are superimposed on each other,
which makes an accurate, nonambiguous disaggregation difficult to achieve.
Because basic features, which are also referred to as macroscopic fea-
tures, such as consumption or real and reactive power, have their limita-
tions, features on the microscopic level have been studied in order to obtain
more accurate results [142]. Microscopic features refer to characteristics of
the underlying electrical signal, which can be measured at frequencies of
at least in the kilohertz range. This allows for identification of waveform
patterns and the harmonics of the signal. Using these features yields bet-
ter results than disaggregation based on basic features alone. However,
such measurements require dedicated hardware and additional process-
ing capacities, which limits their practical use.
The optimization step (which is a common task in data analysis; see
Section 7.4.7) tries to find an assignment to the state matrix that best
matches the observed values. This answers the question of which device
was active during which period and at which power level.
A common approach to finding the state matrix is to create a hidden
Markov model (HMM) of the system [31, 78, 82]. Each device is repre-
sented by an HMM, which is a flexible structure that can capture complex
behavior. Roughly, a sharp change in power consumption corresponds to
a state change within a device HMM. The challenge is to extract the HMM
parameters from the observed meter data. This is often supported by a
supervised training phase where known features are being used.
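To make the HMM view concrete, the following toy sketch decodes the most likely on/off sequence of a single device from noisy power readings with the Viterbi algorithm (all parameters are illustrative; practical systems combine many devices, e.g., in factorial HMMs, and learn the parameters from data):

```python
import numpy as np

# One on/off device; states emit Gaussian power readings (watts).
states = ["off", "on"]
mu = np.array([5.0, 2000.0])                      # mean power per state
sigma = 60.0
log_A = np.log(np.array([[0.95, 0.05],            # state-transition matrix
                         [0.10, 0.90]]))
log_pi = np.log(np.array([0.5, 0.5]))             # initial state distribution

def viterbi(obs):
    def log_emit(x):                               # Gaussian log-likelihood
        return -0.5 * ((x - mu) / sigma) ** 2      # (constant terms dropped)
    delta = log_pi + log_emit(obs[0])
    back = []
    for x in obs[1:]:
        scores = delta[:, None] + log_A            # scores[i, j]: i -> j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_emit(x)
    path = [int(delta.argmax())]
    for b in reversed(back):                       # backtrack
        path.append(int(b[path[-1]]))
    return [states[s] for s in reversed(path)]

print(viterbi(np.array([10.0, 1980.0, 2050.0, 15.0])))  # off, on, on, off
```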
7.4.5.2 Practical Applications
Accurate load disaggregation could replace sub-metering, at least for some
applications. But even with the currently available level of accuracy, use-
ful applications seem feasible. For example, Chen et al. [31] uses meter
data from water consumption to identify activities such as showering or
washing. This work improves results by evaluating the specific context in
which load disaggregation is being used. Usage patterns depending on
the time of day, household size, and demographics help to derive statisti-
cal information about appliance use, such as the distribution of washing
machine usage. Reportedly, it also helped people make decisions about
more efficient resource usage, for example, by replacing appliances with
more efficient ones.
It remains a challenge to improve the accuracy of NILM for practi-
cal applications. Many studies assume that the features of the involved
devices are known in advance. In such supervised settings, it is necessary
For example, if we look at the energy production in each month, one could
detect a high peak in August, which deviates from the residual months
due to some unexpected high-energy production. The same statistics can
be applied for all August months over several years and highlight a spe-
cific year. This leads to a very promising selection of attribute combina-
tions, each with a high deviation in energy production. Overall, these
unexpected measures can be seen as candidates for manual exploration.
One can provide some of these attribute combinations to the user, and he
or she will be able to refine these selections.
Further techniques have been proposed for guided OLAP [121, 123]; they
focus more on the interaction with the users and provide additional means
for step-by-step processing through the OLAP cube and additional descriptions of the deviations in the data. However, all these techniques are expensive
in terms of computation. Similar to other automatic data-analysis tech-
niques, they do not scale to energy databases with many attributes and mil-
lions of measurements on the very fine-grained level. Applications of such
techniques are always limited by efficiency, and energy data pose one of the
most problematic application areas with respect to scalability issues.
7.4.7 Optimization
In the context of future energy and smart grids, there are a large number
of different optimization problems that must be solved. As elaborating on
all these problems would be beyond the scope of this chapter, we limit ourselves to highlighting the most important ones.
Optimization problems in the field of electricity can be roughly partitioned into the demand side and the supply side:
which persons are at home and at what times, if they prepare cold or warm
breakfast, when they are cooking, and when they watch TV or go to bed
[96]. Jawurek et al. [69] furthermore show that consumption curves of a
household are typically unique and can be used to identify a household.
There is a myriad of work that identifies the different scenarios of pri-
vacy risks and attacks in the field of energy; an overview can be found,
for example, by Khurana et al. [77]. A smaller number of studies propose
particular solutions, mostly for specific problems such as billing in the
presence of smart meters [68]. However, this field is still quite young, and
effective methods to provide privacy protection are still needed, ones that
can easily be applied in the field. In addition to the privacy of consum-
ers, such methods need to ensure that all actors in the energy market can
obtain the data they need in order to efficiently fulfill their respective role
in the current and future energy scenarios. This calls for further develop-
ments and new techniques in the fields of security research and privacy-
preserving data mining [13, 131, 133], for which future energy systems and
markets are an important field of application.
7.5 CONCLUSIONS
The traditional energy system relying on fossil and nuclear sources is not
sustainable. The ongoing transformation to a more sustainable energy
system relying on renewable sources leads to major challenges and to a
paradigm shift from demand-driven generation to generation-driven
demand. Further influential factors in the ongoing development are lib-
eralization and the effects of new loads, such as electric vehicles. These
developments in the future energy domain will be facilitated by a number
of techniques that are frequently referred to as the smart grid. Most of
these techniques and scenarios lead to new sources of data and to the challenge of managing and analyzing them in appropriate ways.
In this chapter we highlighted the current developments toward a sus-
tainable energy system. We provided an overview of the current energy
markets and described a number of future energy scenarios. Based on
these elaborations, we derived the data-analysis challenges in detail. In
a nutshell, the conclusion is that there has been a lot of research but that
there are still many unsolved problems and there is a need for more data-
analysis research. Existing techniques can be applied or need to be further
developed for use in the smart grid. Thus, the future energy domain is an
important field for applied data-analysis research and has the potential to
contribute to sustainable development.
ACKNOWLEDGMENTS
We thank Pavel Efros for his assistance, Anke Weidlich and many col-
leagues at SAP Research, and Acteno Energy for fruitful discussions and
proofreading (parts of) the chapter.
REFERENCES
1. Directive 2009/72/EC of the European Parliament and of the Council of
13 July 2009 Concerning Common Rule for the Internal Market in Electricity.
Official Journal of the European Union, L 211: 56–93, 2009.
2. E-Energy Glossary. Website of the DKE—Deutsche Kommission Elektrotechnik
Elektronik Informationstechnik im DIN und VDE, Germany: https://
teamwork.dke.de/specials/7/Wiki_EN/Wiki Pages/Home.aspx, 2010.
3. Energy Concept for an Environmentally Sound, Reliable and Affordable
Energy Supply. Publication of the German Federal Ministry of Economics
and Technology and the Federal Ministry for the Environment, Nature
Conservation and Nuclear Safety, September 2010.
4. MeRegio—Project Phase 2. Homepage of the MeRegio project: http://www.meregio.de/en/?page=solution-phasetwo, 2010.
5. Annual Report 2010. Publication of the German Federal Motor Transport
Authority, 2011.
6. Connecting markets. EEX Company and Products brochure, European
Energy Exchange AG, October 2011.
7. Federal Environment Minister Röttgen: 20 Percent Renewable Energies Are
a Great Success. Press Release 108/11 of the German Federal Ministry for the
Environment, Nature Conservation and Nuclear Safety, August 2011.
8. Metered Half-Hourly Electricity Demands. Website of National Grid, UK:
http://www.nationalgrid.com/uk/Electricity/Data/Demand+Data/, 2011.
9. Charu C. Aggarwal. On abnormality detection in spuriously populated data
streams. In International Conference on Data Mining (SDM), 2005.
10. Charu C. Aggarwal, Editor. Data Streams: Models and Algorithms, Volume
31 of Advances in Database Systems. Berlin and New York: Springer, 2007.
11. Charu C. Aggarwal. On High Dimensional Projected Clustering of Uncertain
Data Streams. In International Conference on Data Engineering (ICDE), 2009.
12. Charu C. Aggarwal, and Philip S. Yu. Outlier detection for high dimensional
data. In International Conference on Management of Data (SIGMOD), 2001.
13. Charu C. Aggarwal, and Philip S. Yu, Editors. Privacy-Preserving Data Mining:
Models and Algorithms, Volume 34 of Advances in Database Systems. Berlin
and New York: Springer, 2008.
14. Sanjeev Kumar Aggarwal, Lalit Mohan Saini, and Ashwani Kumar. Electricity
Price Forecasting in Deregulated Markets: A Review and Evaluation.
International Journal of Electrical Power and Energy Systems, 31(1): 13–22, 2009.
15. Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar
Raghavan. Automatic subspace clustering of high dimensional data for data
mining applications. In International Conference on Management of Data
(SIGMOD), 1998.
30. Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas
Reinartz, Colin Shearer, and Rüdiger Wirth. CRISP-DM 1.0. Step-by-Step
Data Mining Guide, SPSS, Chicago, USA. August 2000.
31. Feng Chen, Jing Dai, Bingsheng Wang, Sambit Sahu, Milind Naphade, and
Chang-Tien Lu. Activity Analysis Based on Low Sample Rate Smart Meters. In
International Conference on Knowledge Discovery and Data Mining (KDD), 2011.
32. Jiyi Chen, Wenyuan Li, Adriel Lau, Jiguo Cao, and Ke Wang. Automated
Load Curve Data Cleansing in Power Systems. IEEE Transactions on Smart
Grid, 1(2): 213–221, 2010.
33. Chun-Hung Cheng, Ada Waichee Fu, and Yi Zhang. Entropy-Based
Subspace Clustering for Mining Numerical Data. In International Conference
on Knowledge Discovery and Data Mining (KDD), 1999.
34. Robson Leonardo Ferreira Cordeiro, Agma J.M. Traina, Christos Faloutsos,
and Caetano Traina Jr. Finding Clusters in Subspaces of Very Large, Multi-
Dimensional Datasets. In International Conference on Data Engineering
(ICDE), 2010.
35. Marco Dalai and Riccardo Leonardi. Approximations of One-Dimensional
Digital Signals under the l∞ Norm. IEEE Transactions on Signal Processing,
54(8): 3111–3124, 2006.
36. Lars Dannecker, Matthias Böhm, Ulrike Fischer, Frank Rosenthal,
Gregor Hackenbroich, and Wolfgang Lehner. State-of-the-Art Report on
Forecasting—A Survey of Forecast Models for Energy Demand and Supply.
Public Deliverable D4.1, The MIRACLE Consortium (European Commission
Project Reference: 248195), Dresden, Germany, June 2010.
37. Lars Dannecker, Matthias Böhm, Wolfgang Lehner, and Gregor
Hackenbroich. Forecasting Evolving Time Series of Energy Demand
and Supply. In East-European Conference on Advances in Databases and
Information Systems (ADBIS), 2011.
38. Lars Dannecker, Matthias Schulze, Robert Böhm, Wolfgang Lehner, and
Gregor Hackenbroich. Context-Aware Parameter Estimation for Forecast
Models in the Energy Domain. In International Conference on Scientific and
Statistical Database Management (SSDBM), 2011.
39. Sarah Darby. The Effectiveness of Feedback on Energy Consumption:
A Review for DEFRA of the Literature on Metering, Billing and Direct
Displays. Technical report, Environmental Change Institute, University of
Oxford, UK, April 2006.
40. Timothy de Vries, Sanjay Chawla, and Michael E. Houle. Finding Local
Anomalies in Very High Dimensional Space. In International Conference on
Data Mining (ICDM), 2010.
41. Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing
on Large Clusters. In Symposium on Operating Systems Design and
Implementation (OSDI), 2004.
42. Kalyanmoy Deb, Samir Agrawal, Amrit Pratap, and T. Meyarivan. A Fast
Elitist Non-Dominated Sorting Genetic Algorithm for Multi-Objective
Optimization: NSGA-II. In International Conference on Parallel Problem
Solving from Nature (PPSN), 2000.
57. Jessica Granderson, Mary Piette, and Girish Ghatikar. Building Energy
Information Systems: User Case Studies. Energy Efficiency, 4: 17–30, 2011.
58. Ulrich Greveler, Benjamin Justus, and Dennis Löhr. Multimedia Content
Identification through Smart Meter Power Usage Profiles. In International
Conference on Computers, Privacy and Data Protection (CPDP), 2012.
59. Stephan Günnemann, Hardy Kremer, Charlotte Laufkötter, and Thomas
Seidl. Tracing Evolving Subspace Clusters In Temporal Climate Data. Data
Mining and Knowledge Discovery, 24(2): 387–410, 2011.
60. Duy Long Ha, Minh Hoang Le, and Stéphane Ploix. An approach for home load
energy management problem in uncertain context. In International Conference
on Industrial Engineering and Engineering Management (IEEM), 2008.
61. Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and
Techniques. Morgan Kaufmann, Burlington, USA, 2011.
62. Jiawei Han, Jian Pei, and Xifeng Yan. Sequential Pattern Mining by Pattern-
Growth: Principles and Extensions. In W. Chu and T. Lin, Editors, Studies in
Fuzziness and Soft Computing, Volume 180 of Foundations and Advances
in Data Mining, pages 183–220. Berlin and New York: Springer, 2005.
63. George W. Hart. Nonintrusive Appliance Load Monitoring. Proceedings of
the IEEE, 80(12): 1870–1891, 1992.
64. Magnus L. Hetland. A Survey of Recent Methods for Efficient Retrieval of
Similar Time Sequences. In Last et al. [86], Chapter 2, pages 27–49.
65. Alexander Hinneburg and Daniel Keim. An Efficient Approach to Clustering
in Large Multimedia Databases with Noise. In International Conference on
Knowledge Discovery and Data Mining (KDD), 1998.
66. Henrique Steinherz Hippert, Carlos Eduardo Pedreira, and Reinaldo Castro
Souza. Neural Networks for Short-Term Load Forecasting: A Review and
Evaluation. IEEE Transactions on Power Systems, 16(1): 44–55, 2001.
67. Vikramaditya Jakkula and Diane Cook. Outlier Detection in Smart
Environment Structured Power Datasets. In International Conference on
Intelligent Environments (IE), 2010.
68. Marek Jawurek, Martin Johns, and Florian Kerschbaum. Plug-In Privacy for
Smart Metering Billing. In International Symposium on Privacy Enhancing
Technologies (PETS), 2011.
69. Marek Jawurek, Martin Johns, and Konrad Rieck. Smart Metering De-
Pseudonymization. In Annual Computer Security Applications Conference
(ACSAC), 2011.
70. Andrej Jokić, Mircea Lazar, and Paul P.J. van den Bosch. Price-Based Control
of Electrical Power Systems. In Negenborn et al. [105], Chapter 5, pages
109–131.
71. Ian Jolliffe. Principal Component Analysis. Berlin and New York: Springer, 1986.
72. Karin Kailing, Hans-Peter Kriegel, and Peer Kröger. Density-Connected
Subspace Clustering for High-Dimensional Data. In International Conference
on Data Mining (SDM), 2004.
89. Shuhui Li, Donald C. Wunsch, Edgar O’Hair, and Michael G. Giesselmann.
Comparative Analysis of Regression and Artificial Neural Network
Models for Wind Turbine Power Curve Estimation. Journal of Solar Energy
Engineering, 123(4): 327–332, 2001.
90. Xiaoli Li, Chris P. Bowers, and Thorsten Schnier. Classification of Energy
Consumption in Buildings with Outlier Detection. IEEE Transactions on
Industrial Electronics, 57(11): 3639–3644, 2010.
91. Jian Liang, Simon K.K. Ng, Gail Kendall, and John W.M. Cheng. Load
Signature Study. I. Basic Concept, Structure, and Methodology. IEEE
Transactions on Power Delivery, 25(2): 551–560, 2010.
92. J. MacQueen. Some methods for classification and analysis of multivari-
ate observations. In Berkeley Symposium on Mathematical Statistics and
Probability, 1967.
93. Paras Mandal, Tomonobu Senjyu, Naomitsu Urasaki, and Toshihisa
Funabashi. A Neural Network Based Several-Hour-Ahead Electric Load
Forecasting Using Similar Days Approach. International Journal of Electrical
Power and Energy Systems, 28(6): 367–373, 2006.
94. Friedemann Mattern, Thorsten Staake, and Markus Weiss. ICT for green:
How computers can help us to conserve energy. In International Conference
on Energy-Efficient Computing and Networking (E-Energy), 2010.
95. Tom Mitchell. Machine Learning. New York: McGraw Hill, 1997.
96. Andrés Molina-Markham, Prashant Shenoy, Kevin Fu, Emmanuel Cecchet,
and David Irwin. Private memoirs of a smart meter. In Workshop on
Embedded Sensing Systems for Energy-Efficiency in Building (BuildSys), 2010.
97. Sanaz Mostaghim and Jürgen Teich. Strategies for finding good local guides
in multi-objective particle swarm optimization (MOPSO). In Swarm
Intelligence Symposium (SIS), 2003.
98. Emmanuel Müller, Ira Assent, Stephan Günnemann, and Thomas Seidl.
Scalable density-based subspace clustering. In International Conference on
Information and Knowledge Management (CIKM), 2011.
99. Emmanuel Müller, Ira Assent, Ralph Krieger, Stephan Günnemann, and
Thomas Seidl. DensEst: Density estimation for data mining in high dimen-
sional spaces. In International Conference on Data Mining (SDM), 2009.
100. Emmanuel Müller, Ira Assent, and Thomas Seidl. HSM: Heterogeneous sub-
space mining in high dimensional data. In Scientific and Statistical Database
Management (SSDBM Conference Proceedings), 2009.
101. Emmanuel Müller, Stephan Günnemann, Ira Assent, and Thomas Seidl.
Evaluating clustering in subspace projections of high dimensional data. In
International Conference on Very Large Data Bases (VLDB), 2009.
102. Emmanuel Müller, Stephan Günnemann, Ines Färber, and Thomas Seidl.
Discovering multiple clustering solutions: Grouping objects in different
views of the data. In International Conference on Data Mining (ICDM), 2010.
103. Emmanuel Müller, Matthias Schiffer, and Thomas Seidl. Statistical selec-
tion of relevant subspace projections for outlier ranking. In International
Conference on Data Engineering (ICDE), 2011.
118. Ulrich Reiner, Thomas Leibfried, Florian Allerding, and Hartmut Schmeck.
Potential of electrical vehicles with feed-back capabilities and controllable
loads in electrical grids under the use of decentralized energy management.
In International ETG Congress, 2009.
119. Peter J. Rousseeuw and Annick M. Leroy. Robust Regression and Outlier
Detection. New York: Wiley, 1987.
120. Sebnem Rusitschka, Kolja Eger, and Christoph Gerdes. Smart grid data
cloud: A model for utilizing cloud computing in the smart grid domain. In
International Conference on Smart Grid Communications (SmartGridComm),
2010.
121. Sunita Sarawagi. User-adaptive exploration of multidimensional data. In
International Conference on Very Large Data Bases (VLDB), 2000.
122. Sunita Sarawagi, Rakesh Agrawal, and Nimrod Megiddo. Discovery-driven
exploration of OLAP data cubes. In International Conference on Extending
Database Technology (EDBT), 1998.
123. Gayatri Sathe and Sunita Sarawagi. Intelligent rollups in multidimensional
OLAP data. In International Conference on Very Large Data Bases (VLDB), 2001.
124. Domnic Savio, Lubomir Karlik, and Stamatis Karnouskos. Predicting
energy measurements of service-enabled devices in the future smartgrid. In
International Conference on Computer Modeling and Simulation (UKSim), 2010.
125. Matthieu-P. Schapranow, Ralph Kühne, Alexander Zeier, and Hasso Plattner.
Enabling real-time charging for smart grid infrastructures using in-memory
databases. In Workshop on Smart Grid Networking Infrastructure, 2010.
126. Joachim Schleich, Marian Klobasa, Marc Brunner, Sebastian Gölz, and
Konrad Götz. Smart Metering in Germany and Austria: Results of Providing
Feedback Information in a Field Trial. Working Paper Sustainability and
Innovation S 6/2011, Fraunhofer Institute for Systems and Innovation
Research (ISI), Karlsruhe, Germany, 2011.
127. Raimund Seidel. Small- Dimensional Linear Programming and Convex
Hulls Made Easy. Discrete & Computational Geometry, 6(1): 423–434, 1991.
128. Kelvin Sim, Vivekanand Gopalkrishnan, Arthur Zimek, and Gao Cong.
A Survey on Enhanced Subspace Clustering. Data Mining and Knowledge
Discovery, 2012.
129. Myra Spiliopoulou, Irene Ntoutsi, Yannis Theodoridis, and Rene Schult.
MONIC: Modeling and monitoring cluster transitions. In International
Conference on Knowledge Discovery and Data Mining (KDD), 2006. Available
online: http://link.springer.com/article/10.1007/s10618-012-0258-x. DOI: 10.1007/s10618-012-0258-x.
130. Asher Tishler and Israel Zang. A Min-Max Algorithm for Non-Linear
Regression Models. Applied Mathematics and Computation, 13(1/2): 95–115,
1983.
131. Jaideep Vaidya, Yu Zhu, and Christopher W. Clifton. Privacy Preserving Data
Mining, Volume 19 of Advances in Information Security. Berlin and New
York: Springer, 2006.
132. Sergio Valero Verdú, Mario Ortiz García, Carolina Senabre, Antonio Gabaldón
Marín, and Francisco J. García Franco. Classification, Filtering, and
Identification of Electrical Customer Load Patterns through the Use of Self-
Organizing Maps. IEEE Transactions on Power Systems, 21(4): 1672–1682,
2006.
133. Vassilios S. Verykios, Elisa Bertino, Igor Nai Fovino, Loredana Parasiliti
Provenza, Yucel Saygin, and Yannis Theodoridis. State-of-the-Art in Privacy
Preserving Data Mining. SIGMOD Record, 33(1): 50–57, 2004.
134. Michail Vlachos, Dimitrios Gunopulos, and Gautam Das. Indexing Time-
Series under Conditions of Noise. In Last et al. [86], Chapter 4, pages 67–100.
135. Harald Vogt, Holger Weiss, Patrik Spiess, and Achim P. Karduck. Market-
based prosumer participation in the smart grid. In International Conference
on Digital Ecosystems and Technologies (DEST), 2010.
136. Horst F. Wedde, Sebastian Lehnhoff, Christian Rehtanz, and Olav Krause.
Bottom-up self-organization of unpredictable demand and supply under
decentralized power management. In International Conference on Self-
Adaptive and Self-Organizing Systems, 2008.
137. Anke Weidlich. Engineering Interrelated Electricity Markets. Physica Verlag, Heidelberg, Germany, 2008.
138. Anke Weidlich, Harald Vogt, Wolfgang Krauss, Patrik Spiess, Marek Jawurek,
Martin Johns, and Stamatis Karnouskos. Decentralized intelligence in
energy efficient power systems. In Alexey Sorokin, Steffen Rebennack, Panos
M. Pardalos, Niko A. Iliadis, and Mario V.F. Pereira, Editors, Handbook
of Networks in Power Systems: Optimization, Modeling, Simulation and
Economic Aspects. Berlin and New York: Springer, 2012.
139. Tom White. Hadoop: The Definitive Guide. O’Reilly, Sebastopol, USA, 2009.
140. Ian H. Witten, Eibe Frank, and Mark A. Hall. Data Mining: Practical Machine
Learning Tools and Techniques. Morgan Kaufmann, Burlington, USA, 2011.
141. G. Michael Youngblood and Diane J. Cook. Data Mining for Hierarchical
Model Creation. IEEE Transactions on Systems, Man, and Cybernetics, 37(4):
561–572, 2007.
142. Michael Zeifman and Kurt Roth. Nonintrusive Appliance Load Monitoring:
Review and Outlook. IEEE Transactions on Consumer Electronics, 57(1): 76–84, 2011.
143. Roberto V. Zicari. Big Data: Smart Meters—Interview with Markus Gerdes.
ODBMS Industry Watch Blog: http://www.odbms.org/blog/2012/06/big-data-smart-meters-interview-with-markus-gerdes/, 2012.
Chapter 8
Electricity Supply
without Fossil Fuels
John Boland, Peter Pudney, and Jerzy Filar
CONTENTS
8.1 Introduction 244
8.2 Approach 246
8.2.1 Renewable Generation 246
8.2.2 Overcoming Variability 247
8.2.2.1 Storage 247
8.2.3 Demand Management 248
8.2.4 Mechanism Design for Demand Management 249
8.3 Forecasting Wind and Solar Power 250
8.3.1 Estimating Hourly Solar Irradiation 252
8.3.2 Box-Jenkins or ARMA Modeling 254
8.3.3 A Resonating Model 254
8.3.4 Estimating Volatility 256
8.3.4.1 Using Results from Modeling
High-Frequency Wind Data to Formulate a
Model at Lower Frequency 260
8.3.4.2 Summary for Single Sources 262
8.3.4.3 Multiple Sources 263
8.4 Portfolio Analysis 264
8.5 Conclusion 268
References 269
8.1 INTRODUCTION
[Figure: per-capita emissions trajectories for the United States, the developed-country average, the EU25, China, the global average, other developing countries, and India.]
Penny Wong stated that, “We accept the science and the advice that putting
a price on carbon is the best way to reduce emissions.” This action sets up the
environment for a transformation of the electricity supply sector in Australia.
According to an article in Climate Spectator online on December 7, 2011:
The LRET requires that by 2020, 20% of electricity will be supplied from
renewable sources. This can be done in an ad hoc fashion, as is the pres-
ent situation, with wind farms dominating large-scale installations and an
intensification of photovoltaic installations on domestic houses (so much
so that in NSW, for example, there is 300 megawatts (MW) installed).
This is widely purported to be driving up the costs of electricity for the
remainder of the population. However, this conjecture is disputed by the
Australian Energy Market Commission, who estimate that the combined
costs of feed-in tariffs and renewable energy schemes will make up about
14% of future price increases, whereas the cost of reinforcing transmission
and distribution systems to cope with growth in peak demand will make
up 49% of electricity price rises over the next 3 years* [3]. To maximize
greenhouse gas reduction, a more coordinated approach is necessary.
Maximizing the penetration of renewable energy sources for supplying
electricity can be achieved through two disparate techniques: (1) building
a supergrid, or (2) mixing the renewable sources with demand manage-
ment and storage. In Europe, for example, there are proposals such as the
Desertec initiative to create a supergrid interconnecting installations of
renewable generation across Europe, North Africa, and the Middle East [21].
The underlying principle is that if you have solar farms distributed east-
west along a latitude, there is an elongation of the solar day for providing
* Another challenge for today’s grid is the growth in air-conditioning penetration. In Western
Sydney, more than 80% of homes now have air conditioning. This growth is driving energy suppli-
ers such as Integral to spend approximately $3 billion over the next 5 years on grid infrastructure,
to meet the increased peak loads. But this extra infrastructure will only be needed for a few days a
year; it’s like building a twenty-seven-lane freeway so that we never have peak-hour traffic jams.—
The Climate Spectator, 14 July 2010.
energy for the grid. In addition, wind farms distributed north-south along
a longitude can take advantage of weather systems moving cohesively at
different latitudes. This concept of using meteorological and astronomi-
cal attributes to enhance the diversity of sources requires significant aug-
mentation of the grid, including provision of high-voltage, direct-current
transmission systems to minimize distance-related losses. In this configu-
ration, even augmented by other renewable sources, there is a necessity
for some energy storage for backup and load balancing—Denmark stores
excess wind power in Norway’s hydroelectric dams through pumping.
An alternative approach is to develop a protocol for combining a renew-
able supply of diverse sources (both in technologies and locations) with
sophisticated demand management (load follows the supply rather than
the reverse) and energy storage to maximize the penetration of renew-
ables. This entails a variety of supply options, more directly embedded in
the transmission and distribution networks than the supergrid option, as
well as a variety of storage options. The task arising from this more subtle
approach is deciding on the quantities of each supply option, as well as a
storage option, where they are positioned in the system, as well as how to
best use the demand-side management. This results in a portfolio optimi-
zation problem.
In summary, in this chapter we present the mathematical modeling
tools we argue are necessary to perform a rational, sophisticated analysis
of the problem of moving to a high penetration of renewable sources in the
provision of the electricity supply.
8.2 APPROACH
8.2.1 Renewable Generation
There are many renewable energy technologies that can generate electric-
ity without consuming finite resources and without producing CO2 emis-
sions, including concentrating solar thermal plants, solar photovoltaic
panels, wind turbines, tidal generators, and biofuel generators.
Zero Carbon Australia [41] describes an ambitious plan to meet
Australia’s electricity needs using a combination of 60% concentrating
solar thermal power (with molten salt storage), 40% wind, and backup
power from hydroelectricity and biofuels. Elliston et al. [13] have shown
that it is technically feasible to supply electricity from renewables with the
same reliability as the present system.
8.2.2 Overcoming Variability
The key challenge with renewable energy sources is the variability of the
supply. Delucci and Jacobson [10] describe seven ways to cope with vari-
ability in an electricity grid powered from wind, water, and the sun:
8.2.2.1 Storage
Options for storing energy include the following:
8.2.3 Demand Management
The current electricity system is designed to supply whatever power is
demanded, with little or no coordination of loads. As a result, the system
is sized for peak demands that occur for only a few hours each year. The
mean demand for power in the Australian National Electricity Market is
70% of the peak demand; in South Australia, the mean demand is just 50%
of the peak demand. Figure 8.2 shows hourly demand for power in the
National Electricity Market for each day of 2010.
Appliances are becoming more efficient, but we are using more of
them. Increases in peak demand require upgrades to the transmission and
distribution infrastructure, which adds significantly to the cost of elec-
tricity. Furthermore, fixed retail prices isolate residential and commercial
consumers from supply constraints and the associated variations in gen-
eration costs. This problem can be partly addressed by time-of-use pric-
ing, where electricity is cheaper during periods when demand is usually
low; or by critical peak pricing, where the price to consumers varies with
the cost of supply. An energy price that increases with power use would
FIGURE 8.2 (See color insert.) Hourly demand profile in the NEM for 2010 (data
from Australian Energy Market Operator [5]).
Our efforts to date have focused on wind farm output for specific loca-
tions. We need to investigate the accumulation of wind farm output over
wide areas, considering the effects of spatial coherence. How can geo-
graphic diversity reduce variability and improve predictability? For solar
energy forecasting, we have focused on global solar irradiance, and at
single locations. The methods should be extended to include forecasting
of direct solar irradiance, necessary for photovoltaic and thermal concen-
trated solar plants (CSPs). We also need to investigate the spatial diversity
of solar irradiance, and the mix of solar and wind.
The Australian Energy Market Operator provides 5-minute energy gen-
eration data for each wind farm in Australia. Hourly wind speed data are
available from the Bureau of Meteorology (BOM) for nearby automatic
weather stations. By combining this data, we can create an empirical energy
curve for the output of existing wind farms. The results can then be used
to estimate power output, given the wind speed for proposed wind farms.
In a similar manner, we can access solar radiation values for any location
in Australia (on a 5 kilometer by 5 kilometer grid), wherein hourly irra-
diation is estimated from satellite images. This data can then be used to
estimate the energy output from various types of solar collectors, includ-
ing photovoltaic (PV) with varying orientations, concentrating PV, and
concentrating solar thermal. The BOM publishes estimates of global solar
irradiation on a horizontal plane and direct normal irradiation. Global
solar irradiation is a combination of direct beam irradiation and diffuse
irradiation. For estimating the output of concentrating PV and thermal
solar plants, we need to estimate the direct beam component. This can
be done by first using the BRL model [36] to estimate the diffuse irradia-
tion on a horizontal surface. The difference between the global and diffuse
irradiation gives the direct irradiation on a horizontal surface, from which
a straightforward trigonometric calculation gives the direct irradiance on
a surface normal to the beam. This is precisely the chain of modeling that
the BOM uses to infer solar components. They will use this procedure in
work for the Solar Flagships Program to inform industry on the siting of
solar farms. The direct beam and diffuse components can be used to esti-
mate the output of fixed, tracking, and concentrating solar generators.
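The last step of this chain can be sketched in a few lines (the diffuse fraction is assumed to come from a model such as BRL; the values are hypothetical):

```python
import numpy as np

def direct_normal(global_h, diffuse_fraction, zenith_deg):
    """Direct normal irradiance from global horizontal irradiance and a
    modeled diffuse fraction (e.g., from the BRL model)."""
    direct_h = (1.0 - diffuse_fraction) * global_h
    cos_z = np.cos(np.radians(zenith_deg))
    return np.where(cos_z > 0.05, direct_h / cos_z, 0.0)  # guard low sun

# Hypothetical hour: 650 W/m2 global, 30% diffuse, sun 40 deg from zenith.
print(direct_normal(np.array([650.0]), np.array([0.3]), np.array([40.0])))
```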
For the purpose of determining the optimal mix of generation tech-
nologies and sites, it is not necessary to develop detailed forecast mod-
els for each generator. However, we will need to rigorously determine the
stochastic nature of the output from individual sites and also of sets of
spatially diverse as well as platform-diverse sources.
8.3.1 Estimating Hourly Solar Irradiation
The seasonal component of hourly solar irradiation can be modeled with a Fourier series:

$$S_t = \alpha_0 + \alpha_1 \cos\frac{2\pi t}{8760} + \beta_1 \sin\frac{2\pi t}{8760} + \alpha_2 \cos\frac{4\pi t}{8760} + \beta_2 \sin\frac{4\pi t}{8760} + \sum_{n=1}^{3} \sum_{m=-1}^{1} \left( \alpha_{nm} \cos\frac{2\pi(365n + m)t}{8760} + \beta_{nm} \sin\frac{2\pi(365n + m)t}{8760} \right)$$

Here, α0 is the mean of the data; α1, β1 are coefficients of the yearly cycle; α2, β2 of the twice-yearly cycle; and αnm, βnm are coefficients of the daily cycle, its harmonics, and the associated beat frequencies. An inspection of the power
spectrum would show that we need to include the harmonics of the daily
cycle (n = 2, 3, as well as n = 1) and also the beat frequencies (m = −1, 1).
The latter modulate the amplitude to fit the time of year—in other words,
describe the beating of the yearly and daily cycles.
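A least-squares fit of such a Fourier series can be sketched as follows (the harmonic set mirrors the equation above; the irradiation series here is synthetic):

```python
import numpy as np

def fourier_design(t, hours_per_year=8760):
    """Design matrix with the yearly cycle, its first harmonic, and the
    daily harmonics n = 1, 2, 3 with beat frequencies m = -1, 0, 1."""
    cols = [np.ones_like(t)]
    freqs = [1, 2] + [365 * n + m for n in (1, 2, 3) for m in (-1, 0, 1)]
    for f in freqs:
        w = 2.0 * np.pi * f * t / hours_per_year
        cols += [np.cos(w), np.sin(w)]
    return np.column_stack(cols)

t = np.arange(8760.0)
S = 400.0 * np.maximum(0.0, np.sin(2.0 * np.pi * t / 24.0)) + 50.0  # toy data
coef, *_ = np.linalg.lstsq(fourier_design(t), S, rcond=None)
S_hat = fourier_design(t) @ coef        # fitted seasonal component
```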
Figure 8.3 shows the daily variation over the year for an example site.
Figure 8.4 shows 5 days of hourly solar radiation and the Fourier series
model for that variation. In Figure 8.5 we see the worth of the partic-
ular frequencies variously termed “beat frequencies” or “sidebands,”
which modulate the amplitude of the daily harmonic to suit the time of year.
FIGURE 8.3 Daily solar radiation.
[Figure 8.4: Five days of hourly solar radiation (observed series and Fourier series model).]
[Figure 8.5: Observed data, the Fourier series model, and the Fourier series model without beat frequencies.]
Note that in the examples we have tested, the amount of variance explained by the Fourier series is approximately 80% to 85%.
8.3.2 Box-Jenkins or ARMA Modeling
The deseasoned series can be modeled as an ARMA(p, q) process:

$$X_t - \phi_1 X_{t-1} - \phi_2 X_{t-2} - \cdots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \theta_2 Z_{t-2} + \cdots + \theta_q Z_{t-q}$$

where $\{X_t\}$ are random variables with $X_t \sim (0, \sigma_X^2)$ and $\{Z_t\}$ is white noise, independent and identically distributed with $Z_t \sim (0, \sigma_Z^2)$.
Figure 8.6 shows results from an AR(2) model.
FIGURE 8.6 Deseasoned data and the fitted AR(2) model.
z_{i+1} = z_i + \frac{1}{\varepsilon}\left[\kappa(z_i + f_i) - \lambda(3 f_i^2 z_i + f_i^3) - \varepsilon z_i - \gamma f_i - b\right]\Delta t + \omega_i \qquad (8.2)
Here, ωt and at are noise terms, and Δt is the time step. Equation (8.2)
aims to exploit the fact that the current value of zt is useful to predict the
future value Rt+1. The parameters κ, λ, ε, γ, and b can be estimated using
the method of ordinary least squares (OLS). For our deseasoned data, esti-
mated values for the parameters are κ = −2.1, λ = −6 × 10−8, ε = 0.09,
γ = 0.5, and b = 2. λ is virtually zero, which indicates that the deseasoned
residuals Rt behave linearly. Further to this, a negative value of κ assures
the stability of the inherent damped oscillator in Equation (8.2).
FIGURE 8.7 Deseasoned data and the Lucheroni model.
8.3.4 Estimating Volatility
Traditional methods for forecasting volatility are indirect by necessity because
the volatility is unobservable in a time series. These indirect methods include
generalized autoregressive conditional heteroscedastic (GARCH) [9] and
Hidden Markov Models (HMMs) [40]. We have developed a method for esti-
mating the volatility using high-frequency data [1], and then use the resonat-
ing model [26] for forecasting the volatility at the 5-minute and 30-minute
timescales required by the electricity market. These forecasting methods can
also be used to delineate the level and variability of output. We are thus able to
make these estimations for any proposed individual wind or solar installation.
FIGURE 8.8 (See color insert.) Fixed components added to the combination of the Lucheroni and AR(2) models.
Figures 8.9 through 8.11 show wind farm output at two different time
scales and an AR(3) model fitted to the 5-minute output.
The noise is uncorrelated but dependent. This phenomenon is preva-
lent in financial markets—it is called volatility clustering. Periods of high
volatility are followed by periods of low volatility. Engle [14] developed the
autoregressive conditional heteroscedastic (ARCH) model to cater to this.
Figure 8.11 indicates that the model will have to be a long-lag AR model.
To address this lack of parsimony, among other reasons, Bollerslev [9] developed the
generalized ARCH (or GARCH) model, in which the long-lag AR
model is replaced with a short-lag ARMA model. The residuals of the AR(3) model of
wind farm output display this conditional volatility. Often, an ARMA(1,1)
for the residuals squared is sufficient and the GARCH model is derived
from that. For this example, the GARCH model for conditional volatility
is

\sigma_t^2 = 0.006 + 0.122\, a_{t-1}^2 + 0.821\, \sigma_{t-1}^2
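A minimal sketch of fitting such a GARCH(1,1) model, using the third-party arch package (one of several implementations that could be used); the residuals file is hypothetical, and the fitted parameters correspond to the intercept and the two coefficients above.

    import numpy as np
    from arch import arch_model

    residuals = np.loadtxt("ar3_residuals.txt")   # hypothetical residual series
    am = arch_model(residuals, mean="Zero", vol="Garch", p=1, q=1)
    res = am.fit(disp="off")
    # omega, alpha[1], beta[1]: compare with 0.006, 0.122, and 0.821 in the text.
    print(res.params)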
We developed a method to estimate volatility when high-frequency
data follow an AR(p) process [1, 2]. Many researchers have made use of
high-frequency data to estimate the volatility; those approaches involve
direct computation of covariances of the high-frequency observations. Our
approach is different: we fit a model to the high-frequency data and use that
model to estimate the volatility. The following is
a description of how to use 10-second wind farm output to estimate the
volatility on a 5-minute timescale.
FIGURES 8.9 and 8.10 Wind farm output plotted against time of day, at two different timescales.
FIGURE 8.11 Five-minute wind farm output, energy (MWh) versus time: data and the fitted AR(3) model.
X_t = \alpha_1 X_{t-1} + \alpha_2 X_{t-2} + \alpha_3 X_{t-3} + Z_t

or equivalently,

\phi(B) X_t = Z_t

where \phi(B) = 1 - \alpha_1 B - \alpha_2 B^2 - \alpha_3 B^3, and B denotes the backshift operator, that is, B X_t = X_{t-1}. As \phi(B) is invertible, the process is equivalent to an infinite moving average process,

X_t = \psi(B) Z_t
FIGURE 8.12 Model and actual values.
X_t = \psi_0 Z_t + \psi_1 Z_{t-1} + \psi_2 Z_{t-2} + \psi_3 Z_{t-3} + \cdots \qquad (8.3)

\psi_j = \alpha_1 \psi_{j-1} + \alpha_2 \psi_{j-2} + \alpha_3 \psi_{j-3} \qquad (8.4)

Y_t = X_t + X_{t-\frac{1}{30}} + X_{t-\frac{2}{30}} + \cdots + X_{t-\frac{29}{30}} \qquad (8.5)
Here, X_{t-\frac{i}{30}} represents the wind energy output at the i-th 10-second interval prior to time t, so that t − 1 remains the consistent notation for 5 minutes prior to t.
Y_t = \psi_0 Z_t + (\psi_0 + \psi_1) Z_{t-\frac{1}{30}} + (\psi_0 + \psi_1 + \psi_2) Z_{t-\frac{2}{30}} + (\psi_0 + \psi_1 + \psi_2 + \psi_3) Z_{t-\frac{3}{30}} + \cdots + (\psi_0 + \psi_1 + \cdots + \psi_{29}) Z_{t-\frac{29}{30}} + (\psi_1 + \psi_2 + \cdots + \psi_{30}) Z_{t-1} + (\psi_2 + \psi_3 + \cdots + \psi_{31}) Z_{t-\frac{31}{30}} + \cdots + (\psi_{29} + \psi_{30} + \cdots + \psi_{58}) Z_{t-\frac{59}{30}} + (\psi_{30} + \psi_{31} + \cdots + \psi_{59}) Z_{t-2} + \cdots \qquad (8.6)
Note that in Equation (8.6), coefficients up to the 30th term have a differ-
ent form than those after the 30th term.
Variance σ2(Yt ) in terms of ψi values is given below. We assume that
within each 5-minute interval, the Zt values are independent and identi-
cally distributed with zero mean. The variance of Yt is thus
\sigma^2(Y_t) = \left[\psi_0^2 + (\psi_0 + \psi_1)^2 + (\psi_0 + \psi_1 + \psi_2)^2 + (\psi_0 + \psi_1 + \psi_2 + \psi_3)^2 + \cdots + (\psi_0 + \psi_1 + \cdots + \psi_{29})^2\right] \sigma^2(Z_t) + \left[(\psi_1 + \psi_2 + \cdots + \psi_{30})^2 + (\psi_2 + \psi_3 + \cdots + \psi_{31})^2 + \cdots + (\psi_{30} + \psi_{31} + \cdots + \psi_{59})^2\right] \sigma^2(Z_{t-1}) + \cdots \qquad (8.7)
Rearranging gives

\sigma^2(Y_t) = \sum_{n=0}^{29} \left( \sum_{i=0}^{n} \psi_i \right)^2 \sigma^2(Z_t) + \sum_{n=30}^{59} \left( \sum_{i=0}^{n} \psi_i - \sum_{i=0}^{n-30} \psi_i \right)^2 \sigma^2(Z_{t-1}) + \sum_{n=60}^{89} \left( \sum_{i=0}^{n} \psi_i - \sum_{i=0}^{n-30} \psi_i \right)^2 \sigma^2(Z_{t-2}) + \cdots \qquad (8.8)

The partial sums of the \psi weights can be expressed directly in terms of the AR coefficients as

\sum_{i=0}^{n} \psi_i = \sum_{k=0}^{n} \sum_{(n_1, n_2, n_3) \in A_k} \frac{(n_1 + n_2 + n_3)!}{n_1!\, n_2!\, n_3!}\, \alpha_1^{n_1} \alpha_2^{n_2} \alpha_3^{n_3}

where A_k = \{(n_1, n_2, n_3) : n_1 + 2n_2 + 3n_3 = k\}.
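A sketch of these computations follows, using the recursion of Equation (8.4) with \psi_0 = 1 and \psi_j = 0 for j < 0, and accumulating the squared partial-sum coefficients of Equation (8.8); the AR coefficients in the example call are illustrative.

    import numpy as np

    def psi_weights(alpha, n):
        """alpha = (a1, a2, a3) from the AR(3) fit; returns psi_0..psi_{n-1}."""
        psi = np.zeros(n)
        psi[0] = 1.0
        for j in range(1, n):
            for i, a in enumerate(alpha, start=1):
                if j - i >= 0:
                    psi[j] += a * psi[j - i]
        return psi

    def variance_coefficients(alpha, lags=3):
        """Coefficient multiplying sigma^2(Z_{t-s}) in Eq. (8.8), s = 0..lags-1."""
        psi = psi_weights(alpha, 30 * lags)
        cum = np.cumsum(psi)                 # cum[n] = sum of psi_0..psi_n
        coeffs = []
        for s in range(lags):
            total = 0.0
            for n in range(30 * s, 30 * (s + 1)):
                inner = cum[n] if n < 30 else cum[n] - cum[n - 30]
                total += inner ** 2
            coeffs.append(total)
        return coeffs

    print(variance_coefficients((0.5, 0.2, 0.1)))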
FIGURE 8.13 (See color insert.) Output and estimated standard deviation.
8.3.4.3 Multiple Sources
Increasing the spatial diversity of the renewable sources will help smooth
the volatility of the overall input, subject to there being enough intercon-
nectivity in the grid. One can easily compute the pairwise correlation between
two time series, but how does one evaluate the connectivity of a whole group? Getz
[20] developed the concept of correlative coherence to analyze the overall
connectedness of movements of individual elephants in a herd. A similar
method can be used to determine an overall correlation between the out-
puts of multiple wind farms [7]. First, take the correlation matrix R con-
taining the pairwise correlations between the n sources. Its eigenvalues λi,
i = 1, …, n have the properties that 0 ≤ λi/n ≤ 1, and
\sum_{i=1}^{n} \frac{\lambda_i}{n} = 1

The correlative coherence is then defined as

C(X) = 1 - \frac{1}{\ln(1/n)} \sum_{i=1}^{n} \frac{\lambda_i}{n} \ln \frac{\lambda_i}{n} \qquad (8.9)
If all the off-diagonal elements have the same value r ∈ [0,1] and the diago-
nal elements are all unity, then the eigenvalues of the correlation matrix
R(r) are λ1 = 1 + (n − 1)r and λi = 1 − r for i = 2, …, n. In this case, we have
C_n(r) = \frac{\big(1 + (n-1)r\big) \ln\big(1 + (n-1)r\big) + (n-1)(1-r) \ln(1-r)}{n \ln n} \qquad (8.10)
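A sketch of the correlative coherence computation of Equation (8.9), applied to a matrix of output series (rows are time points, columns are wind farms); the synthetic sanity checks at the end are illustrative.

    import numpy as np

    def correlative_coherence(series):
        R = np.corrcoef(series, rowvar=False)   # n x n correlation matrix
        n = R.shape[0]
        lam = np.linalg.eigvalsh(R)
        p = np.clip(lam / n, 1e-12, None)       # guard against log(0)
        return 1.0 - np.sum(p * np.log(p)) / np.log(1.0 / n)

    # Sanity checks against Equation (8.10): independent series give C near 0;
    # perfectly correlated series give C near 1.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5000, 4))
    print(correlative_coherence(x))                            # close to 0
    print(correlative_coherence(np.tile(x[:, :1], 4) + 1e-6 * x))  # close to 1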
The other option comes from Hoff and Perez [23], who provide a mea-
sure for the short-term power output variability resulting from an ensem-
ble of equally spaced, identical photovoltaic systems. They construct the
Relative Output Variability measure, defined as the ratio of the Output
Variability for the ensemble to the Output Variability of the same PV fleet
concentrated in a single location. The output variability is
\sigma_{\Delta t}^{N} = \frac{1}{C} \sqrt{\operatorname{Var}\left( \sum_{n=1}^{N} \Delta P_{\Delta t}^{n} \right)} \qquad (8.12)
where C is the total installed peak power of the ensemble and \Delta P_{\Delta t}^{n} is a random variable representing the time series of changes in power at the nth installation, using a sampling time interval of Δt.
8.4 PORTFOLIO ANALYSIS
Delucchi and Jacobson [10] note that
We require
x_1 + x_2 + x_3 + x_4 = 1

The energy supplied by source r is a random variable \varepsilon_r taking the values x_{r1}D, x_{r2}D, \ldots, x_{rp}D, each with probability

P(\varepsilon_r = x_{rk} D) = \frac{1}{p}

so that

E_r := E(\varepsilon_r) = \frac{D}{p} \sum_{k=1}^{p} x_{rk}, \qquad r = 1, 2, 3, 4
k =1
Thus, if our “allocation portfolio strategy” is x^T = (x_1, x_2, x_3, x_4), the total energy supplied during this period is the random variable

\varepsilon := \sum_{r=1}^{4} \varepsilon_r x_r \qquad (8.13)

with expectation

E := E(\varepsilon) = \sum_{r=1}^{4} E(\varepsilon_r)\, x_r \qquad (8.14)
\min \operatorname{Var}(\varepsilon)

subject to

\sum_{r=1}^{4} x_r = 1, \qquad x_1, x_2, x_3, x_4 \ge 0
\operatorname{Var}(\varepsilon) = \frac{1}{p} \sum_{k=1}^{p} \left( D \sum_{r=1}^{4} x_{rk} x_r - E \right)^2 \qquad (8.15)

\sigma_{rs} = \operatorname{Cov}(\varepsilon_r, \varepsilon_s) = \frac{1}{p} \sum_{k=1}^{p} (D x_{rk} - E_r)(D x_{sk} - E_s)

More generally, for n sources,

E_i = E(\varepsilon_i) = \frac{D}{p} \sum_{k=1}^{p} x_{ik}

\sigma_{ij} = \operatorname{Cov}(\varepsilon_i, \varepsilon_j) = \frac{1}{p} \sum_{k=1}^{p} (D x_{ik} - E_i)(D x_{jk} - E_j)
With the covariance matrix \Sigma = [\sigma_{ij}], whose final row is (\sigma_{n1}, \sigma_{n2}, \ldots, \sigma_n^2), the variance-minimization problem over n sources becomes

\min_{x} \; x^T \Sigma\, x

subject to

\sum_{j=1}^{n} E_j x_j \ge L \qquad \text{(EPVM)}

\sum_{j=1}^{n} x_j = 1

x_j \ge 0, \quad \forall j
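A sketch of solving the (EPVM) problem numerically with a general-purpose solver; the covariance matrix, expected outputs, and demand level below are illustrative placeholders rather than values from the chapter.

    import numpy as np
    from scipy.optimize import minimize

    Sigma = np.array([[0.08, 0.02, 0.01, 0.00],
                      [0.02, 0.10, 0.03, 0.01],
                      [0.01, 0.03, 0.06, 0.02],
                      [0.00, 0.01, 0.02, 0.04]])
    E = np.array([0.9, 1.0, 0.8, 0.7])   # expected energy per technology
    L = 0.85                             # required expected supply level

    res = minimize(
        fun=lambda x: x @ Sigma @ x,                      # portfolio variance
        x0=np.full(4, 0.25),
        constraints=[{"type": "ineq", "fun": lambda x: E @ x - L},
                     {"type": "eq",   "fun": lambda x: np.sum(x) - 1.0}],
        bounds=[(0.0, 1.0)] * 4,
        method="SLSQP",
    )
    print(res.x, res.fun)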
8.5 CONCLUSION
In this chapter, we canvassed a number of the tools that are needed for a proper analysis of increasing the proportion of renewable energy in the electricity supply. This is only a macro-level study, as we have investigated neither the power engineering aspects of the problem nor the financial questions.
From the Executive Summary of the Stern Review [37]: “The benefits of strong, early action considerably outweigh the costs.”
REFERENCES
1. M.R. Agrawal, J. Boland, and B. Ridley, Volatility of Wind Energy Using High
Frequency Data, Proceedings of IASTED International Conference; Modelling,
Identification and Control (AsiaMIC 2010), November 24–26, 2010, Phuket,
Thailand, 1–3, 2010.
2. M.R. Agrawal, J. Boland, J. Filar, and B. Ridley, Analysis of wind farm output:
Estimation of volatility using high frequency data, Environmental Modeling
and Assessment, 2013.
3. Australian Energy Market Commission, Future Possible Retail Electricity
Price Movements: 1 July 2010 to 30 June 2013, Technical Report, 2010.
4. Australian Energy Market Operator, An Introduction to Australia’s National
Electricity Market, AEMO, Melbourne, Australia, July 2010.
5. Australian Energy Market Operator, http://www.aemo.com.au, 2011.
6. A. Ben-Tal and A. Nemirovski, Robust optimization–methodology and
applications, Mathematical Programming, 92: 453–480, 2002.
7. J. Boland, K. Gilbert, and M. Korolkowicz, Modelling wind farm output vari-
ability, MODSIM07, Christchurch, New Zealand, 10–13 December, 2007.
8. J.W. Boland, Time series and statistical modelling of solar radiation, in Recent
Advances in Solar Radiation Modelling, Viorel Badescu (Ed.), Berlin and New
York, Springer-Verlag, pp. 283–312, 2008.
9. T. Bollerslev, Generalised autoregressive conditional heteroskedasticity,
Journal of Econometrics, 31: 307–327, 1986.
10. M.A. Delucchi and M.Z. Jacobson, Providing all global energy with wind,
water and solar power. II. Reliability, system and transmission costs, and
policies, Energy Policy, 39: 1170–1190, 2010.
11. Department of Climate Change and Energy Efficiency, Quarterly Update of
Australia’s National Greenhouse Gas Inventory, December Quarter 2010,
Technical report, 2011.
12. R. Doherty, H. Outhred, and M. O’Malley, Establishing the role that wind
generation may have in future generation portfolios, IEEE Transactions on
Power Systems, 21: 1415–1422, 2006.
13. B. Elliston, M. Diesendorf, and I. MacGill, Simulations of scenarios with
100% renewable electricity in the Australian National Electricity Market,
Solar2011, the 49th AuSES Annual Conference, 2011.
Chapter 9
Data Analysis for Real-Time Identification of Grid Disruptions
CONTENTS
9.1 Introduction 274
9.2 Some Research Problems in the Power Grid System 275
9.3 Detection and Visualization of Inter-Area Oscillatory Modes 277
9.3.1 Signal Preprocessing 277
9.3.1.1 Signal Decomposition 280
9.3.2 Identification and Extraction of Dominant Oscillatory
Mode 281
9.3.3 Windowing Fit of Dominant Mode to Full Dataset 283
9.3.3.1 Identification of Coherent Groups 284
9.3.4 Visualization of Modal Extraction Results 285
9.4 Classification of Power Grid Frequency Data Streams Using
k-Medians Approach 288
9.4.1 k-Medians Approach Detecting Disruptive Events 288
9.4.2 Using k-Medians to Determine Pre-Event and Post-
Event Operating Points 290
9.4.3 Selection of Window Data 291
9.4.4 Determination of Decision Boundary 292
9.4.5 Evaluation of Decision Boundary 293
9.1 INTRODUCTION
Modernizing the power grid raises several data analysis challenges: (1) the need to handle massive volumes of measurement data; (2) the need for algorithms that can not only handle the multidimensional nature of the data, but also model both the spatial and temporal dependencies in the data, which, for the most part, are highly nonlinear; and (3) the need for algorithms that can operate in an online fashion with streaming data.
As stated above, one element of the modernized power grid system is
the installation of a wide-area frequency measurement system on the elec-
tric poles in the streets for condition monitoring of the distribution lines.
This would provide frequency measurements that reflect the status of the
electric grid and possible information about impending problems before
they occur. The timely processing of these frequency data could elimi-
nate impending failures and their subsequent cascading into the entire
system. The ability to monitor the distribution lines is just one facet of the
proposed smart grid technology. Other elements include the installation
of advanced devices such as smart meters, the automation of transmis-
sion lines, the integration of renewable energy technologies such as solar
and wind, and the advancement of plug-in hybrid electric vehicle technol-
ogy. The overall objective then is to make the electric grid system more
robust in view of impending national and global operational challenges.
A wide-area frequency disturbance recorder (FDR) is already in use
at both the transmission and distribution levels of the power grid system
[1]. These recorders are used to monitor and record the changes in voltage
frequency in real time at various locations. The FDRs perform local GPS
synchronized frequency measurements and send data to a central server
via the Internet, and the information management system handles data
collection, storage, communication, database operations, and a Web ser-
vice. There are currently more than 50 FDRs deployed around the United
States. Each FDR measures the voltage phasor and frequency at the distri-
bution level using a 110V outlet and streams ten data points per second,
with future models expected to have higher streaming rates. One imme-
diate challenge with the massive amount of data streams collected from
FDRs is how to detect and classify an impending failure of the grid from
multiple high-speed data streams in real time while minimizing false alarms
and eliminating missed detections, and then how to identify and evaluate the
impacts of the detected failures.
In the next three sections we describe three different applications of data
mining for addressing two electric grid related problems. The first problem
deals with identifying a specific type of pattern in the data (pattern dis-
covery), while the second problem deals with identifying disruptive events
from grid data. Section 9.3 addresses the first problem, while Sections 9.4
and 9.5 address the second problem.
9.3.1 Signal Preprocessing
In order to provide a better fit of the oscillatory content, the measurement
data needs to be properly conditioned first. An example of measured fre-
quency data is given in Figure 9.1(a). It is desired to extract the properties
of this oscillation.
FIGURE 9.1 (See color insert.) Sample FDR traces and post-processing results: (a) selected FDR traces for example oscillation; (b) detrended, denoised example data; (c) intrinsic mode functions computed from Bangor, ME example data; (d) frequency spectra of intrinsic mode functions; (e) input filter output for example data.
The data for this example is drawn from FDR measurements that capture
the system response to a generation trip. The resulting system frequency
drop is seen in Figure 9.1 as a sharp decline from about 60.05 Hertz (Hz) to
59.96 Hz. During this drop period, a strong oscillation is also observed with
power grid systems in Maine oscillating 180 degrees out of phase with sys-
tems in North Dakota. This example dataset will be used throughout this
section to demonstrate the operation of the modal identification procedure.
9.3.1.1 Signal Decomposition
Empirical Mode Decomposition is a data-driven method that decomposes
a signal into a set of Intrinsic Mode Functions (IMFs). Each IMF is an oscil-
latory signal that consists of a subset of frequency components from the
original signal. As opposed to Fourier, wavelet, and similar methods, EMD
constructs these component signals directly from the data by identifying
local extrema and setting envelopes around the signal in an iterative pro-
cess. A fit is then performed on the local extrema to create an IMF. After
the creation of an IMF, it is subtracted from the original signal and the pro-
cess repeats to identify the next IMF. This identification and removal pro-
cess continues until the original signal has been completely described by a
set of IMFs. The output of the EMD process is the set of IMFs; generally, this
set is a small number of signals (usually less than ten for the data considered
here) that, when summed together, completely match the original signal.
The EMD algorithm does not explicitly compute oscillation frequencies,
amplitudes, or phase angles as with other signal decomposition techniques;
instead, the IMFs are derived directly from the input signal based on its
local extrema. A complete mathematical description of the EMD algorithm
is beyond the scope of this chapter but can be found in [9–12]. The EMD and
associated Hilbert-Huang Transform have also been recently proposed as
methods for isolating inter-area modes in power systems [16, 17].
Performing an EMD on the Bangor trace of Figure 9.1(b) extracts the
seven IMFs given in Figure 9.1(c). The first and second IMFs extracted in
this process are given by the blue and green traces in Figure 9.1(c); these
capture the high-frequency components and noise of the input signal. The
next three IMFs given by the red, cyan, and violet traces capture the mid-
dle frequencies present in the signal. Finally, the last two IMFs extracted,
those represented by the mustard and black lines, define the low-frequency
components of the input, representing the event drop itself in this case.
The Fourier transform of the IMF signals in Figure 9.1(c) is given in
Figure 9.1(d). The frequency variable is plotted on a log scale to better dem-
onstrate the frequencies in the lower range. The individual Fast Fourier
Transforms (FFTs) have been scaled by their maximum values so that the low-
frequency components do not dominate the plot. Inspection of IMF signals
confirms that each IMF is capturing one specific band of the original signal.
The first IMF extracted, which is plotted in blue, is centered around 3 to 4 Hz,
and each subsequent one picks up successively lower-frequency components.
Because the inter-area band is 0.1 Hz to 0.8 Hz, we would like to pre-
serve only those IMFs that have a significant portion of their power in
this band, and discard the others. The final implementation of this filter
computes the IMFs and then performs a Fourier transform of each one.
Using the Fourier transform results, the percentage of power within the
inter-area band is computed for each IMF. If this percentage exceeds a
given threshold, the IMF is retained; otherwise, it is discarded. The final
filter output is the summation of the retained IMFs. Through testing
on several datasets, it was determined that cut-off frequencies of 0.1 Hz
and 0.8 Hz with a power percentage threshold of 0.75 provided the best
response. These settings gave the best preservation of the inter-area band
while removing most of the other frequency components.
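A sketch of this IMF-selection filter follows, assuming the IMFs have already been computed by an EMD implementation (e.g., a library such as PyEMD); imfs holds one IMF per row, sampled at fs Hz, and the cut-offs and 0.75 power threshold follow the settings above.

    import numpy as np

    def interarea_filter(imfs, fs, f_lo=0.1, f_hi=0.8, threshold=0.75):
        n = imfs.shape[1]
        freqs = np.fft.rfftfreq(n, d=1.0 / fs)
        band = (freqs >= f_lo) & (freqs <= f_hi)
        kept = []
        for imf in imfs:
            power = np.abs(np.fft.rfft(imf)) ** 2
            # Retain the IMF only if most of its power is in the inter-area band.
            if power.sum() > 0 and power[band].sum() / power.sum() >= threshold:
                kept.append(imf)
        # The filter output is the sum of the retained IMFs.
        return np.sum(kept, axis=0) if kept else np.zeros(n)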
The total input filtering process for this modal identification applica-
tion consists of three stages: first, the median detrending stage; followed
by a moving median filter; and finally the EMD-based filter. This process
serves to isolate only the frequency components within the inter-area
band so that further processing can extract specified modes within this
region. Applying this multistage filtering process to the set of example
data results in the signals presented in Figure 9.1(e). Comparing this
plot with that of Figure 9.1(a), it is seen that the low-frequency trend is
completely removed, leaving a zero-centered signal. Additionally, the
higher frequency components and noise have also been removed from
the raw data vectors. The oscillation in the inter-area band observed
during the event drop is now the most prominent feature of the data,
making it easier to extract from a computational viewpoint.
This filtering procedure was found to function well for several types of
data. It is demonstrated here on frequency measurements derived from the
FDR system, but it performs similarly for angle measurements from FDR
and Phasor Measurement Unit (PMU) datasets. Given any of these differ-
ent types of data, the filtering process returns a zero-centered signal with
the isolated inter-area modes similar to that of Figure 9.1(e). The ability of
this filtering process to work for various different types of datasets makes
the final modal extraction procedure compatible with all these forms of
input data. The remainder of the modal extraction procedure is tuned to
handle data vectors similar to those of Figure 9.1(e); thus, any dataset that
the input filter can reduce to that form can be processed.
FIGURE 9.2 Matrix pencil results and analysis for one subset of the example data: (a) matrix pencil fit for data window 500, Grand Rapids; (b) distribution of candidate mode frequencies.
y = A e^{\alpha t} \cos(2\pi f t + \theta) \qquad (9.1)
The power of this signal is given by Equation (9.2). Note that this equation
does not give a true electrical power as it is derived from a frequency signal
instead of a current or voltage.
P_y = y^2 \qquad (9.2)
The total energy of Equation (9.1) can then be expressed as the summa-
tion over time of the power as stated in Equation (9.3). Once again, this
is not a true physical energy quantity, merely an analogous metric of the
data signal.
E_y = \sum_t P_y \qquad (9.3)
For each measurement vector and window instance, the damping, amplitude, and phase are derived that provide a best fit to the given data. The moving window is sized such
that it covers approximately one period of the dominant oscillation fre-
quency. The data window then starts at the first data point to be included
in the output. It then moves across the full dataset, shifting one time-step
at a time until the end time is reached. For each instance of the window,
each measurement vector is filtered according to the process of Section
9.3.1. After filtering the data, vectors are resampled to increase the discrete
sampling rate. As before, this resampling serves to increase the stability and
accuracy of the final fit.
Once these conditioning steps have been performed, the fit is ready to
be executed. In this case, we want to fit a damped sinusoid of a speci-
fied frequency. A least squares fit of a damped sinusoid function was per-
formed. This function is of the form in Equation (9.4). Here, the variable
fit parameters are the amplitude A, the damping factor α, the phase θ,
and a DC offset C. In Equation (9.4), f0 is the dominant modal frequency
determined previously:

y = A e^{\alpha t} \cos(2\pi f_0 t + \theta) + C \qquad (9.4)
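A sketch of this least squares fit using scipy's curve_fit; the function name and starting values are illustrative, and f0 is held fixed as described.

    import numpy as np
    from scipy.optimize import curve_fit

    def fit_damped_sinusoid(t, y, f0):
        # Free parameters: amplitude A, damping alpha, phase theta, DC offset C.
        def model(t, A, alpha, theta, C):
            return A * np.exp(alpha * t) * np.cos(2 * np.pi * f0 * t + theta) + C
        p0 = [np.std(y) * np.sqrt(2), 0.0, 0.0, np.mean(y)]
        params, _ = curve_fit(model, t, y, p0=p0, maxfev=10000)
        return params   # A, alpha, theta, C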
FIGURE 9.3 Mode phasors plotted on a polar axis and grouped by the clustering algorithm.
Figure 9.3 gives the results of the clustering algorithm for a set of mode
phasors drawn from the example data. Here, each measurement device
yields one phasor. The clustering algorithm proceeds to identify those form-
ing coherent groups by classifying them according to the direction in
which they point. In Figure 9.3, each group is identified by a separate color,
with the group centroid given by the dashed line of the same color.
FIGURE 9.4 (See color insert.) Identification of a grid event using phasor visualization of the 0.428 Hz mode (average damping = −0.0511): (a) pre-event condition; (b) initial cycle of oscillation; (c) oscillation reaches largest magnitude; (d) damping of oscillation; and (e) system approaches steady state.
The top of the frame gives the raw measurement data for several FDRs. The time
range highlighted in red is the region to which this frame corresponds.
This plot is intended to provide a timeline of the oscillation event under
study, as well as give the time point currently being displayed. The main
body of the frame of Figure 9.4(a) consists of geographic region spanned
by the power system with each mode phasor plotted at the location of its
corresponding measurement point. The phasors in this frame are also
color-coded by the coherent group to which they belong. In the lower-left, a third group, oscillating halfway between the other two, has been identified by the clustering algorithm and is shown in red and green.
The oscillation has almost completely died out by the time the movie
approaches the time span of Figure 9.4(e). At all the measurement points,
the mode phasors have little to no amplitude. In addition, the computed
dampings do not suggest that the oscillations will grow substantially over the next
cycle. The average system damping is still negative but approaching zero;
this is not necessarily cause for alarm, and is due mainly to the fact that the system
is not demonstrating any appreciable oscillation. From this it is clear
that the oscillation has run its course and the system has achieved its new
steady-state value.
The visualizations and movie creation described in this section were
performed in MATLAB. The individual frames are rendered as MATLAB
figures and then imported into an *.avi file to create the movie. The
geographic mapping and coastlines of the displays were achieved with the
use of M_Map [18], a set of MATLAB functions implementing a
mapping toolbox. The approaches presented in this section can be used by
engineers and control room operators to identify, extract, and visualize
inter-area modes within the system in real time.
FIGURE 9.5 k-Medians analysis of FDR frequency data: (a, b) frequency traces with a medians difference of 0.0025 Hz; (c) k-medians results for FDR 2 data during test event 543 (medians difference = 0.049 Hz); (d) response of the decision metric to window size for selected events; (e) results of k-medians testing on the training set; (f) value of the decision metric during an event.
Medians are less sensitive to outliers than means, and it is for this reason that k-Medians was chosen as opposed to k-Means. The k-Medians algorithm [4] partitions the data into clusters \pi_1, \ldots, \pi_k with centers c_1, \ldots, c_k chosen to minimize the objective

Q\big(\{\pi_j\}_{j=1}^{k}\big) = \sum_{j=1}^{k} \sum_{x \in \pi_j} \| x - c_j \|_1
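A minimal one-dimensional k-medians sketch (alternating nearest-center assignment with median updates, which decreases the L1 objective above); the initialization scheme is illustrative.

    import numpy as np

    def k_medians(x, k, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = rng.choice(x, size=k, replace=False)
        labels = np.zeros(len(x), dtype=int)
        for _ in range(iters):
            # Assign each point to its nearest center under L1 distance.
            labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
            # Update each center to the median of its cluster.
            new_centers = np.array([
                np.median(x[labels == j]) if np.any(labels == j) else centers[j]
                for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels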
Given a regularly sampled time series T, such that each value T_t is the measurement at time t for a given sensor, identify spans of the form [t_1, t_2] such that the underlying system is in an anomalous state from time t_1 to t_2.
S_t^{+} = \max\big(0,\; S_{t-1}^{+} + (T_t - \omega_t)\big)

S_t^{-} = \max\big(0,\; S_{t-1}^{-} + (\omega_t - T_t)\big)

The quantity \omega_t is the weight assigned at each time instance. While this can be dependent on t, it is set to

\omega_t = \mu_0 + k\,|\mu_0 - \mu|
The S+ statistic monitors the changes in the positive direction (also some-
times referred to as “high-side” CUSUM), and the S- statistic monitors the
changes in the negative direction (also sometimes referred to as “low-side”
CUSUM). For monitoring grid disruptions, the low-side CUSUM is relevant because one is interested in detecting events in which the power frequency drops.
Figure 9.6 shows a simple example that illustrates CUSUM-based
anomaly detection on a synthetically generated time-series dataset. The
first 100 and last 100 points in the time series are generated from a normal
distribution with mean 0 and standard deviation 1. The points from time
101 to 200 are generated from a normal distribution with mean −0.25 and
standard deviation 1. It can be seen that although the anomalous region is
indistinguishable to the naked eye, the CUSUM-based approach can still
identify the anomalous region. During the anomalous state, the CUSUM
score increases and starts falling once the time series returns to the normal state.
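A sketch reproducing this synthetic experiment with a low-side CUSUM; here the reference value is set below the in-control mean by half the shift to be detected, a common convention, and the specific allowance value is illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    series = np.concatenate([rng.normal(0.0, 1.0, 100),
                             rng.normal(-0.25, 1.0, 100),   # anomalous region
                             rng.normal(0.0, 1.0, 100)])

    mu0, allowance = 0.0, 0.125          # allowance = half the shift of 0.25
    s_minus = np.zeros(len(series))
    for t in range(1, len(series)):
        # Low-side CUSUM: accumulates when observations fall below mu0 - allowance.
        s_minus[t] = max(0.0, s_minus[t - 1] + (mu0 - allowance) - series[t])

    print("peak CUSUM score:", s_minus.max(), "at t =", int(s_minus.argmax()))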
Computationally, this approach is fast because it requires a constant-time operation at each time step, and it is also memory efficient, as we
FIGURE 9.6 A random time series with anomalous region and CUSUM scores.
only need to maintain the value of the CUSUM statistic of the previous
time instance.
The key issue with this approach is that the output of CUSUM requires a
threshold to declare when the system is in an anomalous state. If the thresh-
old is set very low, the false positive rate is high; and when the threshold is
set high, it might result in a delay in identifying an event. To alleviate this
problem, the quality control literature provides a way to set the threshold
based on the Average Run Length (ARL) metric. Typically, the threshold is
set based on the number of expected events in an in-control process.
The time series are sampled at the rate of 10 Hertz. The length of the time series
for each day is 864,000. The data was analyzed separately for each month.
The frequency data are preprocessed using the k-Medians approach with
k set to 5.
The allowance value (k) for the CUSUM algorithm is set to 1, the mini-
mum shift (μ) to be detected is set to 0.05, and the in-control distribu-
tion mean (μ0) is 0.* An anomalous event is defined as a subsequence of a
month-long time series in which the CUSUM statistic is greater than zero.
Based on an understanding of the domain, an anomalous event is consid-
ered significant if it lasts for at least 2 seconds (20 observations).
9.5.2.2 Raw Results
Table 9.1 summarizes the number of significant anomalous events identi-
fied for each of the sensors for May 2008 and June 2008. The results show
that for all the FDRs, the fraction of time in which the system is in an
anomalous state is a small fraction of the total time, but that fraction itself
can be a large number. For example, FDR 11 in Grand Rapids, MI, was
in an anomalous state for a fraction 0.0082 of the total time, approximately
200,000 observations. However, the number of significant anomalous
events for each sensor for a month is not more than 119.
Evaluation of the output of the anomaly detector is a challenge, given
the lack of ground truth data about the grid-related events for that time
period. A possibility is to examine the local news sources for the specific
days of the events, although that process can be expensive, as well as not
guaranteed to cover every grid event. To further consolidate the output, a
spatial co-location constraint can be applied, as discussed below.
* The data have already been centered using the k-Medians approach by using the medians as cen-
ters to shift the subset of data.
9.6 CONCLUSIONS
Data mining has immense significance for addressing several key
power grid problems, particularly in the arena of rapid event detection as
discussed in this chapter, and it has the potential to go a long way toward
realizing the promised benefits of the smart grid. The key challenges
associated with this domain, in terms of data analysis, are the massive
nature of the data and the short reaction time allowed (on the order of
a few seconds) for an adequate response. The analytic solutions proposed
in this chapter focus primarily on simple analyses that can be scaled
to the data sizes and the high sampling rate of the incoming power signal.
In the future, as the synchrophasors become more and more advanced
(both in terms of sampling rate as well as the number of deployed sen-
sors across the country), more research will be required to make the data
analysis solutions scalable.
REFERENCES
1. J.N. Bank, O.A. Omitaomu, S.J. Fernandez, and Y. Liu, Visualization and
classification of power system frequency data streams, Proceedings of the
IEEE International Conference on Data Mining Workshops, p. 650–655, 2009.
2. P. Domingos and G. Hulten, Mining high-speed data stream, Proceedings of
the Sixth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, pp. 71–80, 2000.
3. G. Hulten, L. Spencer, and P. Domingos, Mining time-changing data streams,
Proceedings of KDD 2001, pp. 97–106, 2001.
4. D. Kifer, S. Ben-David, and J. Gehrke, Detecting change in data streams,
Proceedings of the 30th VLDB Conference, p. 180–191, 2004.
5. L. O’Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani,
Streaming data algorithms for high-quality clustering, Proceedings of the
18th International Conference on Data Engineering, 2002.
6. B. Qiu, L. Chen et al. Internet based frequency monitoring network (FNET),
IEEE Power Engineering Society, Winter Meeting, 2001. Vol. 3, 28 Jan.–1 Feb.
2001, pp. 1166–1171.
7. J.K. Wang, R.M. Gardner, and Y. Liu, Analysis of system oscillations using
wide-area measurements, IEEE Power Engineering Society, General Meeting,
2007. 25 June–28 June 2006.
8. J.N. Bank, O.A. Omitaomu, S.J. Fernandez, and Y. Liu, Extraction and visual-
ization of power system inter-area oscillatory modes, Proceedings of the IEEE
Power & Energy Society Annual Meeting, Minneapolis, MN, July 25–29, 2010.
9. N.E. Huang, Z. Shen, S.R. Long, M.C. Wu, S.H. Shih, Q. Zheng, C.C. Tung,
and H.H. Liu. The empirical mode decomposition method and the Hilbert
spectrum for non-stationary time series analysis, Proceedings of the Royal
Society of London, A454: 903–995, 1998.
10. H. Liang, Z. Lin, and R.W. McCallum. Artifact reduction in electrogas-
trogram based on the empirical mode decomposition method. Medical
Biological Engineering and Computing, 38: 35–41, 2000.
11. N.E. Huang, M.-L. Wu, W. Qu, S.R. Long, and S.P. Shen. Applications of
Hilbert–Huang transform to non-stationary financial time series analysis,
Applied Stochastic Models in Business and Industry, 19: 245–268, 2003.
12. S.T. Quek, P.S. Tua, and Q. Wang. Detecting anomalies in beams and plate
based on the Hilbert–Huang transform of real data, Smart Material Structure,
12: 447–460, 2003.
13. Z. Zhong, C. Xu, et al. Power system Frequency Monitoring Network (FNET)
implementation, IEEE Trans. on Power Systems, 20(4): 1914–1921, 2005.
14. T.K. Sarkar and O. Pereira. Using the matrix pencil method to estimate
the parameters of a sum of complex exponentials, IEEE Antennas and
Propagation Magazine, 37(1): 48–55, 1995.
Data Analysis for Real-Time Identification of Grid Disruptions ◾ 301
15. Y. Hua and T.K. Sarkar. Matrix pencil method for estimating parameters of
exponentially damped/undamped sinusoids in noise, IEEE Transactions on
Acoustics, Speech and Signal Processing, 36(2): 228–240, 1988.
16. D.S. Laila, A.R. Messina, and B.C. Pal. A refined Hilbert-Huang transform
with applications to interarea oscillation monitoring, IEEE Transactions on
Power Systems, 24(2): 610–620, May 2009.
17. N. Senroy. Generator coherency using the Hilbert-Huang transform, IEEE
Transactions on Power Systems, 23(4), 1701–1708, November 2008.
18. R. Pawlowicz, M_Map: A Mapping Package for Matlab, http://www.eos.ubc.
ca/~rich/map.html, Ver 1.4e, Oct 2009.
19. F. Gustafsson. Adaptive Filtering and Change Detection. New York: Wiley, 2000.
20. J.M. Lucas and M.S. Saccucci. Exponentially weighted moving average control
schemes: Properties and enhancements, Technometrics, 32(1): 1–29, 1990.
21. S. Boriah, V. Kumar, M. Steinbach, C. Potter, and S. Klooster, Land cover
change detection: A case study, Proceeding of the 14th KDD, pp. 857–865, 2008.
22. V. Chandola and R.R. Vatsavai. A scalable Gaussian process analysis algo-
rithm for biomass monitoring, Statistical Analysis and Data Mining, 4(4):
430–445, 2011.
23. E.S. Page. On problems in which a change can occur at an unknown time,
Biometrika, 44(1-2): 248–252, 1957.
Chapter 10
Statistical Approaches for Wind Resource Assessment
CONTENTS
10.1 Introduction 304
10.2 Measure-Correlate-Predict 307
10.3 Methodology for Wind Speed Estimation 310
10.3.1 Multivariate Normal Model 313
10.3.2 Nonparametric Multivariate Model 313
10.3.3 Graphical Model with Naive Structure 314
10.3.4 Graphical Model with Structure Learning 316
10.3.5 Multivariate Copulas 316
10.4 Evaluation Setup 319
10.5 Results and Discussion 321
10.5.1 Comparison of Algorithms 321
10.5.2 Increasing the Data Available for Modeling 322
10.6 Conclusions and Future Work 323
Acknowledgments 326
References 326
Appendix 10A 327
10.1 INTRODUCTION
FIGURE 10.1 (See color insert.) A wind resource estimation is expressed as a bivariate (speed and direction) statistical distribution (left) or a “wind rose” (right).
The assessment can also be visualized via a wind rose; see Figure 10.1
(right). The span of the entire 360° is oriented in the north-south com-
pass direction to inform its alignment to the site. Figure 10.1 (right) shows
16 direction intervals, each as a discrete “slice” with coloring that depicts
wind speed. The length and width of each slice convey probability.
There are multiple methodologies that derive a wind resource assess-
ment. All are subject to great uncertainty. When a wind resource
assessment is based upon wind maps and publicly available datasets from
the closest locations, it tends to overestimate the wind speed because the
maps are so macroscopic. Even when the resource estimated by the wind
map for a geographical location is improved upon by utilizing a model that
accounts for surface roughness and other factors, significant inaccuracies
persist because specific details of the site remain neglected. Alternatively,
a computational fluid dynamics (CFD) model can be used to achieve a
better resource assessment. However, CFD also has limitations. It is very
difficult to incorporate all the local attributes and factors related to tur-
bulence into the simulation. While the wind industry has started to com-
bine CFD and wind map approaches, the current methods are ad-hoc, not
robust, and more expensive than desired.
In this chapter we provide new techniques for the only assessment
methodology that takes into account as many years of historical data as
possible (although those data are remote from the site itself), while also
integrating site-specific information, albeit short term and relatively noisy.
We consider the Measure-Correlate-Predict assessment methodology,
abbreviated as MCP, which exploits anemometers and/or other sensing
equipment that provide site-specific data [1–4].
equipment that provide site-specific data [1–4]. The Measure step involves
measuring wind speed and direction at the site for a certain duration of
time. In the Correlate step, these data are then associated with simultane-
ous data from nearby meteorological stations, so-called historical sites that
also have long-term historical data. A correlation model is built between
the time-synchronized datasets. In the Predict step, the model is then used
along with the historical data from the meteorological stations to predict
the wind resource at the site. The Prediction is expressed as a bivariate
(speed and direction) statistical distribution or a “wind rose” as shown in
Figure 10.1.
While MCP does incorporate site-specific data, these data are based
upon very inexpensive sensors, that is, anemometers, which are conse-
quently very noisy. Additionally, anemometers are frequently moved on
the site and not deployed for any significant length of time. Thus, the key
challenge is to estimate accurately from short-term, noisy site data. We evaluate
our techniques on measurements for a site at the Boston Museum of Science,
varying the availability of site data between 3, 6, and 8 months while correlating
with data from 14 airports nearby; see Figure 10.4 and Table 10.1.
We proceed as follows: Section 10.2 presents a detailed description of
MCP. Section 10.3 presents statistical techniques that can be used in an MCP
framework. Section 10.4 presents the means by which we evaluate the
techniques. Section 10.5 presents the empirical evaluation. Finally, Section
10.6 states our conclusions and outlines future work.
10.2 MEASURE-CORRELATE-PREDICT
We consider wind resource estimation derived by a methodology known
as Measure-Correlate-Predict; see Figure 10.2. In terms of notation, the
wind at a particular location is characterized by speed denoted by x and
direction θ. Wind speed is measured by anemometers, and wind direction
is measured by wind vanes. The 360° direction is split into multiple bins
with a lower limit (θ_l) and an upper limit (θ_u). We index the directional bins with j = 1…J. We represent the wind speed measurement
at the test site (where wind resource needs to be estimated) with y and the
other sites (for which the long-term wind resource is available) as x, and
we index these other sites with i = 1…m.
FIGURE 10.2 MCP generates a model correlating site wind directions to those
simultaneously at historical sites. For a directional bin, it generates a model cor-
relating simultaneous speeds.
The three steps of MCP are:
1. Measure: Wind speed and direction are measured at the site for a period of time, and simultaneous data are assembled from the historical sites, where each series x^i corresponds to one historical site, k and n are time indices, and m denotes the total number of historical sites. Historical data that are not simultaneous in time to the site observations used in modeling will be used in the Predict step.
2. Correlate: A single directional model is first built correlating the wind directions observed at the site with simultaneous historical site wind directions. Next, for each directional interval, called a (directional) bin, of the 360° radius, a model is built correlating the wind speeds at the site with simultaneous speeds at the historical sites, that is, Y_{t_i} = f_{\theta_j}(x_{t_i}^{1 \ldots m}), where k ≤ i ≤ n. The data available from the site at this stage are expected to be sparse and noisy.
3. Predict: To obtain an accurate estimation of long-term wind conditions at the site, we first divide the data from the historical sites (which are not simultaneous in time to the site observations used in modeling) into subsets that correspond to a directional bin. Prediction of the long-term site conditions follows two steps:
a. We use the model we developed for the direction f_{\theta_j} and the data from the historical sites corresponding to this direction, x_{t_1 \ldots t_{k-1}}^{1 \ldots m} \mid \theta_j, to predict what the wind speed Y_p = \hat{y}_{t_1 \ldots t_{k-1}} at the site would be.
b. With the predictions Yp, from Step 3a above, we estimate param-
eters for a Weibull distribution. This distribution is our answer
to the wind resource assessment problem. We generate a distri-
bution for each directional bin. A few example distributions for
different bins are shown in Figure 10.1 (left). Alternatively, these
distributions can be summarized via a wind rose also shown in
Figure 10.1 (right).
f_{Y|X=x_k}(y \mid x_k) = \frac{f_{X,Y}(x_k, y)}{\int_y f_{X,Y}(x_k, y)\, dy} \qquad (10.1)
Step 3: We can now make a point prediction of ŷ_k by finding the value of y that maximizes the conditional:

\hat{y}_k = \arg\max_{y} f_{Y|X=x_k}(y \mid x_k) \qquad (10.2)
Step 4: All the predictions ŷ_{1…K} are fitted with a Weibull density function, which gives an estimate of the long-term wind resource at the test site.
FIGURE 10.3 Top: Naive Bayes structure. This structure is assumed and no
learning is required. x1…x14 represents the 14 variables from the airports and
y represents the variable at the test site in Boston. Bottom: Structure is learned
using the K2 algorithm. A maximum of two parents is specified. x2 emerges as a
parent for most of the nodes.
10.3.1 Multivariate Normal Model
In this model, the joint vector z = (x_1, …, x_m, y) is assumed to be multivariate normal with mean μ and covariance Σ,

f_Z(z) = (2\pi)^{-(m+1)/2} \det(\Sigma)^{-1/2} \exp\left( -\frac{1}{2} (z - \mu)^T \Sigma^{-1} (z - \mu) \right) \qquad (10.3)

with the covariance estimated from the n training samples as

\hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})(z_i - \bar{z})^T \qquad (10.4)
Once we estimate the parameters for the joint density given the training data, using closed-form expressions, we use the joint density function to derive the conditional density for y given the samples x_k in the testing data. This density is also Gaussian, with mean μ_{y|x_k} and variance σ_{y|x_k}. The value μ_{y|x_k} is used as the point prediction ŷ_k for the given x_k, and σ_{y|x_k} provides the uncertainty around the prediction: if σ_{y|x_k} is high, the uncertainty is high.
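A sketch of this closed-form prediction step; conditional_gaussian is a hypothetical helper that returns μ_{y|x_k} and σ²_{y|x_k} from the estimated joint mean and covariance.

    import numpy as np

    def conditional_gaussian(mu, Sigma, x_k):
        """mu, Sigma: joint mean/covariance of (x_1..x_m, y); x_k: observed x."""
        m = len(x_k)
        Sxx, Sxy = Sigma[:m, :m], Sigma[:m, m]
        Syy = Sigma[m, m]
        w = np.linalg.solve(Sxx, Sxy)          # regression weights
        mean = mu[m] + w @ (x_k - mu[:m])      # mu_{y|x_k}: the point prediction
        var = Syy - w @ Sxy                    # sigma^2_{y|x_k}: the uncertainty
        return mean, var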
10.3.2 Nonparametric Multivariate Model
Here the joint density is represented with a kernel density estimate built from the L training points,

f_{X,Y}(x, y) = \frac{1}{L} \sum_{l=1}^{L} \prod_{j=1}^{m} K(x_j - x_{j,l})\, K(y - y_l) \qquad (10.5)
For a test point x_k, for which we do not know the output, a prediction is made by finding the expected value of the conditional density function f_Y(y | x_k), given by

E(Y \mid X = x_k) = \int_y y\, f_Y(y \mid x_k)\, dy \qquad (10.6)

= \int_y y\, \frac{\frac{1}{L} \sum_{i=1}^{L} \prod_{j=1}^{m} K(x_{j,k} - x_{j,i})\, K(y - y_i)}{\frac{1}{L} \sum_{i=1}^{L} \prod_{j=1}^{m} K(x_{j,k} - x_{j,i})}\, dy \qquad (10.7)

= \frac{\sum_{i=1}^{L} y_i \prod_{j=1}^{m} K(x_{j,k} - x_{j,i})}{\sum_{i=1}^{L} \prod_{j=1}^{m} K(x_{j,k} - x_{j,i})} \qquad (10.8)
In this model there are no parameters and there is no estimation step. Although Equation (10.5) presents the density function, it is not evaluated unless we see a new testing point. To evaluate the expected value of the wind speed at the test site, we need to store all the training points and use them in Equation (10.8) to make predictions. In Equation (10.8), given a test point x_k = {x_{1,k} … x_{m,k}}, the kernel is evaluated on the difference between each training point and the test point; each kernel value is multiplied by the corresponding output y_i, and the products are summed over the training points 1…L to form the numerator. The same summation, without the multiplication by y_i, forms the denominator. This approach has a few drawbacks. The designer must choose the kernel, and the kernel parameters must then be tuned via further splitting of the training data. It also requires retaining all the training data in order to make predictions, and the kernel is evaluated L times for each test point.
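A sketch of the prediction of Equation (10.8) with Gaussian kernels (so that the product over dimensions collapses to a single squared distance); the bandwidth h is a placeholder that would be tuned as described.

    import numpy as np

    def kernel_predict(X_train, y_train, x_k, h=1.0):
        """X_train: (L, m); y_train: (L,); x_k: (m,)."""
        # Product of per-dimension Gaussian kernels on the differences.
        sq = np.sum((X_train - x_k) ** 2, axis=1)
        w = np.exp(-0.5 * sq / h**2)
        # Weighted average of training outputs: Equation (10.8).
        return np.sum(w * y_train) / np.sum(w)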
10.3.3 Graphical Model with Naive Structure
Under a naive structure, the historical-site variables are conditionally independent given the site variable y, so the joint density factors as

f_{X,Y}(x, y) = \prod_{i=1}^{m} f_{X_i}(x_i \mid y)\, f_Y(y) \qquad (10.9)
More generally, a Bayesian network factors a joint distribution according to the parent sets \mathrm{Pa}_{X_i} of its graph,

P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}_{X_i})
The conditional density for y given x_k is then

f_{Y|X=x_k}(y \mid x_k) = \frac{\prod_{i=1}^{m} f_{X_i}(x_i \mid y)\, f_Y(y)}{\int_y \prod_{i=1}^{m} f_{X_i}(x_i \mid y)\, f_Y(y)\, dy} \qquad (10.10)
The conditional is symmetric, so the mean and median of y | x_k are the same. This is also the value that maximizes the conditional. Hence, ŷ_k is

\hat{y}_k = \int_y y\, f_{Y|X=x_k}(y \mid x_k)\, dy \qquad (10.11)
10.3.5 Multivariate Copulas
Our previous modeling techniques assume a Gaussian distribution for all
variables and a Gaussian joint for the multivariate. It is arguable, how-
ever, that Gaussian distributions do not accurately represent the wind
speed distributions. In fact, conventionally, a univariate Weibull distribu-
tion [15] is used to parametrically describe wind sensor measurements. A
Weibull distribution is likely also chosen for its flexibility because it can
express any one of multiple distributions, including Rayleigh or Gaussian.
To the best of our knowledge, however, joint density functions for
non-Gaussian distributions have not been estimated for wind resource
assessment. In this chapter, to build a multivariate model from mar-
ginal distributions that are not all Gaussian, we exploit copula functions.
A copula framework provides a means of modeling a multivariate joint distribution from arbitrary marginal distributions. The joint density is obtained by differentiating the copula function C applied to the marginal CDFs,

f(x_1, \ldots, x_m, y) = \frac{\partial^{m+1}}{\partial x_1 \cdots \partial x_m\, \partial y}\, C\big(F(x_1), \ldots, F(x_m), F(y); \theta\big) \qquad (10.14)
where c(·,·) is the copula density. Thus, the joint density function is a weighted version of the independent density functions, where the weight is derived via the copula density. Multiple copulas exist in the literature. In this chapter, we consider a multivariate Gaussian copula to form a statistical model for our variables, given by

C\big(u_1, \ldots, u_m, u_y; \Sigma\big) = F_G\big(F^{-1}(u_1), \ldots, F^{-1}(u_m), F^{-1}(u_y)\big)

where F_G is the CDF of the multivariate normal with zero mean vector and Σ as covariance, and F^{-1} is the inverse of the standard normal.
Estimation of parameters: There are two sets of parameters to estimate. The
first set of parameters for the multivariate Gaussian copula is Σ. The second
set, denoted by Ψ = {ψ,ψy}, consists of the parameters for the marginals of x,y.
Given N i.i.d. observations of the variables x, y, the log-likelihood function is

L(x, y; \Sigma, \Psi) = \sum_{l=1}^{N} \log f(x_l, y_l \mid \Sigma, \Psi) \qquad (10.17)

= \sum_{l=1}^{N} \log\left[ \prod_{i=1}^{m} f(x_{il}; \psi_i)\, f(y_l; \psi_y)\, c\big(F(x_1), \ldots, F(x_m), F(y); \Sigma\big) \right] \qquad (10.18)

and the marginal parameters are estimated as

\hat{\Psi} = \arg\max_{\Psi} \sum_{l=1}^{N} \log\left[ \prod_{i=1}^{m} f(x_{il}; \psi_i)\, f(y_l; \psi_y)\, c\big(F(x_1), \ldots, F(x_m), F(y); \Sigma\big) \right] \qquad (10.19)
Prediction for a test point proceeds via the conditional density

P(y \mid x) = \frac{P(x, y)}{\int_y P(x, y)\, dy} \qquad (10.20)
Note that the term in the denominator of Equation (10.20) remains con-
stant; hence, for the purposes of finding the optimum, we can ignore its
evaluation. We simply evaluate this conditional for the entire range of Y in
discrete steps and pick the value of y∈Y that maximizes the conditional.
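A sketch of the copula-based prediction, with two simplifying assumptions relative to the chapter's full likelihood maximization of Equations (10.17) through (10.19): the Weibull marginals are fit per variable, and the copula covariance is estimated from the normal scores of the data. The conditional of Equation (10.20) is then evaluated on a grid of candidate y values, dropping the constant denominator.

    import numpy as np
    from scipy import stats

    def fit_copula(data):
        """data: (N, d) with the last column being y at the test site."""
        margins = [stats.weibull_min.fit(col, floc=0) for col in data.T]
        U = np.column_stack([stats.weibull_min.cdf(col, *p)
                             for col, p in zip(data.T, margins)])
        Z = stats.norm.ppf(np.clip(U, 1e-6, 1 - 1e-6))   # normal scores
        return margins, np.corrcoef(Z, rowvar=False)

    def predict(margins, R, x_k, y_grid):
        d = len(margins)
        Rinv, I = np.linalg.inv(R), np.eye(d)
        best, best_val = None, -np.inf
        for y in y_grid:
            obs = np.append(x_k, y)
            pdfs = np.array([stats.weibull_min.pdf(v, *p)
                             for v, p in zip(obs, margins)])
            u = np.clip([stats.weibull_min.cdf(v, *p)
                         for v, p in zip(obs, margins)], 1e-6, 1 - 1e-6)
            q = stats.norm.ppf(u)
            # Log of the Gaussian copula density plus the log marginals.
            log_c = -0.5 * (np.log(np.linalg.det(R)) + q @ (Rinv - I) @ q)
            val = log_c + np.sum(np.log(np.maximum(pdfs, 1e-300)))
            if val > best_val:
                best, best_val = y, val
        return best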
10.4 EVALUATION SETUP
To evaluate and compare our different algorithms, we acquired a variety
of wind data from the state of Massachusetts. We downloaded the data
from the ASOS (Automated Surface Observing System) airport database,
which is public and has wind data from 14 airports in Massachusetts col-
lected over the past 10 to 20 years. These data are frequently used by the
wind industry. The airports’ locations are shown in Figure 10.4. We then
acquired data from an anemometer positioned on the rooftop of Boston’s
Museum of Science where a wind vane is also installed. These anemometers
are inexpensive and, consequently, noisy. The museum is located among
buildings, a river, and is close to a harbor as shown in Figure 10.5. This
provides us with a site that is topographically challenging. At this location
FIGURE 10.4 (See color insert.) Data are referenced from fourteen airport loca-
tions in the state of Massachusetts (United States). See Table 10.1.
FIGURE 10.5 (See color insert.) Red circles show the locations of anemometers on the rooftop of the Museum of Science, Boston, Massachusetts.
10.5.1 Comparison of Algorithms
First we compare algorithms when the same amount of data is available
to each one of them for modeling. Results are presented in Figure 10.6
through Figure 10.8 for datasets D3, D6, and D8, respectively. Each plot
shows the KL distance between the ground truth distribution and the dis-
tribution estimated based on the predictions provided by each technique
for the Year 2 dataset per bin. We plot the KL distance for all 12 bins.
We notice that the copula modeling technique consistently performs
better than the other four techniques. The graphical model technique
that assumes a naive variable dependency structure performs second
best, although it demonstrates poor performance on the first bin. Its per-
formance on this bin, however, improves as we increase the size of the
dataset. One would expect the graphical model, which has a learned vari-
able dependency structure, to outperform the one with naive structure
assumptions. Here, except for the first bin, it does not. This may imply
that a better structure learning algorithm is necessary, or that the one
used needs further fine-tuning. The latter possibility is likely because the
structure learning algorithm K2 only looks at a fraction of all possible
structures when it references an order of the variables. A more robust
structure learning algorithm that does not assume order could potentially
yield improvements.
Linear regression is the worst performer of all, but performs well when
8 months of data are available. This is consistent with many studies in the
322 ◾ Computational Intelligent Data Analysis for Sustainable Development
FIGURE 10.6 KL distance per directional bin for each modeling technique, dataset D3 (3 months of site data).
wind energy area, where it has been found that an accurate estimation of the long-term distribution requires 8 months' worth of data.
FIGURE 10.7 KL distance per directional bin for each modeling technique, dataset D6 (6 months of site data).
• Estimate the wind speed density with as little site-collected data as possible.
• Estimate as accurately as possible with minimal cost, to support inexpensive site sensing.
FIGURE 10.8 KL distance per directional bin for each modeling technique, dataset D8 (8 months of site data).
FIGURE 10.9 Estimation accuracy (KL distance per bin) with 3, 6, and 8 months of data. Top (left): graphical models with naive structure. Top (right): linear regressions. Bottom (left): copulas. Bottom (right): Bayesian networks where structure is learned.
ACKNOWLEDGMENTS
We thank Steve Nichols (IIT Project Manager) and Marian Tomusiak
(Wind Turbine Lab Analyst), of the Museum of Science, Boston, for
assisting us in data acquisition and assessment. We thank the MIT Energy
Initiative for its sponsorship of Adrian Orozco, who prepared and syn-
chronized the data collected from the Museum of Science as an under-
graduate research assistant.
REFERENCES
1. Gross, R.C., and Phelan, P. Feasibility Study for Wind Turbine Installations
at Museum of Science, Boston. Technical report, Boreal Renewable Energy
Development (October, 2006).
2. Bass, J., Rebbeck, M., Landberg, L., Cabré, M., and Hunter, A. An improved
Measure-Correlate-Predict algorithm for the prediction of the long term
wind climate in regions of complex environment (2000). Available at www.res-group.com/media/234621/jor3-ct98-0295-finalreport.pdf. Research funded in part by the European Commission in the framework of the Non-Nuclear Energy Programme JOULE III.
3. Bailey, B., McDonald, S., Bernadett, D., Markus, M., and Elsholz, K. Wind
Resource Assessment Handbook: Fundamentals for Conducting a Successful
Monitoring Program. Technical report, National Renewable Energy Lab,
Golden, CO; AWS Scientific, Inc., Albany, NY, (1997).
4. Lackner, M., Rogers, A., and Manwell, J. The round robin site assessment
method: A new approach to wind energy site assessment. Renewable Energy,
33(9), 2019–2026, 2008.
5. Encraft: Warwick Wind Trials Final Report. Technical report, p. 28, Encraft
(2009). Available at www.warwickwindtrials.org.uk/resources/Warwick
+Wind+Trials+Final+Reports.pdf.
6. Shaw, S. Progress Report on Small Wind Energy Development Projects
Receiving Funds from the Massachusetts Technology Collaborative (MTC).
Cadmus Group Inc., Waltham, MA, USA. (2008).
7. Wagner, M., Veeramachaneni, K., Neumann, F., and O’Reilly, U. Optimizing
the layout of 1000 wind turbines. In Scientific Proceedings of European Wind
Energy Association Conference (EWEA 2011) (2011).
8. Rogers, A., Rogers, J., and Manwell, J. Comparison of the performance of
four measure-correlate-predict algorithms. Journal of Wind Engineering and
Industrial Aerodynamics, 93(3), 243–264, 2005.
9. Chan, C., Stalker, J., Edelman, A., and Connors, S. Leveraging high per-
formance computation for statistical wind prediction. In Proceedings of
WINDPOWER 2010 (2010).
10. Frank, E., Trigg, L., Holmes, G., and Witten, I. Technical note: Naive Bayes
for regression. Machine Learning, 41(1), 5–25, 2000.
11. Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and
Techniques. The MIT Press, Cambridge, MA (2009).
12. Heckerman, D., Geiger, D., and Chickering, D. Learning Bayesian networks:
The combination of knowledge and statistical data. Machine Learning, 20(3),
197–243, 1995.
13. Cooper, G.F. and Herskovits, E. A Bayesian method for the induction of
probabilistic networks from data. Machine Learning, 9(4), 309–347, 1992.
14. Murphy, K. et al. The Bayes net toolbox for MATLAB. Computing Science
and Statistics, 33(2), 1024–1034, 2001.
15. Burton, T., Sharpe, D., Jenkins, N., and Bossanyi, E. Wind Energy: Handbook.
Wiley Online Library (2001).
16. Iyengar, S. Decision-Making with Heterogeneous Sensors—A Copula Based
Approach. Ph.D. dissertation (2011). Syracuse University, Syracuse, NY, USA.
17. Nelsen, R. An Introduction to Copulas. Springer Verlag, Berlin and New York
(2006).
18. Elidan, G. Copula Bayesian networks. Ed. J. Lafferty, C.K.I. Williams,
J. Shawe-Taylor, R.S. Zemel and A. Culotta. Advances in Neural Information
Processing Systems, 24, 559–567, 2010.
19. Eaton, M.L. Multivariate Statistics: A Vector Space Approach. Wiley, New York (1983).
APPENDIX 10A
Below we describe how to derive the conditional density function parameters for y given x under the assumption that the joint is modeled as a normal. We first partition the mean and the covariance matrix for the joint distribution of z as follows:

\mu = \begin{pmatrix} \mu_y \\ \mu_x \end{pmatrix} \qquad (10A.1)

with sizes

\begin{pmatrix} 1 \times 1 \\ m \times 1 \end{pmatrix} \qquad (10A.2)

\Sigma = \begin{pmatrix} \Sigma_{yy} & \Sigma_{yx} \\ \Sigma_{xy} & \Sigma_{xx} \end{pmatrix} \qquad (10A.3)

with sizes

\begin{pmatrix} 1 \times 1 & 1 \times m \\ m \times 1 & m \times m \end{pmatrix} \qquad (10A.4)
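This partition yields the standard conditional-Gaussian result: the conditional mean of y given x is \mu_y + \Sigma_{yx} \Sigma_{xx}^{-1}(x - \mu_x), and the conditional variance is \Sigma_{yy} - \Sigma_{yx} \Sigma_{xx}^{-1} \Sigma_{xy}. A minimal numerical sketch of these formulas (function and variable names are ours):

    import numpy as np

    def conditional_gaussian(mu, sigma, x):
        """Parameters of y | x when z = (y, x) is jointly normal.

        mu    : (1+m,) mean vector, y first (Eq. 10A.1).
        sigma : (1+m, 1+m) covariance, partitioned as in Eq. (10A.3).
        x     : (m,) observed conditioning vector.
        """
        mu_y, mu_x = mu[0], mu[1:]
        s_yy, s_yx, s_xx = sigma[0, 0], sigma[0, 1:], sigma[1:, 1:]
        w = np.linalg.solve(s_xx, x - mu_x)        # Sigma_xx^{-1} (x - mu_x)
        cond_mean = mu_y + s_yx @ w
        cond_var = s_yy - s_yx @ np.linalg.solve(s_xx, s_yx)
        return cond_mean, cond_var

    # Example with m = 2 conditioning variables (hypothetical numbers).
    mu = np.array([5.0, 4.0, 6.0])
    sigma = np.array([[2.0, 0.8, 0.6],
                      [0.8, 1.5, 0.3],
                      [0.6, 0.3, 1.2]])
    print(conditional_gaussian(mu, sigma, np.array([4.5, 6.2])))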
Chapter 11
Spatio-Temporal Correlations
in Criminal Offense Records
Jameson L. Toole, Nathan Eagle,
and Joshua B. Plotkin
CONTENTS
11.1 Introduction
11.2 Data
11.2.1 Conditioning the Data
11.3 Methods
11.3.1 Basic Analysis and Statistics
11.3.2 Auto- and Cross-Correlation
11.3.3 Correlation Matrices
11.3.4 The Eigenvalue Spectrum and Comparison to Random Matrices
11.3.5 Daily Drug-Related Crime Rates in 1999
11.3.6 Weekly Theft-Related Crime Rates from 1991 to 1999
11.4 Summary and Conclusion
Acknowledgments
References

11.1 INTRODUCTION
modeled across distances of many orders of magnitude [20], and the dif-
fusion of information can be measured for large populations [10]. Armed with this flood of data, researchers have very real and important opportunities to study and ultimately facilitate sustainability in human-built systems.
Insight into these systems can help inform the social sciences from eco-
nomics to sociology as well as provide policy makers with critical answers
that may be used to better allocate scarce resources or implement benefi-
cial social programs.
In order to generate these new and hopefully better solutions, however,
it has become necessary to use broader combinations of tools to analyze
the immense and rich stream of information. The goal is to use this data to
gain a better understanding of the systems that generate it. In this chapter
we present a novel application of tools and analytical techniques developed
from a variety of disciplines that identify patterns and signals that capture
fundamental dynamics of a social system. To explore relationships in both
space and time, cross- and auto-correlation measures are combined with
autoregressive models and results from random matrix theory to analyze
patterns in behavioral data. Similar techniques have been applied recently
to partition space based on patterns observed in mobile phone data or
Wi-Fi activity [2, 14]. We show that these techniques can also be applied
to criminal activity.
The dataset used for this study consists of criminal events within the
city of Philadelphia from the year 1991 through 1999. It contains nearly
1 million individual criminal offense reports detailing the time, place, and
police response to theft, robbery, and burglary-related crimes. In addi-
tion to these minor offenses, for the year 1999, the dataset includes major
offenses as well, covering crimes from petty theft through homicide. With
these reported crimes, we examine spatial, temporal, and incident infor-
mation. The goal of our analysis is to explore the spatio-temporal dynam-
ics of criminal events with the hope of identifying patterns that may be
useful in predicting and preventing future criminal activity. Beyond
applications to criminology, however, we feel that these techniques can
be applied to a wide range of systems that exhibit complex correlations in
many dimensions and on multiple scales.
Early work in criminology, sociology, psychology, and economics explored relationships between criminal activity and socioeconomic variables such as education, community disorder, ethnicity, etc. [8, 21]. Constraints on the availability of data limited these studies to aggregate statistics for large populations and vast geographic regions.
11.2 DATA
The dataset is analyzed in two parts. The first contains nearly 1 million theft-
related crimes from the year 1991 through 1999, while the second consists
of almost all 200,000 reported crimes within the city of Philadelphia dur-
ing 1999 across all types, from petty theft to homicide. In total, crimes
were reported at 42,000 unique locations across the city and were time
stamped with the hour they occurred. In addition to time and place, a
detailed description of the crime (e.g., theft under $200, aggravated assault
with a handgun, etc.) is provided. Table 11.1 shows an example of one such
report (note that these data have been generated randomly for anonym-
ity purposes and are not actual reports). With this data, the primary goal
of this research is to better understand the spatio-temporal structure of
criminal events.
The spatial resolution of this data is high enough that a block/
neighborhood analysis of crime is possible. Plotting the geocoded events
reveals features of the city such as the street-grid, parks, bridges, rivers,
etc. (Figure 11.1). While the time of each report is known to within the
hour, offenses within a geographic area are generally aggregated to daily,
weekly, or monthly counts, ensuring that time series are sufficiently popu-
lated. A time series displaying citywide theft-related crimes for different
levels of aggregation and time windows reveals features on multiple scales.
Seasonal trends, such as increases in crime during hot summer months as
well as singular events such as holidays, are visible (Figure 11.2). Finally,
when applicable, offense reports are aggregated by type (Figure 11.3) so
relationships between crimes can be tested. Although data on crime other
than theft are only available for 1999, the 200,000 crimes reported that
year still represent a very rich and detailed dataset with which we can
examine interactions between different types of crime.
FIGURE 11.1 All crimes, major and minor, are plotted on an overlay of census
tracts in Philadelphia county during the year 1999. Geographic features of the
city such as rivers, parks, bridges, etc., are immediately visible.
FIGURE 11.2 A time series plot of citywide theft-related crimes at different time
scales. The top figure shows daily theft crimes for the year 1999 where individual
events such as holidays are visible. The bottom figure aggregates further into
monthly counts, revealing seasonal trends such as increases in thefts during the
hot summer months.
The flexible mesh grid allows us to use larger spatial bins for rarer crime
types, such as those that are drug related. The second spatial aggregation used was the set of 381 census tracts from the 2000 U.S. Census for the City of
Philadelphia. Census tracts have the nice feature of scaling inversely with
population density; thus, areas of the city with small numbers of people
(and similarly small amounts of crime) will not be washed out by much
higher crime counts in the more dense city center. In addition to conve-
nient scaling properties, the use of census tracts also allows us to incorpo-
rate various socioeconomic and demographic data that might be related to
crime. Finally, census tracts provide an immediately recognizable unit of
analysis for policy makers.
FIGURE 11.3 Breakdown of reported crimes by type: theft 20%, assault 12%, false reports 9%, vandalism 8%, traffic 7%, auto 7%, harassment 5%, robbery 4%, terror 3%, disorderly conduct 1%, graffiti < 1%, arson < 1%.
11.3 METHODS
Using the conditioned data, we develop analytical tools to achieve the
following:
FIGURE 11.4 The left plot shows the density of drug-related crimes for the year 1999. Hotspots are located mostly in central Philadelphia, whereas many theft-related crimes (right plot) are mostly located in the southern areas of the city.
In the time domain, we can identify general periodicities within the data
through basic Fourier analysis. Ignoring space for a moment, we consider
the city in its entirety, creating a time series of citywide crime sampled
hourly (Figure 11.5). From these methods we can quantify distinct seasonal
trends. Cycles exist from hourly to yearly scales. These periodicities visually
coincide with time series showing increases in crime during hot summer
months, or decreases in certain types of crime on weekends that produce
weekly trends. Cycles on smaller scales such as days may come from dif-
ferences in day and night crime rates and hourly frequencies may be due to
police procedure. This analysis, however, is blind to any measure of auto- or
cross-correlation that may occur between locations within the city.
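As an illustration of this kind of spectral analysis, the following sketch recovers a weekly periodicity from hourly counts; the synthetic series is a stand-in for the actual citywide data:

    import numpy as np

    # Hypothetical hourly citywide counts over 9 years with a weekly cycle.
    rng = np.random.default_rng(1)
    t = np.arange(24 * 365 * 9)
    hourly = rng.poisson(5 + 2 * np.sin(2 * np.pi * t / (24 * 7)))

    # Frequency spectrum in cycles per year (sample spacing = 1/8760 year).
    spectrum = np.abs(np.fft.rfft(hourly - hourly.mean()))
    freq = np.fft.rfftfreq(len(hourly), d=1.0 / (24 * 365))
    print("strongest periodicity: %.1f cycles/year" % freq[np.argmax(spectrum)])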
Running basic regressions on these citywide time series shows a num-
ber of interesting results. Regressing citywide drug offenses on day of the
week, for example, reveals significant correlation. Considering only the day
of the week, we are able to account for nearly 60% of the variance in daily
drug offenses (Table 11.3). With Sunday being the omitted group, coef-
ficients on dummy variables corresponding to Monday through Saturday show that drug offenses rise, while violent offenses fall, on weekdays. We also note that violent crimes are four to five times more frequent than drug crimes.

FIGURE 11.5 The frequency spectrum of the citywide theft crime time series sampled hourly from 1991 through 1999. Many periodicities appear on time scales ranging from just a few hours to years.

TABLE 11.3 Regression of Citywide Drug and Violent Offenses on Day of the Week

Day         Drugs     Violence
Sunday      16.55a    76.87a
Monday       4.61b   −13.20a
Tuesday     22.52a   −10.63a
Wednesday   23.99a   −10.71a
Thursday    21.63a   −11.70a
Friday      15.52a    −4.47b
Saturday     8.08a     2.57
Note: R²_drug = .59, R²_viol = .30.
a pval < 0.001; b pval < 0.05; c pval < 0.1.
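A minimal sketch of such a dummy-variable regression, with Sunday as the omitted group; the Poisson-generated counts below are hypothetical stand-ins for the real daily series:

    import numpy as np

    rng = np.random.default_rng(2)
    weekday = np.arange(365) % 7                      # 0 = Sunday, ..., 6 = Saturday
    lam = 17 + 6 * ((weekday >= 1) & (weekday <= 5))  # higher rate on weekdays
    counts = rng.poisson(lam)

    # Design matrix: intercept (Sunday) plus dummies for Monday..Saturday.
    X = np.column_stack([np.ones(365)] +
                        [(weekday == d).astype(float) for d in range(1, 7)])
    beta, *_ = np.linalg.lstsq(X, counts, rcond=None)

    resid = counts - X @ beta
    print("Sunday intercept:", round(beta[0], 2))
    print("Mon..Sat coefficients:", np.round(beta[1:], 2))
    print("R^2:", round(1 - resid.var() / counts.var(), 2))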
While in absolute terms most crime occurs during weekdays, observa-
tion of these inverse relationships for certain types of crimes reveals the
need to carefully choose the amount of aggregation applied to analysis.
It remains unclear, however, if these relationships exist because of some
fundamental differences in those committing drug offenses versus violent
offenses, or if they are some artifact of police strategy or organization.
We also note that environmental factors have small, but statistically
significant impacts on crime rates. Using daily weather records in 1999 as
kept by the National Oceanic and Atmospheric Administration (NOAA),
we regress crime rates on environmental factors. While these effects are
not overwhelmingly strong, they are statistically significant. We find that
temperature increases can be associated with an increase in crime and
that precipitation leads to a decrease. Comparing the coefficients of these
effects for different crime types, we find interesting differences.
To compare coefficients between crimes that occur with differing fre-
quency, we regress the log of occurrences on both temperature and pre-
cipitation. The coefficient then represents the percentage change in crime
rates due to an increase of 1°F or 1 inch of precipitation, respectively.
Table 11.4 shows the results of this regression. We find that drug-related
crimes, which may be driven by psychological or physiological needs, are
not affected by weather, while violent crimes, which are more likely to be
driven by passion and environment, respond significantly to increases in
temperature or precipitation.
Although these basic statistics provide insight into the types of rela-
tionships that exist within the data, they remind us that complex relation-
ships exist on multiple scales in both space and time. We continue with
more advanced methods, capable of teasing out these relationships despite
noisy data.
11.3.2 Auto- and Cross-Correlation
For two time series y_1 and y_2 of length n, the correlation is

r_{1,2} = E[\langle y_1, y_2 \rangle] = \sum_{t=1}^{n} y_1(t)\, y_2(t)

and the delayed cross-correlation at lag m is

r_{1,2}(m) = \sum_{t=1}^{n} y_1(t+m)\, y_2(t)
FIGURE 11.6 Lagged cross-correlations (lags in days) between the theft- and violence-related time series at each node, marked as significant or insignificant.

The figure shows significant correlation between the two crime types at exactly zero lag.
The lack of significant correlation for other time lags indicates no other
significant relationships where theft in one location leads to violence in
that same location at a later time. We find very little significant correlation
between the two types of crime, suggesting that, at the very least, types of
crime are not related on temporal scales of less than a month.
11.3.3 Correlation Matrices
Having established a measure of correlation and corresponding null
model to assess significance, we seek to couple this analysis with spatial
dimensions. We would like to detect correlations not only in time, but also
in space.
To do this, we form a K × T matrix, Y, where K is the number of loca-
tions across the city and T is the length of each time series. Keeping track
of which location each time series corresponds to, we can associate real
city locations with correlations. The delayed correlation matrix for a spe-
cific lag m, C(m), is then constructed by matrix multiplication
C(m) = \frac{1}{T}\, Y\, Y^T(m)

with elements

C_{ij}(m) = \frac{1}{T} \sum_{t=1}^{T} y_i(t)\, y_j(t+m)
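A minimal sketch of this construction, assuming the K series are already normalized to zero mean and unit variance as the formula requires (names and data are ours):

    import numpy as np

    def delayed_correlation(Y, m):
        """C(m) for a K x T matrix of normalized time series:
        C_ij(m) = (1/T') sum_t y_i(t) y_j(t + m), computed over the
        T' = T - m overlapping samples."""
        K, T = Y.shape
        return Y[:, : T - m] @ Y[:, m:].T / (T - m)

    # Hypothetical example: 30 locations, 3 years of daily counts.
    rng = np.random.default_rng(3)
    Y = rng.normal(size=(30, 1095))
    Y = (Y - Y.mean(1, keepdims=True)) / Y.std(1, keepdims=True)
    C7 = delayed_correlation(Y, m=7)       # 7-day lag
    print(C7.shape)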
FIGURE 11.7 (a) The zero-lag correlation matrix for drug-related crimes aggregated to a spatial lattice. There appears to be little spatial correlation and a lack of highly correlated locations. (b) The zero-lag correlation matrix for theft-related crimes sampled weekly from 1991 through 1999, aggregated by census tracts. A stronger correlation signal is seen.
FIGURE 11.8 (a, top) Only one eigenvalue, λ1 = 3, can be differentiated from the
noise indicated by the solid line. (a, bottom) The solid curve is the eigenvalue
density of the actual matrix spectra, while the dashed curve is the theoretical
prediction from Equation (11.1). (b) We plot the maximum eigenvalue of the
delayed correlation matrix for each of 30 lags. For drug-related crimes, we see a
very clear periodicity at a frequency of 7 days (1 week).
IPR(\nu_i) = \sum_{j=1}^{K} |u_{ij}|^4
[1]. A large IPR implies that only a few components contribute to the
eigenvector, while a small IPR indicates participation of many compo-
nents. It is possible to determine clustering structure from such analysis.
For example, in financial data, the eigenvector corresponding to the large
“market” eigenvalue has a low IPR, identifying itself as a force that affects
all stocks equally. Other eigenvectors, with larger IPRs, have components
that are concentrated in various sectors of the market [1]. For crime data,
these components correspond to locations across the city so a cluster of
eigenvector components would correspond to a cluster of neighborhoods.
Examining the IPRs for significant eigenvectors in lagged correlation
matrices, our results show that the eigenvector corresponding to the larg-
est eigenvalue has a low IPR and can thus be interpreted as a “market”
force. For the remaining significant eigenvectors, we find that they too
have low IPRs, suggesting there is little clustering or community structure
(Figure 11.9).
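A sketch of this analysis under the usual random-matrix null: eigenvalues of the zero-lag correlation matrix are compared with the Marchenko–Pastur upper bound (1 + sqrt(K/T))², and the IPR is computed for every eigenvector. The data below are a pure-noise stand-in; real location time series would take their place:

    import numpy as np

    rng = np.random.default_rng(4)
    K, T = 100, 470                  # locations x weekly samples (hypothetical)
    Y = rng.normal(size=(K, T))      # null data
    Y = (Y - Y.mean(1, keepdims=True)) / Y.std(1, keepdims=True)
    C = Y @ Y.T / T                  # zero-lag correlation matrix

    # Marchenko-Pastur noise band for purely random correlations.
    q = K / T
    lam_min, lam_max = (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2

    evals, evecs = np.linalg.eigh(C)     # ascending eigenvalues
    ipr = np.sum(evecs ** 4, axis=0)     # IPR of each eigenvector (column)

    print("noise band: [%.3f, %.3f]" % (lam_min, lam_max))
    print("eigenvalues above the band:", evals[evals > lam_max])
    print("IPR of the top eigenvector: %.4f" % ipr[-1])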
Remembering the strong correlation found between day of the week
and the number of reported crimes from basic regressions outlined above,
we find that it is possible to recreate this eigenvalue signal by constructing
artificial time series using the regression coefficients from Table 11.3 and a
Poisson random number generator. Beyond this single, large market eigen-
value, we cannot distinguish any other significant structure in the spec-
trum of correlation matrices generated by daily drug-related crime rates.
For drug-related crimes, it appears as though little signal exists beyond
the weekly rise and fall of offenses reported from weekend to weekday. It
is unclear if this periodicity is due to some universal truth of drug crimes
or simply to police procedure.
FIGURE 11.9 A plot of the IPR of the eigenvectors of the delayed correlation
matrix for drug crimes with a 7-day lag.
FIGURE 11.10 The eigenvalue spectrum of the zero-lag correlation matrix from
weekly counts of theft-related crimes, aggregated at the census tract level. The
solid line represents the RMT prediction, while the dashed line is a fit to the
actual distribution using the min and max eigenvalues as fitting parameters.
FIGURE 11.11 (a) IPR values for eigenvalues associated with the zero-lag corre-
lation matrix for weekly theft time series. Points outside the gray area represent
significance outside of the bounds predicted by RMT. There are more significant eigenvalues, and the results suggest that larger eigenvalues have low IPRs, corresponding to global forces that affect crime across the city, while small eigenvalues are associated with signals generated from just a few locations. (b) Component
density distributions of significant eigenvalues (λ1, λ2, λ4) confirm our interpreta-
tion of a “market” eigenvalue acting on nearly all locations with the same bias.
Selecting an eigenvector from the random part of the distribution (i.e., λ87) shows
good agreement with theoretical predictions.
For the correlation matrices of weekly theft time series from 1991 to 1999
for the city’s 381 census tracts, Figure 11.11(b) compares the component
distribution to the theoretical distribution. We find that the eigenvector
associated with the largest eigenvalue has nearly all positive components.
Despite the modest IPR, it is still possible to identify it as a market force.
It has a large positive bias across all locations of the city. Other significant
eigenvectors also have nonuniform distributions. We compare these to an
eigenvector from the random part of the spectrum that shows good agree-
ment with RMT results.
To accurately quantify and interpret structure in the remaining sig-
nificant eigenvalues, we must first remove the dominating influence of the
largest. Because the strong “market” eigenvector acts on all locations across
the city with the same bias, we can recreate this global influence by project-
ing the citywide crime time series onto the eigenvector u1 (this procedure is
outlined by Plerou et al. [13]). This time series can be viewed as an estimate
of citywide crime based on the most prominent factor. Denoting the origi-
nal normalized time series as Y(t), we construct the projection
Y^1(t) = \sum_{j=1}^{381} u_j^1\, Y_j(t) \qquad (11.3)

Comparing this time series with the original weekly citywide crime, we find strong agreement over nearly 10 years of weekly data, with a correlation coefficient of \langle Y(t)\, Y^1(t) \rangle = 0.95.
Having established a reasonable proxy for the market forces acting on
crime rates at all locations, we regress the location time series on this global
force and use residuals that are free from its influence. For the time series
associated with each location Y_i(t), we perform the following regression:

Y_i(t) = \alpha_i + \beta_i Y^1(t) + \varepsilon_i(t) \qquad (11.4)

where \alpha_i and \beta_i are location-specific fit parameters. The residual time series \varepsilon_i(t) are then used to compute the same correlation matrices and spectral analysis as described previously, but this time with the absence of global trends.
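A compact sketch of this two-step procedure, Equations (11.3) and (11.4), on stand-in data consisting of a shared factor plus noise:

    import numpy as np

    rng = np.random.default_rng(5)
    K, T = 381, 470
    market = rng.normal(size=T)                     # shared citywide factor
    Y = 0.8 * market + rng.normal(size=(K, T))      # hypothetical location series
    Y = (Y - Y.mean(1, keepdims=True)) / Y.std(1, keepdims=True)

    # Leading eigenvector u1 of the zero-lag correlation matrix.
    evals, evecs = np.linalg.eigh(Y @ Y.T / T)
    u1 = evecs[:, -1]

    # Eq. (11.3): projection time series Y1(t) = sum_j u1_j Y_j(t).
    Y1 = u1 @ Y

    # Eq. (11.4): per-location OLS on Y1; residuals lack the market mode.
    X = np.column_stack([np.ones(T), Y1])
    beta, *_ = np.linalg.lstsq(X, Y.T, rcond=None)  # (2, K) fit parameters
    residuals = Y - (X @ beta).T
    print(residuals.shape)                          # (381, 470)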
We now take the significant eigenvectors of the residual correlation
matrices and examine their component structure. Large components of
specific eigenvectors correspond to locations across the city that are all
similarly biased by whatever force is associated with the vector. When we
plot the largest 10% of components for the remaining significant eigen-
vectors geographically, we find that large components of each vector are
strongly correlated spatially.
Figure 11.12 shows the spatial distribution of the largest components
for different residual eigenvectors. The vectors associated with larger eigenvalues act primarily on neighborhoods in high-crime areas near central Philadelphia. Other eigenvalues produce similar spatial clusters, although
interpretation of why these locations are clustered is left as an open ques-
tion. This analysis suggests that weekly time scales reveal much richer
spatial structure. We found that performing similar procedures using
monthly time series reduces the amount of correlation, suggesting that
the weekly time scale is the correct choice for analysis of neighborhood
crime trends.
The problem of scale selection is one that can be dealt with naturally
given the algorithms applied in this chapter. While Fourier analysis,
regressions, and cross-correlation measures all give indications as to the
amount of signal in the data, none provided our analysis with a satisfactory
selection of scale. For example, the strongest peak in the Fast Fourier Transform (FFT) of citywide theft counts (Figure 11.5) is found at once per 3 hours. This result most likely has something to do with police proce-
dures such as shift changes, but selecting the hourly scale for all subsequent
analysis would surely return time series too noisy and unpopulated for use.
By using results from Random Matrix Theory as a null model, we
can easily measure how much signal can be distinguished from noise by
observing eigenvalues above predicted maxima. Furthermore, changes in
both spatial and temporal scales affect this spectrum in the same way—
increasing or decreasing the number and magnitude of significant eigen-
values. Thus, we can select an appropriate scale by sweeping over time from
hours to months, and space from a small lattice to a large one. Selecting the
combination that maximizes these deviations allows us to extract the most
signal from noisy data, aggregating as much, but not more, than necessary.
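The sweep itself is straightforward to sketch: aggregate the counts at several temporal scales, and keep the scale with the most eigenvalues above the Marchenko–Pastur maximum. The aggregation helper and the synthetic hourly counts are our assumptions:

    import numpy as np

    def n_significant(Y):
        """Number of eigenvalues above the Marchenko-Pastur maximum."""
        K, T = Y.shape
        Yn = (Y - Y.mean(1, keepdims=True)) / Y.std(1, keepdims=True)
        evals = np.linalg.eigvalsh(Yn @ Yn.T / T)
        return int(np.sum(evals > (1 + np.sqrt(K / T)) ** 2))

    def aggregate(counts, width):
        """Sum hourly counts into windows of `width` hours per location."""
        K, T = counts.shape
        T2 = (T // width) * width
        return counts[:, :T2].reshape(K, -1, width).sum(axis=2)

    rng = np.random.default_rng(6)
    hourly = rng.poisson(0.2, size=(30, 24 * 365 * 3))   # hypothetical counts
    for width in (24, 24 * 7, 24 * 30):                  # daily, weekly, monthly
        print(width // 24, "days:", n_significant(aggregate(hourly, width)))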
We can refine these statements even further by noting that whatever
social phenomena are behind these significant eigenvectors (or principal
factors), they are independent in some way. Whatever force is driving
crime rates in the locations corresponding to large components of the
second eigenvector is separate enough either in cause or manifestation
from the force behind the third vector. This is a level of interpretation
currently not offered by the majority of tools aimed at explaining causal
relationships. The underlying social dynamics of these factors still remain
FIGURE 11.12 Plotting the geographic location of the largest 10% of components
for the first nine significant eigenvectors reveals strong spatial correlation. We
conclude that these eigenvectors correspond to neighborhoods or sets of neigh-
borhoods and represent the forces that affect crime rates there.
ACKNOWLEDGMENTS
The authors would like to thank the Santa Fe Institute and the wonderful
community of people that has helped us with their thoughts and ideas. We
would also like to thank the NSF Summer REU program for partially funding
this research.
REFERENCES
1. Christoly Biely and Stefan Thurner. Random matrix ensembles of time-
lagged correlation matrices: Derivation of eigenvalue spectra and analysis of
financial time-series. Quantitative Finance, 8(7): 705–722, October 2008.
2. Francesco Calabrese, Jonathan Reades, and Carlo Ratti. Eigenplaces:
Segmenting space through digital signatures. IEEE Pervasive Computing,
9(1): 78–84, January 2010.
19. Akihiko Utsugi, Kazusumi Ino, and Masaki Oshikawa. Random matrix
theory analysis of cross correlations in financial markets. Physical Review E,
70(2): 026110+, August 2004.
20. Duncan J. Watts, Roby Muhamad, Daniel C. Medina, and Peter S. Dodds.
Multiscale, resurgent epidemics in a hierarchical metapopulation model.
Proceedings of the National Academy of Sciences of the United States of
America, 102(32): 11157–11162, August 2005.
21. David Weisburd, Gerben J.N. Bruinsma, and Wim Bernasco. Units of
analysis in geographic criminology: Historical development, critical issues
and open questions. In David Weisburd, Wim Bernasco, and Gerben J.N.
Bruinsma, Editors, Putting Crime in its Place: Units of Analysis in Geographic
Criminology. New York: Springer, 2009.
22. Per-Olof Wikström, Vania Ceccato, Beth Hardie, and Kyle Treiber. Activity fields and the dynamics of crime. Journal of Quantitative Criminology, 26: 55–87, 2010. doi:10.1007/s10940-009-9083-9.
23. Jeffrey Wooldridge. Introductory Econometrics: A Modern Approach (with
Economic Applications, Data Sets, Student Solutions Manual Printed Access
Card). South-Western College Publishing, Mason, Ohio, 4th edition, March
2008.
Chapter 12

Constraint and Optimization Techniques for Supporting Policy Making

CONTENTS
Symbol Definitions
12.1 The Problem
12.1.1 Regional Planning and Impact Assessment
12.2 Why Constraint-Based Approaches?
12.2.1 A CLP Model
12.3 The Regional Energy Plan
12.4 The Regional Energy Plan 2011–2013
12.5 Added Value of CLP
12.6 Conclusion and Future Open Issues
Acknowledgments
References
SYMBOL DEFINITIONS
A    Vector of activities. For energy sources, it is measured in megawatts (MW).
N_a  Number of activities: N_a = |A|.
a_i  Element of the A vector: i ∈ {1, …, N_a}.
P    Vector of pressures.
12.1 THE PROBLEM
The magnitudes of the secondary activities are derived from those of the primary activities:

\forall j \in A_S:\quad G_j = \sum_{i \in A_P} d_{ij}\, G_i

where A_P and A_S denote the sets of primary and secondary activities, respectively.
Given a budget B_{Plan} available for a given plan, we have a constraint limiting the overall plan cost as follows:

\sum_{i=1}^{N_a} G_i c_i \le B_{Plan} \qquad (12.1)

Similarly, the plan should achieve at least a required overall outcome o_{Plan}:

\sum_{i=1}^{N_a} G_i o_i \ge o_{Plan}
For example, in an energy plan, the outcome can be to have more energy available in the region, so o_{Plan} could be the increased availability of electrical power (e.g., in kilo-TOE, Tonnes of Oil Equivalent). In such a case, o_i will be the production in kTOE for each unit of activity a_i.
Concerning the impacts of the regional plan, we sum up the contribu-
tions of all the activities and obtain an estimate of the impact on each
environmental pressure:
\forall j \in \{1, \dots, N_p\}:\quad p_j = \sum_{i=1}^{N_a} m_{ij}\, G_i \qquad (12.2)

From the pressures, we then estimate the impact on each environmental receptor:

\forall j \in \{1, \dots, N_r\}:\quad r_j = \sum_{i=1}^{N_p} n_{ij}\, p_i \qquad (12.3)
The objective function can maximize, say, the air quality or the quality of the surface water. In this case, the produced plan decisions are less intuitive and the system we propose is particularly useful. The link between decisions on primary and secondary activities and their consequences on the environment is far too complex to be considered manually. Clearly, more complex objectives can be pursued by properly combining the above-mentioned aspects.
For the energy plan, each primary source must provide at least a given fraction F_i of the total produced energy T_o:

\forall i \in A_P:\quad G_i o_i \ge F_i\, T_o, \qquad T_o = \sum_{j \in A_P} G_j o_j

Each source is also bounded above by its regional potential U_i:

\forall i \in A_P:\quad G_i o_i \le U_i

and the renewable sources, taken together, must reach a minimum level L_{ren}:

\sum_{i \in A_P^{ren}} G_i o_i \ge L_{ren}
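The chapter's model is solved with constraint logic programming; purely to illustrate the same linear structure, here is a sketch of a toy instance posed as a linear program with SciPy (all coefficients hypothetical):

    import numpy as np
    from scipy.optimize import linprog

    # Five primary energy activities with per-unit cost c_i (Meuro),
    # outcome o_i (kTOE), and air-quality contribution q_i (hypothetical).
    c = np.array([2.0, 4.5, 3.0, 1.8, 5.0])
    o = np.array([1.0, 0.6, 0.8, 0.9, 0.5])
    q = np.array([-0.5, 0.9, 0.4, 0.7, 0.8])
    B_plan, o_plan = 3000.0, 177.0         # budget cap and required outcome

    # Maximize q.G subject to c.G <= B_plan and o.G >= o_plan, with G >= 0.
    res = linprog(-q,                      # linprog minimizes, so negate q
                  A_ub=np.vstack([c, -o]),
                  b_ub=np.array([B_plan, -o_plan]),
                  bounds=[(0, None)] * 5)
    print(np.round(res.x, 1), "air quality:", round(-res.fun, 1))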
For each energy source, the plan should provide the installed power and the total produced energy.
The ratio between installed power and total produced energy is mainly
influenced by the availability of the source: while a biomass plant can
(at least in theory) produce energy 24/7, the sun is available only during
the day, and the wind only occasionally. For unreliable sources an average
for the whole year is taken.
The cost of the plant, instead, depends mainly on the installed power:
a solar plant has an installation cost that depends on the square meters of
installed panels, which in turn can provide some maximum power (peak power).
It is worth noting that the considered cost is the total cost of the plant
for the regional system, which is not the same as the cost for the taxpay-
ers of the Emilia-Romagna region. In fact, the region can enforce policies
in many ways, convincing private stakeholders to invest in power pro-
duction. This can be done with financial leverage, or by giving favorable
conditions (either economic or other) to investors. Some power sources
are economically profitable, so there is no need for the region to give sub-
sidies. For example, currently in Italy, biomass is economically advan-
tageous for investors, so private entities are proposing projects to build
biomass plants. On the other hand, biomass also produces pollutants and is not always sustainable (see [4] for a discussion), so local committees are rather likely to rise up against the construction of new plants. For these
reasons, there is a limit on the number of licenses the region gives to pri-
vate stakeholders for building biomass-based plants.
Technicians in the region estimated (considering current energy require-
ments, growth trends, foreseen energy savings) the total energy requirements
for 2020; out of this, 20% should be provided by renewable sources. They
also proposed the share of this amount to be provided during the 2011–2013 plan: about 177 kTOE of electrical energy and 296 kTOE of thermal energy.
Starting from these data, they developed a plan for electrical energy
and one for thermal energy.
FIGURE 12.1 Plot of the extreme plans using only one energy source, compared
with the plan by the region’s experts.
FIGURE 12.2 Pareto frontier of plans trading off cost (M€) against quality of air; labeled points include the minimum-cost plan, an intermediate plan, and the plans with the same cost and the same air quality as the experts' plan.
Table 12.2 contains the plan developed by the region’s experts, while
Table 12.3 shows the plan on the Pareto curve that has the same air qual-
ity as the plan of the experts. The energy produced by wind generators is
almost doubled (as they provide a very convenient ratio (air quality)/cost;
see Figure 12.1); we have a slight increase in the cheap biomass energy,
while the other energy sources reduce accordingly.
Concerning the environmental assessment, we plot in Figure 12.3 the
value of the receptors in significant points of the Pareto front. Each bar
represents a single environmental receptor for a specific plan on the Pareto
frontier of Figure 12.2. In this way it is easy to compare how receptors are
impacted by different plans. In the figure, the white bar is associated with the plan on the frontier that has the highest air quality, while dark bars are associated with plans that have a low cost (and, thus, a low air quality). Notice that the receptors have different trends: some of them improve as we move along the frontier toward higher air quality (like climate quality, mankind wellness, and the value of material goods), while others worsen.
FIGURE 12.3 Values of the environmental receptors (energy availability, water availability, wellness of wildlife, quality of climate, air quality, groundwater quality, embankment stability, subsidence limitation) at significant points of the Pareto front.
ACKNOWLEDGMENTS
This work was partially supported by EU project ePolicy, FP7-ICT-2011-7,
grant agreement 288147. Possible inaccuracies of information are the
responsibility of the project team. The text reflects solely the views of its
authors. The European Commission is not liable for any use that may be
made of the information contained in this chapter.
REFERENCES
1. Krzysztof R. Apt and Mark Wallace. Constraint Logic Programming Using ECLiPSe. Cambridge (UK) and New York: Cambridge University Press, 2007.
2. Maurizio Bruglieri and Leo Liberti. Optimal running and planning of a
biomass-based energy production process. Energy Policy, 36(7): 2430–2438,
July 2008.
3. Paolo Cagnoli. VAS Valutazione Ambientale Strategica. Dario Flaccovio,
Palermo, Italy, 2010.
4. Massimiliano Cattafi, Marco Gavanelli, Michela Milano, and Paolo
Cagnoli. Sustainable biomass power plant location in the Italian Emilia-
Romagna region. ACM Transactions on Intelligent Systems and Technology,
2(4), article 33, 1–19, July 2011.
5. Damiana Chinese and Antonella Meneghetti. Design of forest biofuel sup-
ply chains. International Journal of Logistics Systems and Management, 5(5):
525–550, 2009.