An Introduction to Spatial Data Science with GeoDa
Key Features:
• Includes spatial perspectives on cluster analysis
• Focuses on exploring spatial data
• Supplemented by extensive support with sample data sets and examples on the GeoDaCenter
website
This book is both useful as a reference for the software and as a text for students and researchers
of spatial data science.
Luc Anselin is the Founding Director of the Center for Spatial Data Science at the University of Chi-
cago, where he is also the Stein-Freiler Distinguished Service Professor of Sociology and the College,
as well as a member of the Committee on Data Science. He is the creator of the GeoDa software and
an active contributor to the PySAL Python open-source software library for spatial analysis. He has
written widely on topics dealing with the methodology of spatial data analysis, including his clas-
sic 1988 text on Spatial Econometrics. His work has been recognized by many awards, such as his
election to the U.S. National Academy of Sciences and the American Academy of Arts and Sciences.
An Introduction to Spatial
Data Science with GeoDa
Volume 1 – Exploring Spatial Data
Luc Anselin
Designed cover image: Luc Anselin
First edition published 2024
by CRC Press
2385 NW Executive Center Drive, Suite 320, Boca Raton FL 33431
Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot as-
sume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have
attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders
if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged, please
write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or
utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including pho-
tocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission
from the publishers.
For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the
Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are
not available on CCC, please contact [email protected]
Trademark notice: Product or corporate names may be trademarks or registered trademarks and are used only for iden-
tification and explanation without intent to infringe.
List of Figures xv
Preface xxiii
Acknowledgments xxv
1 Introduction 1
1.1 Overview of Volume 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 A Quick Tour of GeoDa . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.1 Data entry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2.2 Data manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 GIS operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.4 Weights manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.5 Mapping and geovisualization . . . . . . . . . . . . . . . . . . . . . . 5
1.2.6 Exploratory data analysis . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.7 Space-time analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.8 Spatial autocorrelation analysis . . . . . . . . . . . . . . . . . . . . . 6
1.2.9 Cluster analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Sample Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 GIS Operations 37
3.1 Topics Covered . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Projections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.1 Coordinate reference system . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.2 Selecting a projection . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.2.3 Reprojection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Converting Between Points and Polygons . . . . . . . . . . . . . . . . . . . . 41
3.3.1 Mean centers and centroids . . . . . . . . . . . . . . . . . . . . . . . 42
3.3.2 Tessellations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 Minimum Spanning Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.1 Concept . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Minimum Spanning Tree options . . . . . . . . . . . . . . . . . . . . 49
3.5 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.5.1 Dissolve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.5.2 Aggregation in table . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.6 Multi-Layer Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6.1 Loading multiple layers . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.6.2 Automatic reprojection . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.6.3 Selection in multiple layers . . . . . . . . . . . . . . . . . . . . . . . 54
3.7 Spatial Join . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.7.1 Spatial assign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.7.2 Spatial count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.8 Linked Multi-Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.8.1 Specifying an inter-layer linkage . . . . . . . . . . . . . . . . . . . . 59
3.8.2 Visualizing linked selections . . . . . . . . . . . . . . . . . . . . . . . 59
5 Statistical Maps 87
5.1 Topics Covered . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.2 Extreme Value Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.1 Percentile map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.2.2 Box map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2.3 Standard deviation map . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 Mapping Categorical Variables . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3.1 Unique values map . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3.2 Co-location map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4 Cartogram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.1 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.4.2 Creating a cartogram . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.5 Map Animation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
8.3 Three Variables: Bubble Chart and 3-D Scatter Plot . . . . . . . . . . . . . 144
8.3.1 Bubble chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.3.2 3-D scatter plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.4 Conditional Plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.4.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.4.2 Conditional statistical graphs . . . . . . . . . . . . . . . . . . . . . . 153
8.4.3 Conditional maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
8.5 Parallel Coordinate Plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.5.1 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.5.2 Clusters and outliers in PCP . . . . . . . . . . . . . . . . . . . . . . 159
VI Epilogue 389
21 Postscript – The Limits of Exploration 391
Bibliography 405
Index 417
List of Figures
7.1 Histogram | Box Plot | Scatter Plot | Scatter Plot Matrix . . . . . . . . . . 116
7.2 Histogram for percent poverty 2020 . . . . . . . . . . . . . . . . . . . . . . 117
7.3 Histogram for percent poverty 2020 – 12 bins . . . . . . . . . . . . . . . . 117
7.4 Bar chart for settlement categories . . . . . . . . . . . . . . . . . . . . . . 118
7.5 Descriptive statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.6 Linking between histogram and map . . . . . . . . . . . . . . . . . . . . . 120
7.7 Linking between map and histogram . . . . . . . . . . . . . . . . . . . . . 120
7.8 Box plot for population change 2020–2010 . . . . . . . . . . . . . . . . . . . 121
7.9 Box map for population change 2020–2010 (upper outliers selected) . . . . 122
7.10 Scatter plot of food insecurity on poverty . . . . . . . . . . . . . . . . . . . 124
7.11 Regime regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.12 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.13 Default LOWESS local regression fit . . . . . . . . . . . . . . . . . . . . . 127
7.14 Default LOWESS local regression fit with bandwidth 0.6 . . . . . . . . . . 128
7.15 Default LOWESS local regression fit with bandwidth 0.05 . . . . . . . . . 129
7.16 Scatter plot matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.17 Scatter plot correlation matrix . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.18 Scatter plot matrix with LOWESS fit . . . . . . . . . . . . . . . . . . . . . 133
7.19 Food insecurity 2020 in Valles Centrales . . . . . . . . . . . . . . . . . . . 134
7.20 Averages Chart – Valles Centrales . . . . . . . . . . . . . . . . . . . . . . . 135
7.21 Map brushing and the averages chart – 1 . . . . . . . . . . . . . . . . . . . 136
7.22 Map brushing and the averages chart – 2 . . . . . . . . . . . . . . . . . . . 136
7.23 Map brushing and the averages chart – 3 . . . . . . . . . . . . . . . . . . . 137
7.24 Map brushing and the scatter plot – 1 . . . . . . . . . . . . . . . . . . . . 138
7.25 Map brushing and the scatter plot – 2 . . . . . . . . . . . . . . . . . . . . 138
7.26 Map brushing and the scatter plot – 3 . . . . . . . . . . . . . . . . . . . . 139
8.1 Bubble Chart | 3D Scatter Plot | Parallel Coordinate Plot | Conditional Plot 142
8.2 One dimension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.3 Two dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.4 Discrete evaluation points in three variable dimensions . . . . . . . . . . . 143
8.5 Two dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.6 Three dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
8.7 Bubble chart – default settings: education, basic services, extreme poverty 145
8.8 Bubble chart – bubble size adjusted: education, basic services, extreme
poverty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
8.9 Bubble chart – no color: education, basic services, extreme poverty . . . . 147
8.10 Bubble chart – categories: education, basic services, region . . . . . . . . . 147
8.11 3D scatter plot: education, basic services, extreme poverty . . . . . . . . . 148
8.12 Interacting with 3D scatter plot . . . . . . . . . . . . . . . . . . . . . . . . 149
8.13 Selection in the 3D scatter plot . . . . . . . . . . . . . . . . . . . . . . . . 150
8.14 Brushing and linking with the 3D scatter plot . . . . . . . . . . . . . . . . . 151
8.15 Conditional scatter plot – 3 by 3 . . . . . . . . . . . . . . . . . . . . . . . . 152
8.16 Conditional scatter plot – unique values . . . . . . . . . . . . . . . . . . . 153
8.17 Conditional box plot – lack of health access by region . . . . . . . . . . . . 154
8.18 Conditional box map – 2 by 2 . . . . . . . . . . . . . . . . . . . . . . . . . 155
8.19 Parallel coordinate plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.20 Parallel coordinate plot – standardized variables . . . . . . . . . . . . . . . 157
8.21 Brushing the PCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.22 Brushing map and PCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
8.23 Clusters of observations in PCP . . . . . . . . . . . . . . . . . . . . . . . . 159
8.24 Outlier observation in PCP . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
12.13 Spatial Empirical Bayes smoothed rate map using region block weights . . 243
Preface

This two-volume set is the long overdue successor to the GeoDa Workbook that I wrote
almost twenty years ago (Anselin, 2005a). It was intended to facilitate instruction in spatial
analysis and spatial regression by means of the GeoDa software (Anselin et al., 2006b). In
spite of its age, the workbook is still widely used and much cited, but it is due for a major
update.
The update is two-fold. First, many new methods have been developed or original
measures refined. This pertains not only to the spatial autocorrelation indices covered in the
original Workbook but also to a collection of newer methods that have come to define
spatial data science. Second, the GeoDa software has seen substantial changes, becoming an
open-source and cross-platform ecosystem that encompasses a much wider range of methods
than its legacy predecessor.
The two volumes outline my vision for an Introduction to Spatial Data Science. They include
a collection of methods that I view as the core of what is special about spatial data science,
as distinct from applying data science to spatial data. They are not intended to be a
comprehensive overview but constitute my personal selection of materials that I see as
central to promoting spatial thinking through teaching spatial data science.
The level in the current volume is introductory, aimed at my typical audience, which is
largely composed of researchers and students (both undergraduate and graduate) who have
not been exposed to any geographic or spatial concepts or have only limited familiarity with
the subject. So, by design, some of the treatment is rudimentary, covering basic concepts in
GIS and spatial data manipulation, as well as elementary statistical graphs. I have included
this material to keep the books accessible to a larger audience. Readers already familiar
with these topics can easily skip to the core techniques.
I believe the two volumes offer a unique perspective, in that they approach the identification
of spatial patterns from a number of different standpoints. The first volume includes an
in-depth treatment of local indicators of spatial association, whereas Volume 2 focuses on
spatial clustering techniques. The main objective is to indicate where a spatial perspective
contributes to the broader field of data science and what is unique about it. In addition,
the aim is to create an intuition for the type of method that should be applied in different
empirical situations. In that sense, the volumes serve both as the complete user guide to
the GeoDa software and as a primer on spatial data science. However, in contrast with the
original Workbook, spatial regression methods are not included. Those are covered in Anselin
and Rey (2014) and not discussed here.
Most methods contained in the two volumes are treated in more technical detail in the
various references provided. With respect to my own work, these include Anselin (1994; 1995;
1996; 1998; 1999; 2005b), Anselin et al. (2002), Anselin et al. (2004), Anselin et al. (2006b),
and, more recently, Anselin (2019a; 2019b; 2020), Anselin and Li (2019; 2020) and Anselin
et al. (2022). However, a few methods are new and have not been reported elsewhere or are
discussed here in greater depth than previously appeared. In this volume, these include the
co-location map and the local neighbor match test.
The methods are illustrated with a completely new collection of seven sample data sets that
deal with topics ranging from crime, socio-economic determinants of health, and disease
spread, to poverty, food insecurity and bank performance. The data pertain not only to the
U.S. (Chicago) but also include municipalities in Brazil (the State of Ceará) and in Mexico
(the State of Oaxaca), and community banks in Italy. Many of these data sets were used in
previous empirical analyses. They are included as built-in Sample Data in the latest version
of the GeoDa software.
The empirical illustrations are based on Version 1.22 of the software, available in Summer
2023. Later versions may include slight changes as well as additional features, but the
treatment provided here should remain valid. The software is free, cross-platform and
open-source and can be downloaded from https://geodacenter.github.io/download.html.
Acknowledgments
This work would not have existed without the tremendous efforts by the people behind
the development of the GeoDa software over the past twenty-some years. This started in
the early 2000s in the Spatial Analysis Laboratory at the University of Illinois, with major
contributions by Ibnu Syabri, supported by the NSF-funded Center for Spatially Integrated
Social Science (CSISS). Later, the software development was continued at the GeoDa Center
of Arizona State University, with Marc McCann as the main software engineer. For the last
ten years, Xun Li has served as the lead developer of the software. In addition, he has
been a close collaborator on several methodological refinements of the LISA approach. Julia
Koschinsky has been on the team for some twenty years, as a constant inspiration and
collaborator, starting at UIUC and most recently at the Center for Spatial Data Science of
the University of Chicago. Xun and Julia have been instrumental in the migration of GeoDa
from a closed-source Windows-based desktop software to an open-source and cross-platform
ecosystem for exploring spatial data. Julia in particular has been at the forefront of refining
the role of ESDA within a scientific reasoning framework, which I have tried to represent in
the book.
In addition, I would like to thank Lara Spieker from Taylor & Francis Group for her expert
guidance in this project.
Over the years, the research behind the methods covered in this book and the accompanying
software development has been funded by grants from the U.S. National Science Foundation,
the National Institutes of Health and the Centers for Disease Control, as well as by
institutional support by the University of Chicago to the Center for Spatial Data Science.
Finally, Emily has been patiently living with my GeoDa obsession for many years. This book
is dedicated to her.
Shelby, MI, Summer 2023
About the Author
Luc Anselin is the Founding Director of the Center for Spatial Data Science at the University
of Chicago, where he is also the Stein-Freiler Distinguished Service Professor of Sociology
and the College. He previously held faculty appointments at Arizona State University, the
University of Illinois at Urbana-Champaign, the University of Texas at Dallas, the Regional
Research Institute at West Virginia University, the University of California, Santa Barbara,
and The Ohio State University. He also was a visiting professor at Brown University and
MIT. He holds a PhD in Regional Science from Cornell University.
Over the past four decades, he has developed new methods for exploratory spatial data
analysis and spatial econometrics, including the widely used local indicators of spatial
autocorrelation. His 1988 Spatial Econometrics text has been cited some 17,000 times. He
has implemented these methods into software, including the original SpaceStat software, as
well as GeoDa, and as part of the Python PySAL library for spatial analysis.
His work has been recognized by several awards, including election to the U.S. National
Academy of Sciences and the American Academy of Arts and Sciences.
1 Introduction
Spatial data are special in that the location of the observations, the where, plays a critical
role in the methodology required for their analysis. Two aspects in particular distinguish
spatial data from the standard independent and identically distributed paradigm (i.i.d.)
in statistics and data analysis, i.e., spatial dependence and spatial heterogeneity (Anselin,
1988; 1990). Spatial dependence refers to the similarity of values observed at neighboring
locations, or “everything is related to everything else, but closer places more so,” known
as Tobler’s first law of geography (Tobler, 1970). Spatial heterogeneity is a particular form
of structural change associated with spatial subregions of the data, i.e., showing a clear
break in the spatial distribution of a phenomenon. Both spatial dependence and spatial
heterogeneity require a specialized methodology for data analysis, generically referred to as
spatial analysis.
Spatial data science is an emerging paradigm that extends spatial analysis, situated at the
interface between spatial statistics and geocomputation. What the term actually encompasses
is not settled, and the collection of methods and software tools it represents is also sometimes
referred to as geographic data science or geospatial data science (Anselin, 2020; Comber and
Brunsdon, 2021; Singleton and Arribas-Bel, 2021; Rey et al., 2023). The concept is closely
related to, overlaps somewhat with and has many methods and approaches in common
with fields such as geocomputation (Brunsdon and Comber, 2015; Lovelace et al., 2019),
cyberGIScience (Wang, 2010; Wang et al., 2013), and, more recently, GeoAI (Janowicz et al.,
2020; Gao, 2021).
This two-volume collection is intended as an introduction to the field of spatial data science,
emphasizing data exploration and visualization and focusing on the importance of a spatial
perspective. It represents an attempt to promote spatial thinking in the practice of data
science. It is admittedly a selection of methods that reflects my own biases, but it has proven
to be an effective collection over many years of teaching and research. The first volume deals
with the exploration of spatial data, whereas the second volume focuses on spatial clustering
methods.
The methods covered in both volumes work well for so-called small to medium data settings,
but not all of them scale well to big data settings. However, some important principles do
scale well, like local indicators of spatial association. Even though data sets of very large size
have become commonplace and arguably have been the drivers behind a lot of methodological
development in modern data science, this is not always relevant for spatial data analysis. The
point of departure is often big data (e.g., geo-located social media messages), but eventually,
the analysis is carried out at a more spatially aggregate level, where the techniques covered
here remain totally relevant.
The methodological approach outlined in this first volume supports an abductive process
of exploration, a dynamic interaction between the analyst and the data with the goal of
obtaining new insights. The focus is on insights that pertain to spatial patterns in the data,
such as the location of interesting observations (hot spots and cold spots), the presence of
structural breaks in the spatial distribution of the data, and the comparison of such patterns
between different variables and over time.
The identification of the patterns is intended to provide cues about the types of processes
that may have generated them. It is important to appreciate that exploration is not the
same as explanation. In my opinion, exploration nevertheless constitutes an important and
necessary step to obtain effective and falsifiable hypotheses to be used in the next stages
of the analysis. However, in practice, the line between pure exploration and confirmation
(hypothesis testing) is not always that clear, and the process of scientific discovery may
move back and forth between the two. I return to this question in more detail in the closing
chapter.
The two volumes are both an introduction to the methodology of spatial data science and the
definitive guide to the GeoDa software. This software represents the implementation of my
vision for a gradual progression in the exploration of spatial data, from simple description
and mapping to more structured identification of patterns and clusters, culminating with
the estimation of spatial regression models. It came at the end of a series of software
developments that started in the late 1980s (for a historical overview, see Anselin, 2012).
GeoDa is designed to be user-friendly and intuitive, working through a graphical user interface,
and therefore it does not require any programming expertise. Similarly, the emphasis in the
two volumes is on spatial concepts and how they can be implemented through the software,
but it does not deal with geocomputation as such.
A distinctive characteristic of GeoDa is the efficient implementation of dynamically linked
graphs, in the sense that one or more selected observations in a “view” of the data (a graph
or map) are immediately also selected in all the other views, allowing interactive linking and
brushing of the data (Anselin et al., 2006b). Since its initial release in 2003 (through the
NSF-funded Center for Spatially Integrated Social Science), the software has been adopted widely
for both teaching and research, with close to 600,000 unique downloads at the time of this
writing.
In the remainder of this introduction, I first provide a broad overview of the organization of
this first volume. This is followed by a quick tour of the GeoDa software and a listing of the
sample data sets used to illustrate the methods.
In addition to the material covered in the two volumes, the GeoDaCenter Github site
(https://geodacenter.github.io) contains an extensive support infrastructure. This includes
detailed documentation and illustrations, as well as a large collection of sample data sets,
cookbook examples, and links to a YouTube channel containing lectures and tutorials.
Specific software support is provided by means of a list of frequently asked questions and
answers to common technical questions, as well as by the community through the Google
Groups Openspace list.
The organization of the toolbar (and menu) follows the same logic as the layout of the parts
and chapters in the two books. It represents a progression in the exploration, from left to
right, from support functions to queries, description and visualization, and more and more
formal methods, ending up with the estimation of actual spatial models in the regression
module (not covered here).
A brief overview of each of the major parts is given next. This also includes the spatial
clustering functionality, which is discussed more specifically in Volume 2.
The three left-most icons, highlighted in Figure 1.2, deal with data entry and general input-
output. This includes the loading of spatial and non-spatial (e.g., tabular) data layers from
a range of GIS and other file formats (supported through the open-source GDAL library).
In addition, it offers connections to spatial databases, such as PostGIS and Oracle Spatial.
It also supports a Save As function, which allows the software to work as a GIS file format
converter. Further details are provided in Chapter 2.
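The same GDAL-backed input-output logic can also be scripted outside GeoDa. The following is a minimal sketch in Python using geopandas (an assumption, not part of GeoDa itself), with file names taken from the sample data:

import geopandas as gpd

# Load a spatial layer from any GDAL/OGR-supported format.
gdf = gpd.read_file("Chicago_community_areas.shp")
print(len(gdf))  # number of observations, 77 for the community areas

# The equivalent of Save As used as a file format converter.
gdf.to_file("Chicago_community_areas.geojson", driver="GeoJSON")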
Functionality for data manipulation and transformation is provided by the Table icon,
highlighted in Figure 1.3. This allows new variables to be created, observations selected,
queries formulated and includes other data table operations, such as merger and aggregation,
detailed in Chapter 2.
Spatial data operations are invoked through the Tools icon, highlighted in Figure 1.4. These
include many GIS-like operations that were added over the years to provide access to spatial
data for users who are not familiar with GIS. For example, point layers can be easily created
from tabular data with X,Y coordinates, point in polygon operations support a spatial join,
an indicator variable can be used to implement a dissolve application, and reprojection can
be readily implemented by means of a Save As operation. Specific illustrations are included
in Chapter 3.
The Weights Manager icon, Figure 1.5, contains a final set of functions that are in
support of the analytical capabilities. It gives access to a wide range of weight creation
and manipulation operations, discussed at length in the chapters of Part III. This includes
constructing spatial weights from spatial layers, as well as loading them from external files,
summarizing and visualizing their properties, and operations like union and intersection.
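Comparable weights functionality exists in the open-source PySAL library mentioned earlier. As a hedged sketch (libpysal assumed, not GeoDa's own code):

from libpysal.weights import Queen, Rook, w_union

# Construct contiguity weights from a spatial layer.
queen = Queen.from_shapefile("Chicago_community_areas.shp")
rook = Rook.from_shapefile("Chicago_community_areas.shp")

# Summarize properties: number of observations and sparseness.
print(queen.n, queen.pct_nonzero)

# Set operation on two weights structures.
combined = w_union(queen, rook)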
A cartogram is a specialized type of map that replaces the actual outline of spatial units by a
circle, whose area is proportional to a given variable of interest. Animation, in the sense of
moving through the locations of observations in increasing or decreasing order of the value
for a given variable is implemented by means of the map movie icon. Finally, the category
editor provides a way to design custom classifications for use in maps as well as in statistical
graphs, such as a histogram. Details are provided in Chapters 4 through 6.
The next eight icons, grouped in Figure 1.7, contain the functionality for exploratory data
analysis and statistical graphs. This includes a Histogram, Box Plot, Scatter Plot,
Scatter Plot Matrix, Bubble Chart, 3D Scatter Plot, Parallel Coordinate Plot
and Conditional Plots. These provide an array of methods for univariate, bivariate and
multivariate exploration. All the graphs are connected to any other open window (graph or
map) for instantaneous linking and brushing. This is covered in more detail in Chapters 7
and 8.
The exploration of space-time data, treated in Chapter 9, is invoked by means of the icons
on the right, highlighted in Figure 1.8. This includes a Time Editor, which is required
to transform the cross-sectional observations into a proper (time) sequence. In addition,
the Averages Chart implements a simple form of treatment analysis, with treatment and
controls defined over time and/or across space.
Spatial autocorrelation analysis is invoked through the three icons highlighted in Figure 1.9.
The first two pertain to global spatial autocorrelation. The left-most icon corresponds to
various implementations of the Moran scatter plot (Chapters 13 and 14). The middle icon
invokes nonparametric approaches to visualize global spatial autocorrelation as a spatial
correlogram and distance scatter plot (Chapter 15).
The third icon contains a long list of various implementations of local spatial autocorrelation
statistics, including various forms of the Local Moran’s I, the Local Geary c, the Getis-Ord
statistics and extensions to multivariate settings and discrete variables. The local neighbor
match test is a new method based on an explicit assessment of the overlap between locational
and attribute similarity. Details are provided in the chapters of Part V.
Finally, cluster analysis is invoked through the icon highlighted in Figure 1.10. An extensive
drop-down list also includes the density-based cluster methods DBScan and HDBScan, which
are treated in this volume under local spatial autocorrelation (Chapter 20).
The other methods are covered in Volume 2. They include dimension reduction, classic
clustering methods and spatially constrained clustering methods. The last items in the drop-
down list associated with the cluster icon pertain to the quantitative and visual assessment
of cluster validity, including a new cluster match map (see Volume 2).
2 Basic Data Operations

In this and the following chapter, I introduce the topic of data wrangling, i.e., the process
of getting data from its raw input into a form that is amenable for analysis. This is often
considered to be the most time-consuming part of a data science project, taking as much
as 80% of the effort (Dasu and Johnson, 2003). Even though the focus in this book is on
analysis and not on data manipulation per se, I provide a quick overview of the functionality
contained in GeoDa to assist with these operations. Increasingly, data wrangling has evolved
into a field of its own, with a growing number of operations turning into automatic procedures
embedded into software (Rattenbury et al., 2017). A detailed discussion of this topic is
beyond the scope of the book.
The coverage in this chapter is aimed at novices who are not very familiar with spatial
data manipulations. Most of the features illustrated can be readily accomplished by means
of dedicated GIS software or by exploiting the spatial data functionality available in the
R and Python worlds. Readers knowledgeable in such operations may want to just skim
the materials in order to become familiar with the way they are implemented in GeoDa.
Alternatively, these operations can be performed outside GeoDa, with the end result loaded
as a spatial data layer.
In the current chapter, I focus on essential input operations and data manipulations contained
in the Table functionality. In the next chapter, I consider a range of basic GIS operations
pertaining to spatial data wrangling.
To illustrate these features, I will use a data set with point locations of carjackings in
Chicago in 2020. The Chicago Carjackings data layer is available from the Sample Data
tab in the GeoDa file dialog (Figure 2.2).
In addition, in order to replicate the detailed steps used in the illustrations,
three original input files are needed as well. These are available from the GeoDa-
Center sample data site. They include a simple outline of the community areas,
Chicago_community_areas.shp, as well as comma delimited (csv) text files with the socio-
economic characteristics (Chicago_CCA_Profiles.csv), and the coordinates of the carjackings
(Chicago_2020_carjackings.csv). The sample data site also contains the detailed listing of
the variable names.
In a GIS, the boundary of an areal unit is stored as a series of line segments, each characterized
by the coordinates of their starting and ending points. In other words, what may seem like a
continuous boundary is turned into discrete segments.
Traditional data tables have no problem including X and Y coordinates as columns, but
as such cannot deal with the boundary definition of irregular spatial units. Since the
number of line segments defining an areal boundary can easily vary from observation to
observation, there is no efficient way to include this in a fixed number of columns of a
flat table. Consequently, a specialized data structure is required, typically contained in a
geographic information system or GIS.
Several specialized formats have been developed to efficiently combine both the attribute
information and the locational information. Such spatial data can be contained in files with
a special structure, or in spatially enabled relational data base systems.
I first consider common GIS file formats that can serve as input to GeoDa. This is followed
by an illustration of simple tabular input of non-spatial files. Finally, a brief overview is
given of connections to other input formats.
A detailed discussion of the individual formats is beyond the current scope. All are well-
documented, with many additional resources available online. Although it is always helpful,
there is no need to know the underlying formats in detail in order to use GeoDa, since the
interaction with the data structures is handled under the hood.
The main file manipulations are invoked from the File item in the menu, or by the three
left-most icons on the toolbar in Figure 2.1.
Using the navigation dialog and conventions appropriate for each operating system, the
shape file can be selected from this directory. This opens a new map window with the
spatial layer represented as a themeless choropleth map, as in Figure 2.4. The number of
observations is shown in parentheses next to the small green rectangle in the upper-left
panel, as well as in the status bar at the bottom (#obs = 77).
The current layer is cleared by clicking on the Close toolbar icon, the second item on the
left in Figure 2.1, or by selecting File > Close from the menu. This removes the base map.
At this point, the Close icon on the toolbar becomes inactive.
A more efficient way to open files is to select the file name in the directory window and
to drag it onto the Drop files here box in the dialog. Even easier is to load one of the
sample data sets or a recently used one, where a simple click on the associated icon in the
Sample Data or Recent tab suffices.
In contrast to the shape file format, which is binary, a GeoJSON file is simple text and can
easily be read by humans. As shown for the Chicago_community_areas.geojson file from
the sample data site in Figure 2.5 (this file must be downloaded to a working directory),
the locational information is combined with the attributes. After some header information
follows a list of features. Each of these contains properties, of which the first set consists
of the different variable names with the associated values, just as would be the case in
any standard data table. The final item refers to the geometry. This includes the type,
here a MultiPolygon, followed by a list of X-Y coordinates. In this fashion, the spatial
information is integrated with the attribute information.
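Because GeoJSON is plain text, its structure is easy to verify programmatically. A small sketch with Python's standard json module (file name as above):

import json

with open("Chicago_community_areas.geojson") as f:
    gj = json.load(f)

feature = gj["features"][0]
print(list(feature["properties"].keys()))  # the attribute columns
print(feature["geometry"]["type"])         # e.g., MultiPolygon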
To view the corresponding map, the Chicago_community_areas.geojson file name can be
selected in its directory and dragged onto the Drop files here box. This brings up the
same base map as in Figure 2.4.
The point map in Figure 2.7 shows the locations of the 1,412 carjackings that oc-
curred in the City of Chicago during the year 2020. It is generated by clicking on the
Chicago Carjackings icon in the Sample Data tab, or by dragging the file name
Chicago_carjack_2020_pts.shp from a working directory that contains the shape file.
The shape of the city portrayed by the outline of the points is slightly different from that in
the polygon map in Figure 2.4. This is due to a difference in projections: the point map is
in the State Plane Illinois East NAD 1983 projection (EPSG 3435), whereas the polygon
map uses decimal degrees latitude and longitude (EPSG 4326). This important aspect of
spatial data is often a source of confusion for non-GIS specialists. GeoDa provides an intuitive
interface to deal with projection issues. I return to this topic in Section 2.3.1.1 below and in
Section 3.2 in the next chapter.
In addition to comma-separated files, GeoDa also supports tabular input from dBase database
files, Microsoft Excel and Open Document Spreadsheet (*.ods) formatted files. As is the case
for the various GIS formats, the File > Save As command allows for the ready conversion
from one tabular format to another (e.g., from csv to dBase).
An additional feature of the input dialog in Figure 2.8 is the inclusion of optional Longitude/X
and Latitude/Y drop-down lists. With coordinate variables specified, this will create a
point layer. As such, it provides an alternative to the approach outlined in Section 2.3.1.
The Tools > Shape > Points from Table command opens up a dialog that lists all
numerical variables contained in the data table. In the example, shown in Figure 2.10, the
variables to be selected are Longitude and Latitude.
After specifying the coordinate variables and clicking on OK, a point map appears, as in
Figure 2.11.
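For reference, the same points-from-table step can be sketched in Python with pandas and geopandas (an assumption, not GeoDa itself), using the Longitude and Latitude columns from the dialog:

import pandas as pd
import geopandas as gpd

tab = pd.read_csv("Chicago_2020_carjackings.csv")
pts = gpd.GeoDataFrame(
    tab,
    geometry=gpd.points_from_xy(tab["Longitude"], tab["Latitude"]),
)  # no CRS assigned yet, matching the empty CRS entry discussed next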
The outline of Chicago suggested by the points is different from the one in Figure 2.7. Again,
this is due to the difference in projections, which is discussed next.
At this stage, the point layer can be saved as a GIS file, such as a shape file, using the
familiar File > Save As command (the file name should differ from any existing carjackings
point shape file, or the latter will be overwritten).
As a default, the Coordinate Reference System (CRS) entry in the file save dialog will
be empty, as in Figure 2.12. In other words, since nothing was specified other than the
coordinates, there is no projection information. As a result, if the layer was saved as a shape
file, it would not include a file with a prj file extension.
The lack of projection information severely hampers a range of data manipulations. Specifi-
cally, for computations such as Euclidean (straight line) distance, the decimal degrees are
inappropriate and either special calculations must be used (such as great circle distance), or
the coordinates need to be reprojected.
Even though latitude-longitude decimal degrees are not an actual projection, there is a
corresponding CRS. While in and of itself this information is not that meaningful, including
it when saving the file makes future reprojections easy. The CRS in proj4 format for decimal
degrees is (this is discussed in more detail in the next chapter):
+proj=longlat +datum=WGS84 +no_defs
Including this entry in the CRS box (see Figure 2.13) when saving the file will ensure that
future reprojections will be possible.
In other words, in the absence of specific projection information, the best practice is to
create a spatial layer using decimal latitude-longitude degrees and recording the CRS from
Figure 2.13 when saving the information as a GIS format file.
This is pursued in greater technical detail in the next chapter.
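In scripted form, this best practice amounts to attaching the decimal-degree CRS before saving. A hedged geopandas sketch (the input file name is hypothetical):

import geopandas as gpd

pts = gpd.read_file("carjackings_points.shp")  # hypothetical layer without a CRS
pts = pts.set_crs("+proj=longlat +datum=WGS84 +no_defs")
pts.to_file("carjackings_latlon.shp")          # now written with a .prj file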
2.3.2 Grid
The second type of spatial layer that can be created is a set of grid cells. Such cells are
rarely of interest in and of themselves, but are useful to aggregate counts of point locations
in order to calculate point pattern statistics, such as quadrat counts.
The grid creation is one of the few instances where GeoDa functionality can be invoked from
the menu or toolbar without any table or layer open. From the menu, Tools > Shape
> Create Grid brings up a dialog that contains a range of options, shown in Figure
2.14. The most important aspect to enter is the Number of Rows and the Number of
Columns, listed under Grid Size at the bottom of the dialog. This determines the number
of observations. In the example, the entries are 10 rows and 5 columns, for 50 observations.
There are four options to determine the Grid Bounding Box, i.e., the rectangular extent
of the grids. The easiest approach is to use the bounding box of a currently open layer or of
an available shape data source. The Lower-left and Upper-right corners of the bounding
box can also be set manually, or read from an ASCII file.
In the example in Figure 2.14, the Chicago community area boundary layer, contained in
the Chicago_community_areas.shp file is shown as the source for the bounding box.
After the grid layer is saved as a file, it can be loaded into GeoDa. In Figure 2.15, the result
is shown, superimposed onto the community areas layer (how to implement this will be
covered in the next chapter). Clearly, the extent of the grid cells matches the bounding box
for the area layer.
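The grid construction itself is straightforward to reproduce in code. A sketch with shapely and geopandas (not GeoDa's implementation), using the 10 by 5 dimensions from the example:

import geopandas as gpd
from shapely.geometry import box

base = gpd.read_file("Chicago_community_areas.shp")
xmin, ymin, xmax, ymax = base.total_bounds  # bounding box of the layer
nrows, ncols = 10, 5
dx, dy = (xmax - xmin) / ncols, (ymax - ymin) / nrows

# One rectangular cell per row-column combination, 50 in all.
cells = [box(xmin + j * dx, ymin + i * dy,
             xmin + (j + 1) * dx, ymin + (i + 1) * dy)
         for i in range(nrows) for j in range(ncols)]
grid = gpd.GeoDataFrame(geometry=cells, crs=base.crs)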
2.4 Table Manipulations

In contrast to most other operations in GeoDa, opening a data table can only be accomplished
by selecting a toolbar icon, i.e., the second icon from the right on the toolbar shown in Figure 2.1.
With the table open, the options can be accessed from the menu, by selecting the Table
menu item, or by right clicking on the table itself. This brings up the list shown in Figure
2.16.
In addition to a range of variable transformations, the options menu also includes the
Selection Tool, which is the way in which data queries can be carried out. I discuss this in
more detail in Section 2.5. The more commonly used variable manipulations are covered in
what follows.
Note that a File > Save operation will overwrite the current files. To avoid this behavior, select File > Save As and specify a
new file name.
To facilitate data table operations like Merge, the community area identifier must be
changed to integer.
… (including a leading zero), %Y for the year in full four digits, %I for the hour in 12-hour segments (this requires AM or PM in addition), %M for minutes in two digits, %S for seconds in two digits and %p for AM or PM.
Once a customized format is included into the setup interface, it will be recognized in all
future instances of datetime format conversion.
indicate the sorting order. The original order can be restored by sorting on the row numbers
(the left-most column).
Individual cell values can be edited by selecting the cell and entering a new value. As is
always the case in GeoDa, such changes are only permanent after the table is saved.
Specific observations can be selected manually by clicking on their row number (selection
through queries is covered in Section 2.5). The corresponding entries will be highlighted in
yellow. Additional observations can be added to the selection in the usual way, for example,
by means of shift-click (or another key combination, depending on the operating system).
The selected observations can be located at the top of the table for easy reference by means
of Move Selected to Top.
Finally, the arrangement of the columns in the table can be altered by dragging the variable
name and moving it to a different position. This is often handy to locate related variables
next to each other in the table for easy comparison when the original layout is not convenient.
2.4.2 Calculator
The Calculator functionality provides a rudimentary interface to create new variables,
transform existing variables and carry out a number of algebraic operations. It is limited to
one operation at a time, so it is not suitable for bulk programming.
The Calculator interface contains six tabs, dealing, respectively, with Special functions,
Univariate and Bivariate operations, Spatial Lag, Rates and Date/Time operations.
Spatial Lag and Rates are advanced functions that are discussed separately in Chapters 6
and 12.
The calculator interface is shown in Figure 2.21, illustrated for the Bivariate operations.
Each tab at the top includes a series of functions, either creating a new variable (Special),
operating on a single variable (Univariate), two variables (Bivariate), or values contained
in a special Date/Time format. The specific functions are selected from the Operator
drop-down list.
Each operation begins by identifying a target variable in which the result of the operation
will be stored. Typically, this will be a new variable, which is created by means of the Add
Variable button located to the right of Result.7 This brings up an interface in which the
Name for the new variable is specified, as well as its Type, its precision, and where to
insert it into the table. The default is to place a new variable to the left of the first column
in the table, but any other variable can be taken as a reference point (i.e., the new column is
placed to the left of the specified variable). In addition, the new variable can also be placed
at the right-most end of the table (the last option in the list).
7 A new variable can also be created directly from the table options menu as Table > Add Variable.
With the target variable specified, the actual calculations can be carried out by selecting
the operation from the drop-down list and choosing the variables involved. Again, to make
the new variable permanent, the data table needs to be saved.
The various operations are briefly reviewed next.

The basic standardization transforms a variable x into its z-score equivalent:

z_i = (x_i − x̄) / σ(x),

with x̄ as the mean of the original variable x, and σ(x) as its standard deviation. A more
robust alternative uses the mean absolute deviation:

mad = (1/n) Σ_i |x_i − x̄|,

i.e., the average of the absolute deviations between an observation and the mean for
that variable. The estimate for mad takes the place of σ(x) in the denominator of the
standardization expression.
Two additional transformations are based on the range of the observations, i.e., the
difference between the maximum and the minimum. These are the RANGE ADJUST and
the RANGE STANDARDIZE options.

RANGE ADJUST divides each value by the range of observations:

r_a = x_i / (x_max − x_min).

While RANGE ADJUST simply re-scales observations as a function of their range, RANGE
STANDARDIZE turns them into a value between zero (for the minimum) and one (for
the maximum):

r_s = (x_i − x_min) / (x_max − x_min).
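As a numerical illustration, the transformations can be written out directly. A sketch in numpy with a toy variable (not GeoDa's internal code):

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
z = (x - x.mean()) / x.std()              # standardized (z)
mad = np.mean(np.abs(x - x.mean()))       # mean absolute deviation
z_mad = (x - x.mean()) / mad              # standardized (mad)
ra = x / (x.max() - x.min())              # range adjust
rs = (x - x.min()) / (x.max() - x.min())  # range standardize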
Note that any standardization is limited to one variable at a time, which admittedly is not
very efficient. However, most analyses where variable standardization is recommended, such
as in the multivariate clustering techniques covered in Volume 2, include a transformation
option that can be applied to all the variables in the analysis simultaneously. The same five
options as discussed here are available to each analysis.
One exception to this is when you do not want to merge in the duplicate fields. Checking
the box Use existing field name makes sure that only the original columns are kept.
A second potential conflict is when the variable name to be included contains more than 10
characters. In csv input files, there are no constraints on variable name length, but the dbf
format used for the attribute values in a shape file format has a limit of 10 characters for a
variable name. Again, an alternative will be suggested in the dialog, but it can be easily
edited (as long as it stays within the 10 character limit).8
The merged tables are shown in Figure 2.23. By keeping the identifiers for both data sets in
the merged table (respectively area_num_1 – in the first column – and GEOID – in the
8th column), one can easily verify that the observations are lined up correctly.
As before, the merger only becomes permanent after a File > Save operation.
Note that the merge operation in GeoDa implements a left join, in the sense that only
observations that match the ID in the current table are included in the merged table.
Specifically, if the table to be merged does not contain certain observations, the corresponding
entries in the merged table will be missing values. Also, if the table to be merged contains
observations that are not part of the current table, they will be ignored. Other forms of
joins are not currently implemented.
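The left-join semantics can be made concrete with a small pandas sketch (the data values are made up for illustration):

import pandas as pd

current = pd.DataFrame({"area_num_1": [1, 2, 3]})
incoming = pd.DataFrame({"GEOID": [2, 3, 4], "POP": [55, 60, 70]})
merged = current.merge(incoming, how="left",
                       left_on="area_num_1", right_on="GEOID")
# Area 1 gets a missing value for POP; GEOID 4, absent from the
# current table, is ignored entirely.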
2.5 Queries
While queries of the data are somewhat distinct from data wrangling per se, drilling down
the data is often an important part of selecting the right subset of variables and observations.
In order to select particular observations or rows in the data table, the Selection Tool is
used. This is arguably one of the most important features of the table options.
The Chicago Carjackings data set is used to illustrate the selection functionality. First, a
few adjustments are needed: the Date column must be reformatted to datetime, and new
variables for the Month and the Day must be included (using Table > Calculator as in
Section 2.4.2.5).
8 The same problem can occur when using File > Save As to convert a csv file to the dBase format. Here too, a dialog will suggest alternative variable names, which can be edited.
A useful feature is to use Invert Selection to choose all observations except the selected
ones. For example, this could be used to choose all months but November.
Often, this approach is the most practical way to remove unwanted observations, since there
is no Save Unselected function. First, the observations to be removed are selected, followed
by inverting the selection. At this point, File > Save Selected As can be employed to
create a new data set (see Section 2.5.3).
The same approach can also be used in combination with Select All Undefined, which
identifies the observations with missing values for a given variable. The inverted selection
can then be saved as a data set without missing values.
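In scripted form, the invert-selection idiom is simply a boolean negation. A pandas sketch, assuming a Month variable has been added as described above:

import pandas as pd

df = pd.read_csv("Chicago_2020_carjackings.csv")  # with a Month column added
selected = df["Month"] == 11                      # e.g., all November cases
kept = df[~selected]                              # everything but the selection
kept.to_csv("carjackings_no_november.csv", index=False)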
The Add Neighbors To Selection option will be discussed in the chapters dealing with
spatial weights, in Part III.
… variable for the unique values. A detailed discussion of mapping functionality is deferred until Chapter 4.
3 GIS Operations
3.2 Projections
Spatial observations need to be georeferenced, i.e., associated with specific geometric objects
for which the location is represented in a two-dimensional Cartesian coordinate system
(i.e., on a flat map). Since all observations originate on the surface of the three-dimensional
near-spherical earth, this requires a transformation from three to two dimensions.
The transformation involves two concepts that are often confused by non-geographers,
i.e., the geodetic datum and the projection. The topic is complex and forms the basis for
the discipline of geodesy. A detailed treatment is beyond the current scope, but a good
understanding of the fundamental concepts is important. The classic reference is Snyder
(1993), and a recent overview of a range of technical issues is offered in Kessler and Battersby
(2019).
The basic building blocks are degrees (and minutes and seconds) latitude and longitude
that situate each location with respect to the equator and the Greenwich Meridian (near
London, England). Longitude is the horizontal dimension (x) and is measured in degrees
East (positive) and West (negative) of Greenwich, ranging from 0 to 180 degrees. Since
the U.S. is west of Greenwich, the longitude for U.S. locations is negative. Latitude is the
vertical dimension (y) and is measured in degrees North (positive) and South (negative) of
the equator, ranging from 0 to 90 degrees. Since the U.S. is in the northern hemisphere,
its latitude values will be positive. Latitude and longitude are typically given as decimal
degrees, but, if not, a conversion from degrees, minutes and seconds is straightforward.
In order to turn a geographic description, such as an address, into latitude-longitude degrees,
it is necessary to adhere to a so-called geodetic datum, a three-dimensional coordinate system
or model that represents the shape of the earth. Currently, the most commonly used datum is
the World Geodetic System of 1984, WGS 84, which represents the earth as an ellipsoid (and
not as a perfect sphere). In North America, an alternative is NAD 83, the North American
Datum of 1983. In practice, for U.S. locations, there is not much difference between the two.
Both these standards are about to be replaced by reference frames that take advantage of
Global Navigation Satellite Systems (GNSS).1
The second step in the process of georeferencing consists of converting the latitude-longitude
coordinates to Cartesian x-y coordinates in a planar system, using a cartographic projection.
Hundreds of projections have been developed, each addressing different aspects of the
mathematical problem of converting a three-dimensional object (on a sphere or ellipsoid)
to two dimensions (a flat map). In this conversion, every projection involves distortion
of one or more fundamental properties of geographic objects: angles (sometimes confused
with shape), area, distance and direction. It is important to be aware of this limitation,
since many investigations rely on the computation of variables such as distance, or density
(which involves area). The use of an inappropriate projection or distance metric may yield
misleading results, especially for analyses that cover large areas (e.g., continent-wide).
1 For details, see https://geodesy.noaa.gov/datums/newdatums/index.shtml.
For our purposes, three aspects are most important. One is to recognize whether spatial
coordinates are projected (typically in units of feet or meters), or unprojected (i.e., in decimal
degrees). To confuse matters more, the latter are sometimes referred to as a geographic
projection, even though there is no projection involved. It is important to keep in mind that
latitude and longitude are not expressed in distance units, but are degrees.
For example, a graph showing locations with longitude as the x-axis and latitude as the
y-axis, treated as if they were regular distance units, can be misleading, even though it
is seen quite commonly in publications. It ignores the fundamental property that latitude
and longitude are degrees (angles). Similarly, the calculation of Euclidean distance is only
supported for projected coordinates and should not be performed on longitude-latitude pairs.
In many instances, GeoDa will generate a warning when an attempt to compute distances
with decimal degrees is made, but this cannot be detected in all situations.
A second aspect is to be aware of the characteristics of a particular projection that is being
used. Specifically, it is important to know whether the projection respects properties such as
area (equal area) or distance (equidistant), although such properties typically only pertain
to a subset of the projected map.2
A final important aspect is to make sure that layers are in the same projection when combined.
GeoDa has some functionality to reproject layers on the fly to accomplish this, but it is not
fail-safe.
The spatialreference.org site contains literally hundreds of projection definitions that can be easily searched.4 For each projection,
there is a summary of its properties and a list of its CRS specification in a range of different
formats. For example, this includes the format used by ESRI in the *.prj files (part of the
shape file specification), as well as the proj4 specification used by GeoDa.
While it is perfectly fine to use locations expressed as latitude-longitude decimal degrees, it is
imperative to use the proper mathematical operations when computing properties like area
and distance. For the latter, it is necessary to use great circle distance (arc distance), which
is expressed in terms of the angles represented by latitude and longitude (see Chapter 11).
In addition, the proper conversion of the great circle distance from angular units to distance
unit (e.g., miles or kilometers) needs to differentiate by degree latitude. The implementation
in GeoDa is only approximate and uses the distances at the equator.
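As an illustration of the principle (and not of GeoDa's internal code, which, as noted, is approximate), a haversine great circle calculation might look as follows; the coordinates are illustrative:

```python
from math import radians, sin, cos, asin, sqrt

def great_circle_km(lat1, lon1, lat2, lon2, r=6371.0):
    """Haversine great circle distance in kilometers (r = mean earth radius)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Chicago to New York City, roughly
print(great_circle_km(41.88, -87.63, 40.71, -74.01))  # about 1,144 km
```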
In practice, it is preferred to use projected coordinates so that Euclidean distance operations
and area calculations are straightforward to perform. For North America, a useful projection
is universal transverse Mercator, or UTM. The country is divided into parallel zones, as
shown in Figure 3.1. Each zone corresponds to a specific projection that is represented by
a CRS (e.g., an EPSG code).5
Let’s say we need a projection for locations in Chicago (indicated by the arrow on the figure).
From the map, we can see that this city is located in UTM zone 16 (north – there is a
southern hemisphere equivalent). Searching the spatialreference.org site for this projection
yields a list of specifications (i.e., combinations of different datums). For example, for a
WGS84 datum we can find an associated EPSG code of 32616. In the proj4 format, the
corresponding CRS is:
+proj=utm +zone=16 +ellps=WGS84 +datum=WGS84 +units=m +no_defs
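As a sketch of what such a CRS does in practice, the pyproj Python library (not part of GeoDa) can apply this exact transformation through the EPSG code; the coordinates below are illustrative:

```python
from pyproj import Transformer

# WGS 84 (EPSG:4326) to UTM zone 16N (EPSG:32616); always_xy gives lon-lat order
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32616", always_xy=True)

lon, lat = -87.63, 41.88            # approximate location in Chicago
x, y = transformer.transform(lon, lat)
print(x, y)                         # easting and northing in meters
```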
5 See, for example, the GISGeography site at https://gisgeography.com/utm-universal-transverse-mercator-projection/.
a new layer with the proper projection. An alternative approach that avoids the need to
type in the actual proj4 specification is considered next.
3.2.3 Reprojection
To avoid the manual entry of the correct CRS code, the Save As process in GeoDa provides
an alternative that does not need an explicit specification, but requires the presence of
another layer with the desired projection. The CRS information is then copied from that
other layer and used in the reprojection of a current layer.
For example, consider the 77 Chicago community areas expressed in latitude-longitude
coordinates in Figure 2.4. We saw how the locations of the carjackings in Figure 2.7
suggested a somewhat different shape for the city, because it was expressed in the State
Plane Illinois East NAD 1983 projection (EPSG 3435). To convert the polygon community
areas to the same projection, we could enter the proper proj4 specification in the CRS box.
An alternative is to copy the CRS information from a different layer. For example, after
opening the community area file (Chicago_community_areas.shp) and invoking File >
Save As, the CRS box shows the proj4 specification for latitude-longitude degrees, as in
Figure 3.2. Instead of typing in the new specification, the small globe icon to the right can
be selected to load a CRS specification from another file.
This brings up a file load interface into which the file name for the point shape file with the
carjacking locations can be specified (Chicago_carjack_2020_pts.shp). The easiest way to
accomplish this is by dragging the file name into the Drop file here area. Once the file
name is loaded, the contents of the CRS box change to the new proj4 specification, as in
Figure 3.3.
After saving the new file (e.g., as community_areas_proj.shp), the current project should be
closed. A new project is started by loading the just created projected layer. The corresponding
themeless base map is as in Figure 3.4. The more compressed shape matches the layout of
the carjacking locations in Figure 2.7.
constructed to represent the community areas as tessellations around those central points,
such as Thiessen polygons. The key factor is that all three representations are connected to
the same cross-sectional data set. As discussed in more detail in later chapters, for some
types of analyses it is advantageous to treat the areal units as points, whereas in other
situations Thiessen polygons form a useful alternative to dealing with an actual point layer.
The center point and Thiessen polygon functionality is invoked through the options menu
associated with the map view (right click on the map to bring up the menu). This is
illustrated with the just created projected community area map (Figure 3.4).
(different polygons associated with the same ID, such as a mainland area and an island
belonging to the same county). In those instances, the centers can end up being located
outside the polygon. Nevertheless, the shape centers are a handy way to convert a polygon
layer to a corresponding point layer with the same underlying geography. For example, as
covered in Chapter 11, they are used under the hood to calculate distance-based spatial
weights for polygons.
A second option is to Add the coordinates of the central points to Table. This brings up
a dialog that prompts for the variable names for the coordinates, with COORD_X and
COORD_Y as defaults. For example, the mean center coordinates could be set to
COORD_XM and COORD_YM and the centroids to COORD_XC and COORD_YC.
In Figure 3.6, the associated coordinate values are shown in the table (after some reshuffling
of columns). Close examination of the values reveals only minor differences between the two
types of centers in this particular case, in part because the community areas have
mostly fairly convex shapes (the coordinates are in meters).
Finally, the Save option results in the usual prompt for a file name (e.g.,
comm_area_centers.shp) and creates the new layer. In the example, the point layer associated
with the community area mean centers is as in Figure 3.7.
3.3.2 Tessellations
A tessellation is the exhaustive covering of an area with polygons or tiles. Of particular
interest in spatial analysis are so-called Thiessen polygons or Voronoi diagrams (for an
extensive discussion, see Okabe et al., 2000).
Thiessen polygons are a tiling of polygons centered around points, such that each polygon
defines the area that is closer to its central point than to any other point. In other words, the
polygon contains all possible locations that have the central point, rather than any other
point in the data set, as their nearest neighbor.7 The polygons are constructed from the
perpendicular bisectors of the lines connecting each central point to its neighboring points.
The network of connecting lines is referred to as the Delaunay triangulation, which is in fact
the dual problem to the tessellation.
In sum, any point layer can be converted to a polygon layer with the same geography by
constructing Thiessen polygons around the points.
7 In economic geography, this corresponds to the notion of a market area, assuming a uniform distribution
of demand.
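To illustrate the principle (this is not GeoDa's implementation), the scipy library computes both the Voronoi diagram and its dual Delaunay triangulation from a set of points; the coordinates below are hypothetical:

```python
import numpy as np
from scipy.spatial import Voronoi, Delaunay

# Nine illustrative point locations (e.g., projected coordinates in meters)
points = np.array([[0, 0], [2, 0], [4, 1], [1, 2], [3, 3],
                   [0, 4], [2, 5], [4, 4], [5, 2]], dtype=float)

vor = Voronoi(points)        # Thiessen polygons around each point
tri = Delaunay(points)       # the dual triangulation

print(vor.vertices)          # corner points of the Thiessen polygons
print(tri.simplices)         # triangles connecting neighboring points
```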
is a problem in the calculation of the polygons due to multiple observations with the same coordinates. This
may happen when considering different housing units in a high rise, or when the precision of the coordinates
is insufficient to distinguish between all points. The option adds an additional column to the table, labeled
DUP_IDS, which contains a value of 1 for those points that have multiple observations with the same
coordinates.
connect the nodes. In the treatment of spatial weights, the graph structure will be called
the connectivity graph.
For example, in Figure 3.9, the connectivity graph is shown for the community area mean
centers from Figure 3.7. This graph represents a distance-band weights matrix with a cut-off
distance that corresponds to the largest nearest neighbor distance between points, the
so-called max-min distance criterion. This ensures that each observation (node) has at least
one neighbor.
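A minimal sketch of this max-min distance computation (illustrative code, assuming projected coordinates):

```python
import numpy as np
from scipy.spatial.distance import cdist

def max_min_distance(coords):
    """Largest nearest neighbor distance: the smallest cut-off that
    guarantees every observation has at least one neighbor."""
    d = cdist(coords, coords)        # full pairwise distance matrix
    np.fill_diagonal(d, np.inf)      # ignore self-distances
    return d.min(axis=1).max()       # nearest neighbor per point, then the max

# Five hypothetical projected point coordinates (in meters)
pts = np.array([[0.0, 0.0], [100.0, 50.0], [250.0, 80.0],
                [30.0, 300.0], [400.0, 400.0]])
print(max_min_distance(pts))
```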
A Minimum Spanning Tree (MST) is a subset of this graph that constitutes a tree, i.e., “a
connected, undirected network that contains no loops” (Newman, 2018). Importantly, this
means that all nodes are connected (no isolates or islands) and there is exactly one path
between any pair of nodes (no loops). As a result, a tree associated with n nodes contains
exactly n − 1 edges. The MST is a special tree in that the total sum of the edge lengths
(typically a weight) is minimized. It constitutes a simplification of the full network structure
that fulfills a minimum cost objective.
For example, the MST computed from the network structure in Figure 3.9 is shown in Figure
3.10. The complexity of the graph is greatly reduced and now consists of only 76 edges (for
77 observations or nodes).
The MST is similar to, but distinct from the solution to the so-called traveling salesperson
problem, which looks for a route through the network that achieves a minimum cost objective,
but visits each node just once. This implies that the nodes in a solution to the traveling
salesperson problem cannot have a degree (i.e., the number of edges connected to it) larger
than two. In Figure 3.10, the maximum degree is four.
A classic solution to the MST problem is achieved by Prim’s algorithm (Prim, 1957),
illustrated in the following worked example. The MST plays a major role in some of the
spatial clustering procedures discussed in Volume 2.
Figure 3.10: Minimum spanning tree for connectivity graph of mean centers
At node 4, the two edges have the same length, so 4 could be connected to either 6 or 5.
The order is immaterial, since the length of 15 is the smallest in the network and will have
priority over any other edge. At this point, the path consists of 1–2, 2–4, 4–6 and 4–5. The
next step connects 5 to 7 (18.03), then 7 to 8 (18.03) and 8 to 9 (15). The only remaining
unconnected node is 3, which is linked to 5 (29.15). The resulting MST follows the red lines
in Figure 3.15. Every node is visited and the total distance of the path is minimized. The
solution consists of eight edges, which is n − 1. The MST solution is not necessarily unique,
since several trees can achieve the same minimum cost.
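For concreteness, a compact sketch of Prim's algorithm in Python (illustrative code with a hypothetical four-node graph, not GeoDa's implementation), using a priority queue of candidate edges:

```python
import heapq

def prim_mst(adj):
    """Prim's algorithm for the minimum spanning tree.
    adj[u] is a list of (weight, v) pairs; nodes are labeled 0..n-1.
    Returns the (u, v, weight) tree edges: always n - 1 of them."""
    n = len(adj)
    visited = {0}                             # start from an arbitrary node
    heap = [(w, 0, v) for w, v in adj[0]]     # candidate edges leaving the tree
    heapq.heapify(heap)
    tree = []
    while heap and len(tree) < n - 1:
        w, u, v = heapq.heappop(heap)         # cheapest edge leaving the tree
        if v in visited:
            continue
        visited.add(v)
        tree.append((u, v, w))
        for w2, x in adj[v]:
            if x not in visited:
                heapq.heappush(heap, (w2, v, x))
    return tree

# A small hypothetical graph with 4 nodes
adj = [[(2.0, 1), (3.5, 2)],
       [(2.0, 0), (1.0, 2), (4.0, 3)],
       [(3.5, 0), (1.0, 1), (2.5, 3)],
       [(4.0, 1), (2.5, 2)]]
print(prim_mst(adj))   # three edges, total weight 5.5
```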
3.5 Aggregation
Spatial data sets often contain identifiers of larger encompassing units, such as the state that
a county belongs to, or the census tract that contains individual household data. Such nested
spatial scales readily lend themselves to aggregation to the larger unit. The aggregation
computes observations for the new spatial scale as the sum, average or other operation
applied to the values at the lower scale. Note that this only makes sense if the underlying
variables are spatially extensive, such as counts (total population, total households). Without
proper re-weighting, the aggregation gives misleading results for variables expressed as
medians or percentages (e.g., median household value, median income).
GeoDa supports two different ways to compute aggregates from the smaller units. One
approach also creates a dissolved layer, i.e., a spatial layer that includes both a map for the
larger spatial units and a data table with the corresponding aggregated observations. The
second approach is limited to computing a new table with the aggregate information. It does
not generate a spatial component.
3.5.1 Dissolve
The Dissolve functionality is part of the Tools menu (Tools > Dissolve), or can be
accessed by selecting the Dissolve option from the Tools icon on the toolbar, the right-most
icon in Figure 2.1.
To illustrate this feature, the point of departure is the merged data set created in Section
2.4.3 (e.g., saved as Chicago_2020.shp). This data set contains socio-economic data for
the 77 Chicago community areas. In addition, it also has an identifier that associates each
community area with a larger encompassing district, i.e., the variable districtno. The
corresponding layout is given by the themeless base map from Figure 2.4.
The dialog, shown in Figure 3.16, requests three important pieces of information. The most critical
is the variable for dissolving, i.e., the key that indicates the larger scale to which the
observations will be aggregated (here, districtno). Next follows the selection of the variables
for which the aggregate values will be calculated. In the example in Figure 3.16, these
are 2000_POP, 2010_POP and TOT_POP (the population in 2020 from ACS). The
variables are selected by double clicking on them, or by selecting them and then using the
> key to move them to the right-hand side column.
Finally, the proper aggregation Method must be selected. The available options
are Count, Average, Max, Min and Sum. The Count function doesn’t actually compute
any aggregate values, but provides a count of the number of smaller units included in the
larger unit (e.g., the number of community areas in each district). For example, Sum will
yield the population totals for each district.
After activating the Dissolve button, the usual file creation dialog is produced to specify a
file type and file name, e.g., Chicago_district_2020.shp.
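The same operation can be sketched outside GeoDa with the geopandas library; the file and variable names follow the example above, but the code itself is illustrative:

```python
import geopandas as gpd

# Community areas with socio-economic data and a district identifier
gdf = gpd.read_file("Chicago_2020.shp")

# Keep the key, the variables to aggregate and the geometry, then
# dissolve: polygons are merged by district and the values summed
districts = gdf[["districtno", "2000_POP", "2010_POP", "TOT_POP", "geometry"]].dissolve(
    by="districtno", aggfunc="sum"
)

districts.to_file("Chicago_district_2020.shp")
```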
Figure 3.17 shows the outline of the nine districts (in red), together with the original
community area boundaries (in blue). This is an illustration of a multiple layer operation,
since both the community area layer and the district layer are combined on the same map.
The multilayer functionality is further detailed in Section 3.6.1. The associated table (Figure
3.18) lists the population totals in the three census years for the district aggregates. The table
also includes a new variable, AGG_COUNT, which indicates the number of community
areas aggregated into the respective district.
It is important to keep in mind that the visual properties (such as color and opacity) of the
current layer are managed by the map legend (see Chapter 4). The properties of the other
layers are controlled through the Map Layer Settings.
This logic may seem a bit counter-intuitive at first, but it is based on the linking and
brushing architecture implemented in GeoDa (see Chapters 4 and 7 for a more extensive
discussion). Currently, the linking operation works only for a single layer, which is always
the first loaded layer.
Figure 3.23: Spatial join dialog – areal unit identifier for point
This functionality is invoked from the menu as Tools > Spatial Join, or by selecting the
Spatial Join option from the Tools icon in Figure 2.1.
spatial assign. In the example, there are three such points, all situated at the very edge of
a polygon. This is illustrated in Figure 3.25 for the two mismatches at the southern and
south-western edge of the city (one other mismatch is in the very north).
Depending on the goals of the analysis, one could either eliminate the points from the data
set, or manually edit the values in the table after zooming in on the actual location.
The new spatial assign variable can now be used to aggregate observations, as in Section
3.5. However, the dissolve operation does not work, since there is no boundary information
associated with the points. Their spatial information consists of the coordinates only, which
do not lend themselves to a dissolve operation.
As always, a File > Save or File > Save As operation must be used to make the addition
of the spatial assign variable permanent.
4
Geovisualization
In this first of three chapters dealing with mapping, I begin to explore the concept of
geovisualization. Before getting into specific methods, I start with a brief discussion of the
larger context of exploratory data analysis (EDA) and especially its spatial counterpart,
exploratory spatial data analysis (ESDA). Central to this and to its implementation in GeoDa
are the concepts of linking and brushing.
The technical discussion begins with an overview of key aspects of thematic map construction,
followed by a review of traditional map classifications, i.e., quantile maps, equal interval
maps and natural breaks maps. More statistically inspired maps are discussed in Chapter 5.
Conditional maps are considered in the treatment of conditional plots in Chapter 8. The
particular problem of mapping rates or proportions is covered in Chapter 6.
The common map types are followed by a discussion of various mapping options in GeoDa
that allow interaction with the map window and the creation of output for use in other media.
The chapter closes with an introduction of the implementation of custom classifications and
the use of the project file.
Even though there is substantial mapping functionality in GeoDa, it is worth noting that it
is not cartographic software. The main objective is to use mapping as part of an overall
framework that interacts with the data in the process of exploration, through so-called
dynamic graphics (elaborated upon in Section 4.2). By design, maps in GeoDa do not have
some standard cartographic features, such as a directional arrow, or a scale bar, since they
are intended to be part of an interactive framework. However, any map can be saved as an
image file for further manipulation in specialized graphics software.
As in the two previous chapters, the discussion here is aimed at novices, who are less familiar
with basic cartographic principles. Others may just want to skim the material to see how the
functionality is implemented in GeoDa. The treatment focuses on gaining familiarity with
essential concepts, sufficient to be able to carry out the various operations in GeoDa. More
technical details can be found in classic cartography texts, such as Brewer (2016), Kraak
and Ormeling (2020) and Slocum et al. (2023), among others. Also highly recommended for
those not familiar with mapping is Monmonier’s How to lie with maps (Monmonier, 2018),
which provides an easy to read overview of critical aspects of the visualization of spatial
data.
To illustrate the various techniques, a new data set is used. It contains information on Zika
infection and Microcephaly in municipios of the state of Ceará in northeastern Brazil. This
constitutes a subset of the data for the whole of Brazil reported on in Amaral et al. (2019).
In addition to the incidence of the two diseases in 2016, the data set also contains several
socio-economic indicators from the Brazilian Index of Urban Structure (IBEU) for 2013 (see
Amaral et al., 2019, for a detailed definition of each variable). Ceará Zika is included as a
GeoDa sample data set.
DOI: 10.1201/9781003274919-4
Figure 4.1: Maps and Rates | Cartogram | Map Movie | Category Editor
exploratory data analysis (EDA) and its spatial extension, exploratory spatial data analysis
(ESDA). To put these various concepts into context, next, I briefly discuss the evolution of
EDA and visual analytics as well as how geovisualization and ESDA fit into this framework.
I close with a brief introduction of the important concepts of linking and brushing.
perspective is taken by Hullman and Gelman (2021), who suggest the use of a graph as a
model check. This topic is the subject of ongoing discussion and debate. I revisit some of
these ideas in the postscript (Chapter 21).
MacEachren and Kraak (1997) and MacEachren et al. (1999). Overviews of various techniques can be found
in Dykes et al. (2005), Kraak and MacEachren (2005), Rhyne et al. (2006), Andrienko et al. (2011) and
Andrienko et al. (2018), among others.
A dynamic version of this process is brushing, first proposed for scatter plots in the statistical
literature by Stuetzle (1987) and Becker and Cleveland (1987). It was further extended to
choropleth maps by Monmonier (1989).
The idea behind brushing is that the selection tool (e.g., the selection rectangle on a map)
becomes a moving object. As the rectangle moves over the map, the collection of selected
objects is immediately updated. In this way, one can move the selection brush over the map
(or any statistical graph) and assess the effect of the changing selection.
In GeoDa, the concept of brushing is combined with linking in the sense that the updated
selection is instantaneously transmitted to all the open windows through the linking process.
This provides a very powerful visual tool to assess the effect of the changing selection on
various aspects of the spatial and statistical distributions, in both univariate and multivariate
settings. The linking-brushing combination is critical to support ESDA. The map plays a
central role in this process as an interactive visualization tool, discussed in more detail in
the remainder of the chapter.
4.3.4 Implementation
A thematic map is created in GeoDa from the Map menu item, or by selecting the left-most
toolbar icon in Figure 4.1, Maps and Rates. This yields a list of options, shown in Figure
4.2.
In the current chapter, the focus is on traditional map classifications, i.e., the Quantile
Map, Equal Intervals Map and Natural Breaks Map. The other options are treated
in the next two chapters.
The legend, to the left of the map, shows the type of classification (quantile) and lists the
name of the variable (housing). The five legend categories correspond with increasingly
darker shades of brown. The range of values included in each category is listed in square
brackets and the number of observations in parentheses. The map suggests that
lower values for the housing index tend to be concentrated in the north of the state, with
the highest (darkest) values occupying a band in the southern part.
In a quintile map, each category should contain one fifth of the total number of observations,
or, 184/5 = 36.8, so roughly 37 observations. However, in the legend in Figure 4.3, the
number of observations varies from 36 to 37 and 38. This is examined more closely in Section
4.4.1.1.
Another important characteristic of the quantile map is that the range of values in each
category is not constant. In the example, this varies from 0.1 in the lowest quintile to 0.016
for the fourth. The larger the range, the more heterogeneous the observations are that were
grouped into the same category. In other words, very distinct values may be associated
with the same color on the map, which could easily create misleading impressions about the
characteristics of the spatial distribution.
In a non-spatial analysis, the correct value for the quintile is 0.823, without having to
necessarily specify the actual observation that matches this value. However, in a spatial
analysis, such as a thematic map, each location/observation must be allocated to a map
category. In the example, observations ranked 74 (actual 59), 75 (actual 19), 76 (actual 140)
and 77 (actual 18) all have the same value. In a thematic map, one cannot arbitrarily decide
which of these should be in category 2 and which in category 3. This is the problem of ties
in the ranking that forms the basis for the computation of the quantiles.
GeoDa uses a heuristic that assigns tied observations to the next higher map category. As a
result, the second category in Figure 4.3 contains only 36 observations, whereas the third
category contains 38.
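A minimal sketch of this tie heuristic (an illustrative reconstruction, not GeoDa's source code): values equal to a break point are pushed into the next higher category.

```python
import numpy as np

def quantile_categories(y, k=5):
    """Break a variable into k quantile categories, pushing ties up.
    Returns the k-1 break values and a category index per observation."""
    y = np.asarray(y, dtype=float)
    breaks = np.quantile(y, [i / k for i in range(1, k)])
    # side="right": an observation equal to a break value is placed in
    # the next higher category, mimicking the tie heuristic
    cats = np.searchsorted(breaks, y, side="right")
    return breaks, cats

y = np.array([3, 1, 4, 4, 4, 5, 9, 2, 6, 5], dtype=float)
breaks, cats = quantile_categories(y, k=5)
print(breaks)                           # four break values
print(np.bincount(cats, minlength=5))   # per-category counts: ties stay together
```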
Even though it is often used as the default setting in thematic mapping software, a quantile
map should be interpreted with caution. Widely different value ranges for the quantile
categories could mask underlying heterogeneity. In addition, the existence of ties can create
problems in practice. For example, when a large number of observations have the same value,
some quantile categories may turn out to be empty (GeoDa moves tied observations to the
next higher category). This is the case for the Zika and Microcephaly incidence variables
in the Ceará example data set, where many municipios have an incidence of zero (see also
Section 5.3.1).
In contrast to the quantile map, each category now contains greatly varying numbers of
observations, ranging from 7 for the highest category to 75 for the middle category. As
intended, the range of each category is exactly 0.0466. Both the lowest and the highest
categories contain much fewer observations than in the quantile map.
The patterns suggested by the equal intervals map are quite distinct from those in the
quantile map. The overall impression of a band of the lowest index value municipios in
the north has been replaced by a scattering of 8 observations, not showing any apparent
systematic pattern. Also, the band of 36 observations with the highest category in the
quantile map has been replaced by seven locations, not grouped in any particular way.
The equal intervals map follows the same classification logic as the histogram, illustrated
in Figure 4.6. In the Figure, the two graphs are located side by side, with the highest
bar selected in the histogram. Through the process of linking, this results in the seven
observations from the fifth map category being selected in the map. In practice, inspecting
the histogram (or box map) for the variable under consideration is often instructive in
suggesting whether a quantile map or equal intervals map is appropriate for the distribution
in question.
The patterns suggested by the map have several similar features to the equal intervals
map, in that the middle categories contain most observations, and the lowest and highest
categories are small. However, the intervals are far from equal, ranging from 0.031 for the
fifth category to 0.071 for the first.
The natural breaks bring out a northern pattern for the second category, similar to what
was suggested by the first quintile in the quantile map, and again there is a seeming band
of higher values. However, in contrast to what the quintile map suggested, the associated
observations are not the extremes.
In practice, it is important to go beyond using a single map type, and to compare the
similarities and differences between the patterns suggested by the various map classifications.
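For such comparisons outside GeoDa, the PySAL mapclassify package implements the same three classifications; a brief sketch with illustrative data:

```python
import numpy as np
import mapclassify

# Stand-in for an index variable such as housing, scaled between 0 and 1
y = np.random.default_rng(12345).uniform(0.4, 0.9, 184)

q5 = mapclassify.Quantiles(y, k=5)       # quantile (quintile) map
e5 = mapclassify.EqualInterval(y, k=5)   # equal intervals map
n5 = mapclassify.NaturalBreaks(y, k=5)   # natural breaks (Jenks) map

for c in (q5, e5, n5):
    print(c.bins, c.counts)              # category upper bounds and counts
```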
which are covered respectively in Chapters 6 and 10. Also, the options in the fourth group
(Shape Centers, etc.) were introduced earlier in Section 3.3.4.
A final type of customization pertains to the map legend.
For example, continuing with the Natural Breaks map for the housing variable (Figure 4.7),
a base map that uses the Stamen > TonerLite option (second from bottom in Figure
4.11) with the default map transparency would provide the background illustrated in Figure
4.12. This adds the names of several places as well as some major roads (and the ocean) as
the geographical context for the thematic map.
However, the default map transparency of 0.69 does not do justice to the features of the
thematic map. The Change Map Transparency option (third item in the list in Figure
4.11) provides a slider bar to change the transparency between 1.0 (only the base layer
visible) and 0.0 (only the thematic map visible).7
7 Note that the transparency pertains to the base layer, not to the thematic map, hence for a larger value,
To illustrate this effect, Figure 4.13 shows the same base layer with the transparency set to
0.40. The features of the thematic map are clearly more visible. In practice, some trial and
error may be needed to find a configuration that strikes the proper compromise between the
thematic map and the background layer.
The various base layer sources have quite distinct design features and focus on different
aspects of the geography: some have more detailed street information, others include natural
features. For example, Figure 4.14 shows a base layer from the ESRI > WorldStreetMap
option (with transparency set to 0.40). This provides a visual impression that is quite distinct
from that in the previous maps.
general usage restrictions, so users may want to set up their own accounts for the HERE platform. This can
be accomplished by means of the Basemap Configuration option. A free HERE account can be obtained
from https://developer.here.com. The basemap can then be configured with the App ID and App Key.
The Basemap Configuration option (at the top of Figure 4.11) provides a way to further
customize the sources for the base layer map tiles. Each entry corresponds with an entry in
the layer options following a format group_name.basemap_name,basemap_url.9 In
a typical application, these entries should not be touched. On the other hand, experts are
free to customize.
Finally, in a pinch, it may be necessary to use the Clean Basemap Cache option when
things go wrong (the second option in Figure 4.11).
extent, in practice, one can compensate for this lack of metadata by creative naming of
variables, although this is limited by the 10 character constraint.
The saved image file can be further manipulated in specialized graphics software.
The Background Color determines the background against which the map is drawn. In
most instances, the default of white is the best choice.
4.6.1.1 Design
The custom categories are designed through the interface in Figure 4.17, shown after all
editing has been completed. The initial settings and editing operations will be discussed
below.
There are slight differences in the behavior of the dialog, depending on whether it is invoked
from the toolbar, or as an option from an existing map. In the latter case, the associated
map will be updated as a new classification is being developed. When invoked from the
toolbar, there is nothing to update, but the initial variable may not be meaningful. It is the
first variable in the data table, which is often just an identifier for each observation.
When the dialog opens (or at any point when the New button is selected), the first query is
for the New Categories Title, where a name for the new classification must be specified
(the default is Custom Breaks). In the example in Figure 4.17, Custom1 is the name of
the new classification.
There are three major functions in the interface. First is the general definition of the
classification, carried out through the items in the Create Custom Breaks panel at the
top. This includes the naming of the categories, here Custom1.
The computations for the classification and their visualization in a histogram are based
on the values for a specific variable, the Assoc Var., here sanitation. When the custom
breaks are invoked from an existing map, the variable is entered automatically, but it is
possible to change it from the drop-down list.
As the breaks are edited, the impact on the associated histogram for the classification variable
is shown instantly on the right hand side of the dialog.
Other important categories in the interface are whether the Breaks are User Defined
(the default), or follow a traditional classification (Quantile, the default, Unique Values,
Natural Breaks, or Equal Intervals). These choices may seem counter-intuitive for a
custom break editing operation, but they allow for the creation of custom labels for the
categories (Section 4.6.1.3).
The Color Scheme for the legend is either sequential, diverging (the default), thematic
(i.e., categorical) or custom. This provides an automatic selection of legend colors based
on the ColorBrewer palettes. However, the color for each category can also be specified by
clicking on its box below Edit Custom Breaks by means of the standard color editor (see
Section 4.5.1). Finally, the number of Categories must be specified (the default is 4).11
11 The classification can also be saved as a new variable in the table, using the Save Categories to Table
button.
In the example in Figure 4.17, the color scheme has been set to sequential and the number
of Categories to 5. The main effort consists of determining appropriate break points, which
is considered next.
Figure 4.19: Thematic maps for sanitation and infrastructure using custom classification
then it will be lost and will need to be recreated from scratch the next time the data set is
analyzed.
The project file is created from the menu as File > Save Project. The file is saved with a
file extension of gda. It is a text file that includes XML encoding. For example, a project
file associated with the Ceará data could be zika_ceara.gda. Closer inspection reveals the
section pertaining to custom_classifications in Figure 4.20.
The custom classification section contains all the aspects needed for the definition of the
custom category.
When an analysis is started with the project file as input (e.g., instead of a shape file) then
all transformations and custom categories become immediately available as an option for
any map or histogram.
5
Statistical Maps
In this chapter, I continue the exploration of mapping options with a focus on statistical
maps, in particular maps that are designed to highlight extreme values or outliers. Some of
these classifications were originally introduced in GeoDa and are gradually being adopted by
other exploratory software (e.g., the Python PySAL library). Their layout illustrates the
emphasis on statistical exploration rather than cartographic design. Specifically, this includes
the Percentile Map, Box Map (with two options for the hinge) and the Standard
Deviation Map.
Other topics covered in this chapter include maps for categorical variables in the form of a
Unique Values Map. Construction of this type of map does not involve a classification
algorithm, since it uses the integer values of a categorical variable itself as the map categories.
The Co-location Map is an extension of this principle to multiple categorical variables.
The chapter closes with a brief discussion of the Cartogram and map animation (Map
Movie), which move beyond the traditional choropleth framework to visualize a spatial
distribution.
I continue to use the Ceará Zika sample data set to illustrate the various features.
DOI: 10.1201/9781003274919-5
values are not closely together in space, countering any suggestion of clustering. However,
at least in one case (the northernmost upper percentile observation), a very high value is
surrounded by much lower ones, suggesting the potential of a spatial outlier.
With the focus on the large range of middle values (i.e., 10–50 percentile and 50–90), two
large bands of adjoining municipios in the same category seem to manifest themselves, with
below median values in the northeast and above median values forming a U-shape in the
center of the state.
As for any quantile map, there are some drawbacks to this approach related to ties and
potential heterogeneity of values within the same category. In addition, a percentile map
only makes sense when there are more than 100 observations, which is the case here. It is
particularly useful to identify the location in space of the truly extreme observations.
in the left panel of Figure 5.2. For easy reference, the corresponding box plot is contained in
the right panel.
Compared to a standard quartile map, the box map in Figure 5.2 separates out six lower
outliers from the other 40 observations in the first quartile. They are depicted in dark blue.
Similarly, it separates a single upper outlier from the 43 other observations in the upper
quartile. The upper outlier is colored dark red. The main focus of interest in a box map is to
identify the extent to which the outliers show any kind of spatial pattern. In this example,
there does not seem to be a suggestion of clustering, but possibly of the presence of spatial
outliers (this is explored more formally in Chapters 16 to 19). This constitutes the spatial
perspective that the box map adds to the data exploration.
To further illustrate the correspondence between the box plot and the box map, in Figure
5.3, the six lower outliers are selected in the box plot in the right hand panel. Through
linking, these are also highlighted in the map and correspond exactly to the lower outlier
category. Again, there is a fundamental difference between a traditional box plot where
outlying observations are identified, and the box map, where their location is taken into
account as well.
From the current map, the box map can be switched between the hinge criterion of 1.5 and
3.0 by opening the options menu (right click on the map) and selecting Change Current
Map Type > Box Map (Hinge = 3.0). Alternatively, a new map window can be opened
from the main menu or map toolbar icon by selecting Box Map (Hinge=3.0) as the
option.
The box map is arguably the preferred method to quickly and efficiently identify outliers
and broad spatial patterns in a data set, although it does suffer from the same drawbacks as
any other quantile map.
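The box map classification can also be sketched with mapclassify's BoxPlot, which adds lower and upper outlier categories (beyond the hinge times the interquartile range) to the four quartiles; the data below are illustrative:

```python
import numpy as np
import mapclassify

# Illustrative skewed variable for 184 observations
y = np.random.default_rng(7).lognormal(0.0, 0.6, 184)

bm15 = mapclassify.BoxPlot(y)              # default hinge = 1.5
bm30 = mapclassify.BoxPlot(y, hinge=3.0)   # stricter outlier definition

print(bm15.bins, bm15.counts)   # six categories, incl. lower/upper outliers
print(bm30.bins, bm30.counts)   # typically fewer observations flagged as outliers
```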
In practice, it is often useful to compare the outliers identified by both the non-parametric
(box map) and parametric (standard deviation map) approaches. Of particular importance is
whether extreme values show an interesting spatial pattern, which can then be more formally
investigated by means of the local spatial autocorrelation indicators in Chapters 16 to 19.
Figure 5.6: Co-location map variable selection, Zika and Microcephaly indicators
as real. This can be readily changed using Edit Variable Properties in the table, see Section 2.4.1.1.
3 http://colorbrewer2.org/#type=qualitative&scheme=Accent&n=3
numerical values associated with them do not imply any ordering, they can be changed.
This is accomplished by grabbing the associated legend rectangle and moving it up or down
in the list.
Such reordering can be handy in a so-called cluster map where the categories correspond to
different cluster classifications. The comparison of two cluster maps can be facilitated by
moving the categories around so that the same colors more or less correspond to the same
locations.4
variables. Only those locations that take on a value of 1 for all variables will be coded as
one in the resulting unique values map.
However, a co-location map is different in that it also provides information on matches of
locations with 0 for all variables, as well as on the locations of mismatches. In sum, rather
than just two categories (match and no match), there are three: match of 1, match of 0 and
mismatch.
The map is invoked in the usual way as Map > Co-location Map, which brings up a
dialog to specify the different variables to be considered, as in Figure 5.6. The interface
is used to select the variables but also to choose the associated legend structure. Several
options are provided, ranging from Unique Values, the suggested default in this case, to
several customized legends associated with different types of maps (such as extreme values
maps) and visualizations of spatial analyses (e.g., LISA Map, see Chapter 16).
In our example, the Unique Values default will do fine.
The two variables selected are indicator variables for the presence of respectively Zika
(zika_d) and Microcephaly (mic_d). The corresponding co-location map is shown in
Figure 5.7.
The legend contains three categories. One pertains to those locations with a common
occurrence of 1 (17 observations), and another to those observations that share a value of 0
(115 observations). The third category (in grey) highlights the locations where there is a
mismatch between the values (52 observations).
The logic of the co-location map is further illustrated in Figure 5.8. It shows the selected
observations in each of the respective unique values maps that correspond to locations with
a value of 1 in the co-location map. There are 17 observations selected in each of the maps.
Clearly, these are the only locations where both maps have a dark blue color.
locations. Or, to assess whether significant patterns of local spatial autocorrelation match
across multiple variables. But it is also very easy to generate nonsensical results, for example,
when the labels are not comparable.
One can change the label and color of a category by moving the legend rectangle up or
down in the legend. This again highlights that the category values do not have any intrinsic
numerical value.
5.4 Cartogram
5.4.1 Principle
A cartogram is a map type where the original layout of the areal unit is replaced by a regular
geometric shape (usually a circle, rectangle, or hexagon) that is proportional to the value of
the variable for the location. This is in contrast to a standard choropleth map, where the
size of the polygon corresponds to the area of the location in question. The cartogram has a
long history and many variants have been suggested, some quite creative (see Dorling, 1996;
Tobler, 2004, for an extensive discussion of various aspects of the cartogram). In essence, the
construction of a cartogram is an example of a nonlinear optimization problem, where the
geometric forms have to be located such that they reflect the topology (spatial arrangement)
of the locations as closely as possible.
GeoDa implements a circular cartogram, in which the areal units are represented as circles,
whose size (and color) is proportional to the value observed at that location. The changed
shapes remove the misleading effect that the area of the unit might have on perception of
magnitude.
As shown in Figure 5.11, the current cut-off value for sanitation is 0.536. This results in
the 21 lowest observations being highlighted in the map on the left. Through linking,
the corresponding observations in the map on the right, for infra, are highlighted as well.
This allows for the comparison of the cumulative locations of the values from low to high
between the two variables. More precisely, one assesses the extent to which the colors for
the matching locations in the two maps follow the same pattern. In the example, several of
the lowest observations for sanitation belong to a higher quintile in the map for infra.
An extension to multiple maps and graphs is straightforward.
6
Maps for Rates
In Chapters 4 and 5, the maps pertained to a single variable. In the current chapter, I
deal with some special aspects related to the mapping of rates or proportions. In GeoDa,
such variables can be constructed on the fly, by specifying a numerator and a denominator.
The numerator is typically a type of event, such as the incidence of a disease, and the
denominator is the population at risk. Such data have broad applications in public health
and criminology. However, several of the principles covered can be equally applied to any
ratio, such as an unemployment rate (number of unemployed people in a given region as a
share of the labor force), or any other per capita measure (e.g., gross regional product per
capita).
The chapter contains three main sections. First, the creation of a raw rate or crude rate map
is considered, which is the same as the types of maps considered so far, except that the rate
is calculated on the fly. Next, the focus shifts to so-called excess risk maps, which compute a
measure that compares the value at each location to an overall average, highlighting extreme
observations. Such excess risk is known under different terms in various literatures, such
as a standardized mortality rate (SMR) in demography (Preston et al., 2001), or a location
quotient (LQ) in regional economics (McCann, 2001).
Finally, attention shifts to the important topic of variance instability that pertains to any
rate measure, and the associated concept of rate smoothing. In essence, the precision of
the rate as an estimate for the underlying risk depends on the size of the denominator. In
practice, this means that rates estimated for small populations (e.g., rural areas) may have
large standard errors and provide imprecise estimates for the actual risk. This may lead to
erroneous suggestions of extreme values, such as the presence of outliers. Rate smoothing
techniques use a Bayesian logic to borrow strength and adjust the small area estimates. There
is a large literature in statistics dealing with such techniques (e.g., Lawson et al., 2003).
Here, the discussion will be limited to the basic principle, illustrated by the most common
form of Empirical Bayes smoothing.
I continue to use the Ceará Zika sample data set to illustrate the various features.
GeoDa Functions
• Map > Rates-Calculated Map
– Raw Rate
– Excess Risk
– Empirical Bayes
– saving calculated rates to the table
• Table > Calculator > Rates
– Raw Rate
– Excess Risk
– Empirical Bayes
To illustrate this concept, consider the box maps in Figures 6.1 and 6.2, depicting, respectively,
population (pop) and population density (popdens), and GDP (gdp) and GDP per capita
(gdpcap) for the municipios in the state of Ceará. The left panel pertains to the spatially
extensive variable, while the right-hand panel shows the corresponding spatially intensive
variable.
Whereas the population box map suggests 10 outliers, several of which are spread throughout
the state, the population density map has 13 outliers, mostly concentrated around the urban
area of Fortaleza. Also, as highlighted in the two maps, several of the larger areas that rank
in the upper quartile for total population drop to the first quartile in terms of population
density, i.e., after the area of the municipios is corrected for (their colors change from
browns to blues). In addition, most of the outliers in the larger (rural) areas are no longer
characterized as such in the population density map.
A similar phenomenon occurs in the maps for total GDP and GDP per capita. The number
of outliers for GDP (a highly skewed variable) drops from 25 to 18 in the per capita map.
Again, most of the larger (and rural) areas drop in the ranking, highlighted by their change
in color from browns to blues.
In sum, a proper indication of the variability in the spatial distribution of the variable of
interest is only obtained when the latter is converted to a spatially intensive form.
Everything is the same as before, except that the map heading spells out both the numerator
and denominator variables: Raw Rate gdp over pop. Specifically, the map is identical
to the right-hand panel in Figure 6.2, which uses the gdpcap variable from the data table,
rather than computing the ratio explicitly.
A rate map retains all the standard options of a map in GeoDa (see Section 4.5). Two
additional features are Rates and Save Rates. Rates brings up a list of the different rate
map options, since these are not available in the Change Current Map Type function in
the main options menu. The Save Rates option allows for the calculated rates to be added
to the data table. The default variable name for a raw rate is R_RAW_RT. As before,
the new variable only becomes a permanent addition after a Save or Save As operation.
divergent, using blue hues to indicate values smaller than one and brown hues for values
larger than one. The intervals are 1–2, 2–4 and greater than 4 on the high end, and 0.5–1,
0.25–0.5, and <0.25 on the low end.
The map is invoked as Maps > Rates-Calculated Maps > Excess Risk in the usual
fashion. Again, both the Event Variable and the Base Variable must be specified. In
Figure 6.7, the resulting map is shown for gdp and pop as the numerator and denominator.
The great majority of municipalities do not reach the state GDP per capita, yielding a
blue hue in the map. Only nine observations have a GDP per capita that exceeds the state
average by a multiple: five between 1 and 2, three between 2 and 4, and one observation has
an excess risk rate greater than 4, highlighted in the map.
As it turns out, the identified observation is the small rural municipality of Quixeré, which
specializes in high-tech agriculture. Due to the highly capital-intensive nature of that production
and its small population, it has an extremely high GDP per capita from that activity (see
also Figure 6.8). However, this extreme value may be an artifact of the particular data set.1
All standard map options also apply to the excess risk map.
GDP is from the 2013 IBGE publication. As it turns out, in later years, this figure was revised downward.
is 77,865,442 reais (i.e., the sum of the GDP in all municipalities), with a matching total
population of 8,452,380. Consequently, the state GDP per capita is 9.21. Clearly, the GDP
per capita of 40 for Quixeré exceeds this more than four-fold. As pointed out in the review
of the map, this is actually an exception, with most of the municipalities not matching the
state average. Figure 6.8 shows the top nine municipalities with excess rates greater than
one. More importantly, the spatial distribution shown in the map in Figure 6.7 highlights
the importance of spatial heterogeneity. Specifically, assuming that the economies of the
municipalities in the state follow the state GDP per capita is highly misleading, as only a
few such locations meet or exceed that standard.
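A sketch of the excess risk computation that underlies such a map, with hypothetical numerator and denominator values:

```python
import numpy as np

def excess_risk(events, base):
    """Excess risk: the raw rate relative to the overall reference rate."""
    rate = events / base                      # crude rate in each areal unit
    reference = events.sum() / base.sum()     # e.g., statewide GDP per capita
    return rate / reference

# Hypothetical GDP (numerator) and population (denominator) for four units
gdp = np.array([150.0, 900.0, 60.0, 4000.0])
pop = np.array([20.0, 80.0, 10.0, 100.0])
print(excess_risk(gdp, pop))   # values above 1 exceed the overall average
```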
with $\pi_i$ as the underlying risk parameter. The mean of this binomial distribution is $\pi_i P_i$,
and the variance is $\pi_i (1 - \pi_i) P_i$.
Returning to the crude rate $r_i = O_i / P_i$, it can be readily seen that its mean corresponds to
the underlying risk:3
$$E[r_i] = E\left[\frac{O_i}{P_i}\right] = \frac{E[O_i]}{P_i} = \frac{\pi_i P_i}{P_i} = \pi_i.$$
Consequently, the crude rate is an unbiased estimator for the underlying risk.
However, the variance has some undesirable properties. A little algebra yields the variance
as:
$$\mathrm{Var}[r_i] = \frac{\pi_i (1 - \pi_i) P_i}{P_i^2} = \frac{\pi_i (1 - \pi_i)}{P_i}.$$
This result implies that the variance depends on the mean, a non-standard situation and an
additional degree of complexity. More importantly, it implies that the larger the population
of an area (Pi in the denominator), the smaller the variance for the estimator, or, in other
words, the greater the precision.
The flip side of this result is that for areas with sparse populations (small Pi ), the estimate
for the risk will be imprecise (large variance). Moreover, since the population typically
varies across the areas under consideration, the precision of each rate will vary as well. This
variance instability needs to somehow be reflected in the map, or corrected for, to avoid a
spurious representation of the spatial distribution of the underlying risk. This is the main
motivation for smoothing rates, considered next.
who showed that in some instances biased estimators may have better precision in a mean
squared error sense (James and Stein, 1961).
Formally, the mean squared error or MSE is the sum of the variance and the square of the
bias. For an unbiased estimator, the latter term is zero, so then MSE and variance are the
same. The idea of borrowing strength is to trade off a (small) increase in bias for a (large)
reduction in the variance component of the MSE. While the resulting estimator is biased, it
is more precise in a MSE sense. In practice, this means that the chance is much smaller to
be far away from the true value of the parameter.
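In symbols, for an estimator $\hat{\theta}$ of a parameter $\theta$, this decomposition (a standard result, stated here for completeness) is:
$$\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \mathrm{Var}(\hat{\theta}) + \left(E[\hat{\theta}] - \theta\right)^2,$$
so a biased estimator attains a smaller MSE whenever its reduction in variance outweighs the squared bias.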
The implementation of this idea is based on principles of Bayesian statistics, which are
briefly reviewed next.
where A and B are random events, and | stands for the conditional probability of one event,
given a value for the other. The second equality yields the formal expression of Bayes law as:
$$P[A|B] = \frac{P[B|A] \times P[A]}{P[B]}.$$
In most instances in practice, the denominator in this expression can be ignored, and the
equality sign is replaced by a proportionality sign:
$$P[A|B] \propto P[B|A] \times P[A].$$
In the context of estimation and inference, the A typically stands for a parameter (or a
set of parameters) and B stands for the data. The general strategy is to update what is
known about the parameter A a priori (reflected in the prior distribution P [A]), after
observing the data B, to yield a posterior distribution, P [A|B], i.e., what is known about
the parameter after observing the data. The link between the prior and posterior distribution
is established through the likelihood, $P[B|A]$. Using a more conventional notation with $\pi$
as the parameter and $y$ as the observations, this gives:5
$$P[\pi | y] \propto P[y | \pi] \times P[\pi].$$
For each particular estimation problem, a distribution must be specified for both the prior
and the likelihood. This must be carried out in such a way that a proper posterior distribution
results. Of particular interest are so-called conjugate priors, which result in a closed form
expression for the combination of likelihood and prior distribution. In the context of rate
smoothing, there are a few commonly used priors, such as the Gamma and the Gaussian
4 There are several excellent books and articles on Bayesian statistics, with Gelman et al. (2014) as a
classic reference.
5 Note that in a Bayesian approach, the likelihood is expressed as a probability of the data conditional
upon a value (or distribution) of the parameters. In classical statistics, it is the other way around.
(normal) distribution.6 A formal mathematical treatment is beyond the current scope, but it
is useful to get a sense of the intuition behind smoothing approaches.
In essence, it means that the estimate from the data (i.e., the crude rate) is adjusted with
some prior information, such as the reference rate for a larger region (e.g., the state or
country). Unreliable small area estimates are then shrunk toward this reference rate. For
example, if a small area is observed with zero occurrences of an event, does that mean that
the risk is zero as well? Typically, the answer will be no, and the smoothed risk will be
computed by borrowing information from the reference rate.
The posterior distribution of the risk is then a Gamma distribution with mean
$$E[\pi] = \frac{O + \alpha}{P + \beta}$$
and
$$\mathrm{Var}[\pi] = \frac{O + \alpha}{(P + \beta)^2},$$
where $\alpha$ and $\beta$ are the shape and scale parameters of the prior (Gamma) distribution.7
In the Empirical Bayes approach, values for α and β are estimated from the actual data.
The smoothed rate is then expressed as a weighted average of the crude rate, say r, and
the prior estimate, say θ. The latter is estimated as a reference rate, typically the overall
statewide average or some other standard.
In essence, the EB technique consists of computing a weighted average between the raw
rate for each small area and the reference rate, with weights proportional to the underlying
population at risk. Simply put, small areas (i.e., with a small population at risk) will tend to
have their rates adjusted considerably, whereas for larger areas the rates will barely change.8
More formally, the EB estimate for the risk in location $i$ is:
$$\pi_i^{EB} = w_i r_i + (1 - w_i)\theta,$$
with the weight:
$$w_i = \frac{\sigma^2}{\sigma^2 + \mu / P_i},$$
with $P_i$ as the population at risk in area $i$, and $\mu$ and $\sigma^2$ as the mean and variance of the
prior distribution.9
6 For an extensive discussion, see, for example, the classic papers by Clayton and Kaldor (1987) and
Marshall (1991).
7 A Gamma(α, β) distribution with shape and scale parameters α and β has mean $E[\pi] = \alpha/\beta$ and variance $\mathrm{Var}[\pi] = \alpha/\beta^2$.
In the empirical Bayes approach, the mean $\mu$ and variance $\sigma^2$ of the prior (which determine
the scale and shape parameters of the Gamma distribution) are estimated from the data.
For $\mu$ this estimate is simply the reference rate (the same reference used in the computation
of the SMR), $\sum_{i=1}^{n} O_i / \sum_{i=1}^{n} P_i$. The estimate of the variance is a bit more complex:
$$\sigma^2 = \frac{\sum_{i=1}^{n} P_i (r_i - \mu)^2}{\sum_{i=1}^{n} P_i} - \frac{\mu}{\sum_{i=1}^{n} P_i / n}.$$
While easy to calculate, the estimate for the variance can yield negative values. In such
instances, the conventional approach is to set σ 2 to zero. As a result, the weight wi becomes
zero, which in essence equates the smoothed rate estimate to the reference rate.
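Putting these pieces together, a compact sketch of Empirical Bayes rate smoothing in numpy (an illustrative implementation of the formulas above, not GeoDa's source code), taking the reference rate as the prior estimate $\theta$:

```python
import numpy as np

def eb_smooth(events, population):
    """Empirical Bayes smoothed rates using the moment estimators above."""
    O = np.asarray(events, dtype=float)
    P = np.asarray(population, dtype=float)
    r = O / P                                     # crude rates
    mu = O.sum() / P.sum()                        # reference rate (prior mean)
    sigma2 = (P * (r - mu) ** 2).sum() / P.sum() - mu / (P.sum() / len(P))
    sigma2 = max(sigma2, 0.0)                     # negative estimate set to zero
    w = sigma2 / (sigma2 + mu / P)                # weights approach 1 for large P
    return w * r + (1.0 - w) * mu                 # shrink small areas toward mu

# Hypothetical event counts and populations at risk
events = np.array([0.0, 10.0, 500.0, 4000.0])
pop = np.array([100.0, 500.0, 10000.0, 20000.0])
print(eb_smooth(events, pop))   # the zero-count small area is pulled toward mu
```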
municipalities in the map (blue rectangle) show three small areas whose smoothed rates
pulled the original zero value above the median.
7
Univariate and Bivariate Data Exploration
In this and the next two chapters, I continue the review of data exploration, but now shift
the focus to traditional non-spatial EDA methods. Through the use of linking and brushing,
the connection with a spatial representation (a map) can always be made explicit. This idea
of spatializing EDA is central to the perspective taken here.
In the current chapter, I focus on techniques to describe the distribution of one variable
at a time (univariate), and on the relationship between two variables (bivariate) through
standard statistical graphs. These include the histogram, box plot, scatter plot and scatter
plot matrix, considered in turn. The chapter closes with a discussion of spatial heterogeneity,
both for a single variable (through the averages chart) and pertaining to the relationship
between two variables (brushing the scatter plot).
To illustrate the methods covered in this and the next two chapters, I introduce a new
sample data set with poverty indicators and census data for 570 municipalities in the state
of Oaxaca in Mexico. The poverty indicators are from CONEVAL (the National Council for
the Evaluation of Social Development Policy) for 2010 and 2020. The census variables cover
2000, 2010 and 2020 and are from INEGI (National Institute of Statistics and Geography).
The data are contained in the Oaxaca Development sample data set.
Figure 7.1: Histogram | Box Plot | Scatter Plot | Scatter Plot Matrix
7.2.1 Histogram
Arguably the most familiar statistical graphic is the histogram, which is a discrete rep-
resentation of the density function of a continuous variable. In essence, the range of the
variable (the difference between maximum and minimum) is divided into a number of equal
intervals (or bins), and the number of observations that fall within each bin is depicted
proportional to the height of a bar. This classification is the same as the principle underlying
the equal intervals map, which was covered in Section 4.4.2. The main challenge in creating
an effective visualization is to find a compromise between too much detail (many bins,
containing few observations) and too much generalization (few bins, containing a broad
range of observations).
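The binning principle itself (not GeoDa's implementation) boils down to a few lines of numpy; the variable below is hypothetical:

import numpy as np

x = np.random.default_rng(1).uniform(0, 100, 570)  # hypothetical variable, 570 areas
counts7, edges7 = np.histogram(x, bins=7)    # coarse: 7 equal-width bins over the range
counts12, edges12 = np.histogram(x, bins=12) # finer: 12 bins, fewer observations per bin
# each bar's height is the count of observations in [edges[i], edges[i+1])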
lower percentages, highlighting the prevalence of high-poverty municipalities (on the right).
The histogram options are brought up in the usual fashion, by right-clicking on the graph.
The options are grouped into three categories, the first consisting of histogram-specific items:
Choose Intervals, Histogram Classification and View. Next is a Color option to set
the background color (the default of white is usually best). Finally, the options to Save
Selection, Copy Image to Clipboard and Save Image As work in the same way as
for maps (see Section 4.5.5).
Of these options, the Choose Intervals is the most commonly used. It brings up a dialog
to specify the number of bins for the histogram. For example, after changing from the default
of seven to twelve bins, the histogram becomes as in Figure 7.3. In this particular instance,
there is not much gained by the greater detail since the same overall pattern is maintained.
The Histogram Classification option allows for the selection of a custom category speci-
fication. This works in the same way as for map classification, covered in Section 4.6.
The View option contains six items. Four of these pertain to the overall look of the graph
and are self-explanatory: Set Display Precision (for the variable being depicted), Set
Display Precision on Axes, Show Axes and Status Bar. The latter two are checked
by default.
million inhabitants. CATMAX20 is the highest category obtained for a locality in the municipality. In this
example, the maximum is 9, for 30,000 to 49,999 inhabitants.
locations in the elevation map. The resulting configuration does not appear to be random
(our prior expectation, or null hypothesis), but the respective municipalities are concentrated
in the top elevation categories. In other words, settlements with smaller populations tend to
be located at higher elevations (not an unexpected result, but still worth checking).
The reverse linking between the map and the histogram is illustrated in Figure 7.7. The selection rectangle in the map results in the corresponding observations being highlighted
in the histogram. Here, the focus is on spatial heterogeneity (further elaborated upon in
Section 7.5). Everything else being the same, the expectation (null hypothesis) would be
that the distribution of the selected observations largely follows that of the whole. In other
words, in the histogram, the heights of the bars of the selected observations should roughly
be proportional to the corresponding heights in the full data set. In the example, this clearly is not the case: the distribution in the spatial subset differs from the overall distribution, pointing to the presence of spatial heterogeneity.
The process can be made dynamic through brushing, i.e., by moving the selection rectangle
over the map. This results in an immediate adjustment of the selected observations in
the histogram, allowing for an assessment of spatial heterogeneity by eye. A more formal
assessment is pursued in Section 7.5.
7.2.2.1 Implementation
The box plot is invoked as Explore > Box Plot from the menu, or by selecting the Box Plot icon, second from the left in the toolbar in Figure 7.1. As with the histogram, a Variable Settings dialog appears next to select the variable.4
To illustrate this graph, we select the variable c_ptot12, the percentage population change
between 2020 and 2010 (positive values are population growth). The default box plot, with
3 GeoDa takes the vertical approach.
4 In GeoDa, the default is that the variable from any previous analysis is already selected.
Figure 7.9: Box map for population change 2020–2010 (upper outliers selected)
a hinge of 1.5, is shown in the left-hand panel of Figure 7.8. The descriptive statistics are
listed at the bottom.5
The observations range from −31.2% to 113.3%, with a slightly positive median of 3.0% (the mean is 4.0%, reflecting the influence of upper outliers). The interquartile range is 13.9. Consequently, the upper hinge is roughly 10.7 (Q3) + 1.5 × 13.9, or 31.6%. Fifteen observations take on values that are larger than this upper hinge and are designated as upper outliers. The lower hinge is −3.2 (Q1) − 1.5 × 13.9, or −24.1%. Ten observations have population decreases that are even larger, hence these are lower outliers.
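The fence computation itself is straightforward; a small sketch (numpy quartiles may differ marginally from GeoDa's hinge calculation, and the data here are hypothetical):

import numpy as np

x = np.random.default_rng(2).normal(4.0, 15.0, 570)  # hypothetical population change (%)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr   # hinge of 1.5; use 3.0 to flag only extreme outliers
lower_fence = q1 - 1.5 * iqr
upper_outliers = x[x > upper_fence]
lower_outliers = x[x < lower_fence]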
6 The basemap is Stamen > TonerLite.
The View option contains four items, all self-explanatory: Set Display Precision, Set
Display Precision on Axes, Display Statistics and Show Vertical Axis.
The typical multiplier for the IQR to determine outliers is 1.5 (roughly equivalent to the
practice of using two standard deviations in a parametric setting). However, a value of 3.0
is fairly common as well, which considers only truly extreme observations as outliers. The
multiplier to determine the fence can be changed with the Hinge > 3.0 option (obtained
by right clicking in the plot to select the options menu, and then choosing the hinge value).
This yields the box plot shown in the right-hand panel of Figure 7.8. The new hinge no
longer yields lower outliers, and the upper outliers are reduced to five extreme observations.
The main purpose of the box plot in an exploratory strategy is to identify outlier observations
in an a-spatial sense. Its spatial counterpart is the box map.
$$y = a + bx + \epsilon,$$
where a is the intercept, b is the slope and ε is a random error term. The coefficients are estimated by minimizing the sum of squared residuals, a so-called least squares fit, or ordinary least squares (OLS).
The intercept a is the average of the dependent variable (y) when the explanatory variable
(x) is zero. The slope shows how much the dependent variable changes on average (Δy) for
a one unit change in the explanatory variable (Δx). It is important to keep in mind that
the regression applied to the scatter plot pertains to a linear relationship. It may not be
appropriate when the variables are non-linearly related (e.g., a U shape).
When y and x are standardized (mean zero and variance one), then the slope of the regression
line is the same as the correlation between the two variables (the intercept is zero). Note
that while correlation is a symmetric relationship, regression is not, and the slope will be
different when the explanatory variable takes on the role of dependent variable, or vice
versa. Just as a linear regression fit may not be appropriate when the variables are related in a non-linear way, neither is the correlation coefficient: it only measures a linear relationship
between two variables, which is clearly demonstrated by its equivalence to the regression
slope between standardized variables.
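This equivalence is easy to verify numerically; a short sketch with hypothetical data:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)        # hypothetical related variables
zx = (x - x.mean()) / x.std()             # standardize: mean zero, variance one
zy = (y - y.mean()) / y.std()
slope = np.polyfit(zx, zy, 1)[0]          # OLS slope on the standardized data
corr = np.corrcoef(x, y)[0, 1]            # Pearson correlation of the raw data
# slope and corr agree up to floating point error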
To illustrate this graph, the X variable is ppov_20 (percent population living in poverty in
2020) and the Y variable pfood_20 (percent population with food insecurity in 2020). The
resulting scatter plot, with all the default settings, is as in Figure 7.10.
The top of the graph spells out the X and Y variables, which are also listed along the axes.
The linear regression fit is drawn over the points, with summary statistics listed below the
graph. These include the fit (R²) and the estimates for constant and slope, with estimated
standard error, t-statistic and associated p-value. In the example, the fit is a low 0.091,
which is not surprising, given the cone shape of the scatter plot, rather than the ideal cigar
shape. Nevertheless, both intercept and slope coefficients are highly significant (rejecting the
null hypothesis that they are zero) with p-values of 0.002 and 0.000 respectively.
The positive relationship suggests that as poverty goes up by one percentage point, the share of the population with food insecurity increases by 0.24 percentage points.
In the current setup, no observations are selected, so that the second line in the statistical
summary in Figure 7.10 (all red zeros) has no values. This line pertains to the selected
observations. The blue line at the bottom relates to the unselected observations. The sum
of the number of observations in each of the two subsets always equals the total number
of observations, listed on the top line. The three lines are included because of the default
View setting of Regimes Regression, even though there is currently no active selection
(see Section 7.3.1.1).
The scatter plot has several options, invoked in the customary fashion by right clicking on
the graph. They are grouped into eight categories:
• Selection Shape
• Data
• Smoother
• View
• Color
• Save Selection
• Copy Image to Clipboard
• Save Image As
Selection Shape works as in the other graphs and maps (see Section 2.5.4), and so do the last three options. The Color options allow for the Regression Line Color, Point
Color and Background Color to be set. The remaining options are discussed in more
detail below.
These options are mostly self-explanatory, or work in the same way as before (such as
Statistics), except for the Regimes Regression.
The latter allows for a dynamic assessment of structural stability by comparing the regression
line for a selected subset of the data to its complement (unselected). For example, in Figure
7.11, 206 observations have been selected (inside the selection rectangle). This yields three
regression lines: a purple one for the full data set (same results as in Figure 7.10), a red one
for the selected observations and a blue one for its complement (384 observations). Each regression line corresponds to a line in the statistics summary. In Figure 7.11, the selected
observations show no significant linear relationship between the two variables. In other words,
for a range of poverty roughly between 30% and 70%, there is no linear relationship with
food insecurity (p-value of 0.542), whereas for the whole and for the complement, there is.
This is an indication of structural instability.
A formal assessment of structural stability is based on the Chow test (Chow, 1960). This
statistic is computed from a comparison of the fit in the overall regression to the combination
of the fits of the separate regressions, while taking into account the number of regressors (k).
In our simple example, there is only the intercept and the slope, so k = 2. The residual sum
of squares can be computed for the full regression (RSS) and for the two subsets, say RSS1
and RSS2. Then, the Chow test follows as:
$$C = \frac{(RSS - (RSS_1 + RSS_2))/k}{(RSS_1 + RSS_2)/(n - 2k)},$$
distributed as an F statistic with k and n − 2k degrees of freedom. Alternatively, the statistic
can also be expressed as having a chi-squared distribution, which is more appropriate when
multiple breaks and several coefficients are considered.7 In our example, the p-value of the
Chow test is 0.039, which is only weakly significant (in part this is due to the fact that the
overall regression coefficient of 0.243 is not that different from zero itself). The assessment
of structural stability is considered further in the context of spatial heterogeneity in Section
7.5.2.
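A minimal sketch of this computation (the function chow_test is illustrative, not GeoDa's code; selected is a numpy boolean mask identifying the subset):

import numpy as np
from scipy import stats

def chow_test(x, y, selected, k=2):
    def rss(xs, ys):
        # residual sum of squares for the regression ys = a + b * xs
        X = np.column_stack([np.ones_like(xs), xs])
        beta = np.linalg.lstsq(X, ys, rcond=None)[0]
        resid = ys - X @ beta
        return resid @ resid
    n = len(y)
    rss_all = rss(x, y)
    rss_sub = rss(x[selected], y[selected]) + rss(x[~selected], y[~selected])
    c = ((rss_all - rss_sub) / k) / (rss_sub / (n - 2 * k))
    return c, stats.f.sf(c, k, n - 2 * k)   # statistic and F(k, n - 2k) p-value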
The Regimes Regression option is on by default. Turning it off disables the effect of any
selection and reduces the Statistics to a single line.
linear fit has been turned off. However, in some instances, it may be insightful to have both
fits showing in the scatter plot.
In the example, the curve shows a fairly regular, near linear increase up to a poverty level of
about 60%, but a much more irregular pattern at higher poverty rates, with a very steep
positive relationship at the very high end. In the middle range (the same range of ppov_20
as selected in Figure 7.11), the curve is almost flat, in rough agreement with what was found
in the regime regression example.
determines how many times the fit is adjusted by refining the weights. A smaller value for this option will
speed up computation, but result in a less robust fit. The Delta Factor drops points from the calculation of
the local fit if they are too close (within Delta) to speed up the computations. Technical details are covered
in Cleveland (1979).
A higher value of the bandwidth results in a smoother curve. For example, with the bandwidth
set to 0.6, the curve in Figure 7.14 results. This more clearly suggests different patterns
for three subsets of poverty rates: a slowly increasing slope at the lower end, a near flat
slope in the middle and a much steeper slope at the upper end. In a data exploration, one
would be interested in finding out whether these a-spatial subsets have an interesting spatial
counterpart, for example, through linking with a map.
The opposite effect is obtained when the bandwidth is made smaller. For example, with a
value of 0.05, the resulting curve in Figure 7.15 is much more jagged and less informative.
The literature contains many discussions of the notion of an optimal bandwidth, but in
practice a trial and error approach is often more effective. In any case, a value for the
bandwidth that follows one of these rules of thumb can be entered in the dialog. Currently,
GeoDa does not compute optimal bandwidth values.
Finally, while one might expect the LOWESS fit and a linear fit to coincide with a bandwidth
of 1.0, this is not the case. The LOWESS fit will be near-linear, but slightly different from a
standard least squares result due to the locally weighted nature of the algorithm.
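To experiment with the bandwidth outside of GeoDa, the LOWESS implementation in statsmodels can be used (the data below are hypothetical; frac plays the role of the bandwidth):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, 570)                  # hypothetical poverty rates
y = 20 + 0.25 * x + rng.normal(0, 10, 570)    # hypothetical food insecurity
smooth = lowess(y, x, frac=0.6)     # larger bandwidth: smoother curve
jagged = lowess(y, x, frac=0.05)    # very small bandwidth: jagged, less informative
# each result is an array of (x, fitted y) pairs, sorted by x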
7.4.1 Implementation
The scatter plot matrix operation is started by selecting the fourth icon on the EDA toolbar
(Figure 7.1), or by choosing Explore > Scatter Plot Matrix from the menu.
This brings up a Scatter Plot Matrix Variables Add/Remove dialog, through which
the variables are selected. The design of the interface is such that one selects a variable from
the Variables list on the left and clicks on the right arrow > to include it in the Include
list on the right. Alternatively, one can double click on the variable name to move it to the
right-hand column. The left arrow < is there to remove a variable from the Include list.
As soon as two variables are selected, the scatter plot matrix is rendered in the background.
As new variables are added to the Include list, the matrix in the background is updated
with the additional scatter plots.
Figure 7.16 shows an example with four variables that were already used in the illustrations
above: ppov_20, pfood_20, c_ptot12 and ALTID. Note that the latter really should
not be portrayed on the Y-axis, although by construction this always happens for every
variable in the scatter plot matrix.
With the default setting, each scatter plot shows the slope coefficient at the top, together with
a designation of significance (** is p < 0.01, * is p < 0.05). Ignoring the bottom row (with
ALTID on the Y-axis), significant coefficients are obtained in all instances, except between
pfood_20 and c_ptot12. In other words, food insecurity and population change do not
seem to be related. On the other hand, there is a strong and negative relationship between
ppov_20 and c_ptot12. Altitude is significantly related with all three socio-economic
outcomes, positively with ppov_20 (greater poverty at higher altitudes), but negatively
with the other two. A substantive interpretation is beyond the scope.
• Data
• Smoother
• View
• Color
• Save Image As
The first option invokes the Scatter Plot Matrix Variables Add/Remove dialog to
allow for changes in the variable selection. Selection Shape and Save Image As work in
the usual fashion.
View has options to Set Display Precision on Axes, Regimes Regression and Dis-
play Slope Values. The latter is checked by default. In contrast to the standalone scatter
plot, Regimes Regression is turned off by default in the scatter plot matrix. However,
it can readily be invoked, allowing for all features of linking and brushing, e.g., to explore
spatial heterogeneity across all variables (see Section 7.5.2).
The Color option provides a way to change the Regression Line Color and Point Color.
Data and Smoother are separately considered next.
The result for the four variables considered before is as in Figure 7.17.
is equivalent to a t-test on the coefficient of the indicator variable, since there is only one
slope.
This F-statistic is basically a test on whether there is a significant gain in explanation in the
regression beyond the overall mean (i.e., the constant term). Formally, the statistic uses the
sum of squared residuals in the regression RSS and the sum of squared deviations from the
mean for the dependent variable RSY. The statistic follows as:
$$F = \frac{(RSY - RSS)/(k - 1)}{RSS/(n - k)},$$
with k as the number of explanatory variables. In our simple dummy variable regression,
k = 2, so that the degrees of freedom for the F-statistic are 1, n − 2 (see also Anselin and
Rey, 2014, pp. 98–99).
the time settings can be ignored. The statistics for the two groups are listed in a small table.
Selected has 121 observations with a Mean of 17.65 and S.D. of 7.74. In contrast, the
Unselected group contains 449 observations, with Mean of 29.09 and S.D. of 15.07.
The formal test on equality of means results in an F-statistic of 64.99, which yields a very small p-value (essentially zero) for 1 and 568 degrees of freedom.
In the right-hand panel of the window, the overall mean (black), selected mean (red) and
unselected mean (blue) are represented graphically.
In this case, there is strong evidence that food insecurity is less severe in the Central Valleys
compared to the rest of the state.
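The reported F-statistic can be reproduced from the group summaries alone: with a single dummy variable, F equals the square of the pooled-variance two-sample t-statistic. A quick check in Python (the small gap relative to 64.99 is due to the rounded means and standard deviations):

n1, m1, s1 = 121, 17.65, 7.74     # selected: count, mean, standard deviation
n2, m2, s2 = 449, 29.09, 15.07    # unselected
pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t = (m2 - m1) / (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
F = t**2                          # roughly 65, matching the reported statistic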
the averages chart on the right. Note that the brushing is not based on the altitude variable,
but is purely a move over space. However, by including a different variable, one may be able
to discover what lies behind the spatial structural differences.
The variable under consideration in the averages chart is ppov_20. The first selection
contains 213 observations with a mean of 75.42, compared to a mean of 78.46 for the rest.
The test on the difference between the means is not significant at p = 0.052. Thus, this initial selection of higher-elevation locations is not significantly different from the rest of the state.
In Figure 7.22, the selection is moved to the center of the state, with 215 selected observations,
yielding a mean of 73.69. The unselected mean is 79.53. For this selection, the test rejects
the null hypothesis with a p-value of 0.000, suggesting strong spatial heterogeneity.
Finally, in Figure 7.23, the selection is moved even further to the east. The 87 selected
observations have a mean of 79.01, whereas the complement has a mean of 77.02. The test
on difference between the means is not significant at p = 0.348.
By moving the brush over different regions of the map, the extent of spatial structural
instability can be assessed. In addition, by using a map for a different variable (such as
altitude here), one can possibly gain insight into factors that may be behind the spatial
structural instability. However, for this to be meaningful, one has to make sure that sufficient
observations are contained in each selection.
In this chapter, I switch to a full multivariate exploration of spatial data, where the focus
is on the potential interaction among multiple variables. This is distinct from a univariate
(or even bivariate) analysis of multiple variables, where the properties of a distribution
are considered in isolation. The goal is to discover potential pathways of interaction. For
example, in a bivariate analysis, one may have found a strong correlation between lung
cancer and socio-economic factors, but after controlling for smoking behavior (itself strongly
correlated with SES), this relationship disappears.
The methods considered share the same objective, i.e., how to represent relationships in
higher dimensions on a two-dimensional screen (or piece of paper). Three techniques, the bubble chart, the three-dimensional scatter plot and the conditional plot, are limited to the analysis
of three to four variables. True multivariate analyses for several variables (more than four)
can be carried out by means of the parallel coordinate plot (PCP). Each is considered in
turn. I continue to use the Oaxaca Development data set for illustrations.
As in the previous chapter, the methods covered are inherently non-spatial, but by means of
linking and brushing with one or more maps, they can be spatialized.
Before focusing on the specific techniques, some special features of multivariate analysis are
outlined in a brief discussion of the curse of dimensionality.
– conditional histogram
– conditional box plot
– conditional map
– conditional plot option
– changing the condition breakpoints
• Explore > Parallel Coordinate Plot
– changing the classification theme for the PCP
– changing the order of the axes
– brushing the PCP
Toolbar Icons
Figure 8.1: Bubble Chart | 3D Scatter Plot | Parallel Coordinate Plot | Conditional Plot
weights, and making sure the transformation is set to raw. The nearest neighbor distances are the third
column in a weights file created for k = 1. See Chapter 11 for a detailed discussion.
random numbers generated), the smallest nearest neighbor distance in the x dimension is
0.009. In two dimensions, between x and y, the smallest distance is 0.094, and in three
dimensions, it is 0.146. All distances are in the same units, since the observations fall within
the unit interval for each variable.
Two insights follow from this small experiment. One is that the nearest neighbors in a lower
dimension are not necessarily also nearest neighbors in higher dimensions. For example, in
the x dimension, the shortest distance is between points 5 and 8, whereas in the x-y space,
it is between 1 and 4. The other property is that the nearest neighbor distance increases
with the dimensionality. In other words, more of the attribute space needs to be searched
before neighbors are found. This becomes a critical issue for searches in high dimensions,
where most lower dimensional techniques (such as k-nearest neighbors) become impractical.
Further examples of some strange aspects of data in higher dimensional spaces can be found
in Chapter 1 of Lee and Verleysen (2007).
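The experiment is easy to replicate; a numpy sketch (the seed is arbitrary, so the exact distances will differ from the values quoted above):

import numpy as np

rng = np.random.default_rng(42)
pts = rng.uniform(size=(10, 3))     # 10 points in the unit cube

def min_nn_distance(p):
    # smallest nearest-neighbor distance among the rows of p
    d = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)     # exclude self-distances
    return d.min()

for k in (1, 2, 3):                 # x; x-y; x-y-z
    print(k, min_nn_distance(pts[:, :k]))

Since projecting onto fewer coordinates can only shrink a distance, the minimum nearest-neighbor distance grows (weakly) with the number of dimensions, illustrating the second insight above.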
Figure 8.7: Bubble chart – default settings: education, basic services, extreme poverty
The introduction of the extra variable allows for the exploration of interaction. The point of
departure (null hypothesis) is that there is no interaction. In the plot, this would be reflected
by a seeming random distribution of the bubble sizes among the scatter plot points. On the
other hand, if larger (or smaller) bubbles tend to be systematically located in particular
subareas of the plot, this may suggest an interaction. Through the use of linking and brushing
with the map, the extent to which systematic variation in the bubbles corresponds with
particular spatial patterns can be readily investigated, in the same way as illustrated in the
previous chapter.
The bubble chart was popularized through the well-known Gapminder web site, where
it is used extensively, especially to show changes over time.2 This aspect is currently not
implemented in GeoDa.
8.3.1.1 Implementation
The bubble chart is invoked from the menu as Explore > Bubble Chart, and from the
toolbar by selecting the left-most icon in the multivariate group of the EDA functionality,
shown in Figure 8.1. This brings up a Bubble Chart Variables dialog to select the
variables for up to four dimensions: X-Axis, Y-Axis, Bubble Size and Bubble Color.
To illustrate this method, three variables from the Oaxaca Development sample data set are
selected: peduc_20 (percent population with an educational gap in 2020) for the X-Axis,
pserv_20 (percent population without access to basic services in the dwelling in 2020) for
the Y-Axis, and pepov_20 (percent population living in extreme poverty in 2020) for
both Bubble Size and Bubble Color. The default setting uses a standard deviational
diverging legend (see also Section 5.2.3) and results in a somewhat unsightly graph, with
circles that typically are too large, as in Figure 8.7.
Before going into the options in more detail, it is useful to know that the size of the circle can
be readily adjusted by means of the Adjust Bubble Size option. With the size significantly
reduced, a more appealing Figure 8.8 is the result.
2 https://www.gapminder.org/tools/#$chart-type=bubbles&url=v1
Figure 8.8: Bubble chart – bubble size adjusted: education, basic services, extreme poverty
As mentioned, the null hypothesis is that the size distribution (and matching colors) of the
bubbles should be seemingly random throughout the chart. In the example, this is clearly
not the case, with high values (dark brown colors and large circles) for pepov_20 being
primarily located in the upper right quadrant of the graph, corresponding to low education
and low services (high values correspond with deprivation). Similarly, low values (small
circles and blue colors) tend to be located in the lower left quadrant. In other words, the
three variables under consideration seem to be strongly related.
Given the screen real estate taken up by the circles in the bubble chart, this is a technique
that lends itself well to applications for small to medium sized data sets. For larger size data
sets, this particular graph is less appropriate.
Specifically, the Univariate tab with ASSIGN allows one to set a variable equal to a constant.
Figure 8.9: Bubble chart – no color: education, basic services, extreme poverty
With peduc_20 for the X-Axis and ppov_20 (percentage population living in poverty)
as the Y-Axis, Bubble Size is now set to the constant Const, with the categorical variable
Region for Bubble Color.4 At first, this yields a rather meaningless graph, based on the
default standard deviation classification. Changing Classification Themes to Unique
Values results in a scatter plot of the two variables of interest, with the bubble color
corresponding to the values of the categorical variable. In the example, this is Region, as
in Figure 8.10.
In this case, there seems to be little systematic variation along the region category, in line
with our starting hypothesis.
4 The regional classifications are: (1) Cañada; (2) Costa; (3) Istmo; (4) Mixteca; (5) Papaloapan; (6)
The use of a bubble chart to address structural change is particularly effective when more
than two categories are involved. In such an instance, the binary selected/unselected logic of
scatter plot brushing no longer works. Using the bubble chart in this fashion allows for an
investigation of structural changes in the bivariate relationship between two variables along
multiple categories, each represented by a different color bubble. This forms an alternative
to the conditional scatter plot, considered in Section 8.4.2.
8.3.2.1 Implementation
The three-dimensional scatter plot method is invoked as Explore > 3D Scatter Plot from
the menu, or by selecting the second icon from the left on the toolbar depicted in Figure 8.1.
This brings up a 3D Scatter Plot Variables selection dialog for the variables corresponding
to the X, Y and Z dimensions. Again, peduc_20, pserv_20 and pepov_20 are used.
The corresponding initial default data cube is as in Figure 8.11, with the Y-axis (pserv_20)
as vertical, and the X (peduc_20) and Z-axes (pepov_20) as horizontal. Note that the
axis marker (e.g., the X etc.) is at the end of the axis, so that the origin is at the unmarked
side, i.e., the lower left corner where the green, blue and purple axes meet.
Checking this box creates a small red selection cube in the graph. The selection cube can be
moved around with the command key pressed, or can be moved and resized by using the
controls to the left, next to X:, Y: and Z:, with the corresponding variables listed.
The first set of controls (to the left) move the box along the matching dimension, e.g., up or
down the X values for larger or smaller values of peduc_20, and the same for the other two
variables. The slider to the right changes the size of the box in the corresponding dimension
(e.g., larger along the X dimension). The combination of these controls moves the box around
to select observation points, with the selected points colored yellow.
The most effective way to approach this is to combine moving around the selection box and
rotating the cube. The reason for this is that the cube is in effect a perspective plot, and one
cannot always judge exactly where the selection box is located in three-dimensional space.
A selection is illustrated in Figure 8.13, where only a few out of seemingly close points in
the data cube are selected (yellow). By further rotating the plot, one can get a better sense
of their location in the point cloud.
As in all other graphs, linking and brushing is implemented for the 3D scatter plot as well.
Figure 8.14 shows an example of brushing in the six quantile map for ALTID and the
associated selection in the point cloud. The assessment of the match between closeness
in geographical space (the selection in the map) and closeness in multivariate attribute
space (the point cloud) is a fundamental notion in the consideration of multivariate spatial
correlation, discussed in Chapter 18.
Similar to the bubble chart, the 3D scatter plot is most useful for small to medium sized
data sets. For larger numbers of observations, the point cloud quickly becomes overwhelming
and is no longer effective for visualization.
to adjust the Rendering quality of points, the Radius of points, as well as the Line
width/thickness and Line color. These are fairly technical options that basically affect
the quality of the graph and the speed by which it is updated. In most situations, the default
settings are fine.
Finally, under the Data button, there are the Show selection and neighbors and Show
connection line options. These items are relevant when a spatial weights matrix has been
specified. This is covered in Chapter 10.
categories), or suggesting an interaction effect between the conditioning variable(s) and the
statistic under scrutiny.
In GeoDa, conditional graphs are implemented for the histogram, box plot, scatter plot and
thematic map.
8.4.1 Implementation
A conditional plot is invoked from the menu by selecting Explore > Conditional Plot.
This brings up a list of four options: Map, Histogram, Scatter Plot and Box Plot. The
same list of four options is also created after selecting the Conditional Plot icon, the
right-most in the multivariate EDA subset in the toolbar shown in Figure 8.1. In addition,
the conditional map can also be started as the third item from the bottom in the Map
menu.
Next follows a dialog to select the conditioning variables and the variable(s) of interest. The
conditioning variables are referred to as Horizontal Cells for the x-axis, and Vertical Cells for the y-axis. It is not necessary to choose both: conditioning can also be carried out for a single conditioning variable, on either the horizontal or the vertical axis alone.
The remaining columns in the dialog are either for a single variable (histogram, box plot,
map), or for both the Independent Var (x-axis) and the Dependent Var (y-axis) in
the scatter plot option.
The conditioning variable in the vertical dimension is ALTID. The default classification is
to use Quantile with three categories for both conditioning variables.
This yields the plot shown in Figure 8.15. Something clearly is wrong, since there turns
out to be only one subcategory for the horizontal conditioning variable. This is because a
3-quantile classification of the indicator variable does not provide meaningful categories.
This will be dealt with next.
However, first consider the substantive interpretation of this graph. The three plots show a
strong positive and significant relationship in each case (significance indicated by **). In
other words, less access to education is linearly related to less food security, more or less in
line with our prior expectations. The lack of difference between the graphs would suggest
that the relationship is stable across all three ranges for altitude, the conditioning variable.
Figure 8.16 results after changing the Horizontal Bins Breaks option to Unique Values.
At this point, the two categories for the horizontal conditioning variable correspond to the
values of 0 and 1 for the population change indicator variable.
Conditioning on this variable does provide some indication of interaction. For pch12 = 1,
the linear relationship between peduc_20 and pfood_20 is strong and positive, and not
affected by altitude. However, for pch12 = 0, there does not seem to be a significant slope
for any of the subgraphs, suggesting a lack of relationship between the two variables in those
municipalities suffering from population loss. The change in sign of the slope along altitude
is not meaningful since the slopes are not significant.
If only the horizontal classification had been used as a condition, the result would be the
same as a Regimes Regression in a standard scatter plot. However, in order to consider
the simultaneous conditioning on two variables, six separate selection sets would be required
to create six individual scatter plots with regimes, which is not very practical. In addition,
the double conditioning allows for the investigation of more complex interactions among the
variables.
As in any EDA exercise, considerable experimentation may be needed before any meaningful
patterns are found for the right categories of conditioning variables. Of course, this runs the
danger that one ends up finding what one wants to find, an issue which is revisited in Chapter 21.
8.5.1 Implementation
The PCP functionality is invoked from the menu as Explore > Parallel Coordinate
Plot, or by means of the PCP toolbar icon, the third item from the left in Figure 8.1.
This brings up a Parallel Coordinate Plot variable selection interface that consists of
two columns: Exclude and Include. Variables are chosen by moving them to the Include
column by means of arrow buttons.
Four variables are selected to illustrate the PCP: peduc_20, pserv_20 and pepov_20,
as well as pss_20 (percent of population without social security). With the default settings,
the result is as in Figure 8.19. Each variable is represented by a horizontal axis, with the
observations as lines connecting points on each axis. The variable name is listed at the left,
together with the range of the variable, its mean and standard deviation. Note that the
axes are represented with equal lengths, which corresponds to the range for each individual
variable. This implies that the distance between points on each axis does not necessarily
correspond with the same difference in value for each variable.
Shape, Color, Save Selection, Copy Image to Clipboard and Save Image As. Note that Classifi-
cation Themes always pertains to the variable that was originally listed on top, even when the order of
axes is later changed.
offers a way to make all axes lengths equivalent by selecting View Standardized Data.
As a result, the data points on each axis are given in standard deviational units and become
directly comparable between the different variables, as in Figure 8.20. Observations that are
outside the −2 to +2 range can be considered to be outliers.
This graph also illustrates another feature of the PCP. By grabbing the small circular handle
on the left of an axis, it can be moved up or down in the graph, changing the order of the
variables. For example, in Figure 8.20, the axis for pepov_20 has been moved up to be
above pserv_20. Realigning the axes in this manner can often bring out patterns in the
data. However, in practice, it is not always clear which is the optimal alignment.
which these lines move together. In the example, several observations follow a
similar path for pss_20, pserv_20 and pepov_20, but not for peduc_20. Insight into
any corresponding spatial pattern can be obtained by linking the selection with a map, as in
the example, using the six quantile map for ALTID.
A more explicit spatial perspective is taken by brushing a map linked to the PCP. In Figure
8.22, this is again implemented with the same six quantile map. The selection in the map
suggests that several locations within close geographical proximity also take on similar values
for a subset of the variables (close lines), but not for all. This illustrates some of the practical
difficulties associated with the concept of multivariate spatial correlation.
the observation lines closely track for the three variables. In the data cube on the right, the
same three observations are highlighted in yellow (with the red arrow pointing to them), very
close together near the origin of the axes. This demonstrates the equivalence of lines being
close together in the PCP and observation points being close together in multi-attribute
variable space. Clearly, for more than three variables, only the PCP remains a viable method
to assess such closeness.
Figure 8.24 illustrates the concept of an outlier in the PCP. Now, the selection rectangle
is over the highest value for pepov_20 (in the lowest axis), an observation with extreme
poverty. As the graph shows, this also corresponds to an extreme value for pserv_20, but
much less so for peduc_20. In the data cube, the selected observation, highlighted in yellow
(with the red arrow pointing to it), is indeed somewhat removed from the rest of the data
cloud.
The number of variables that can be visualized with a PCP is only constrained by screen real
estate. However, as pointed out, for larger data sets, the graph can become quite cluttered,
making it less practical.
9
Space-Time Exploration
While the primary focus of GeoDa is on the exploration of cross-sectional or static data, it
also includes limited functionality to investigate space-time dynamics through the use of the
Time Editor, Time Player and Averages Chart.
This allows for the evolution of a variable to be shown over time through maps and graphs,
in a form of comparative statics. One limitation of this approach is that it is cross-section
oriented. More precisely, the different time periods are considered as separate cross-sectional
variables (one cross-section for each time period). While this results in quite a bit of flexibility
when it comes to grouping variables, it also means there is no inherent time-awareness, nor
a dedicated panel data structure.
After explaining the approach to space-time exploration through the Time Editor and
Time Player, I revisit the use of the Averages Chart for treatment effect analysis. In
contrast to the discussion in Chapter 7, here a dynamic analysis is possible, in that spatial
structural change can be assessed at two points in time through a difference in difference
approach.
To illustrate these methods, I continue to use the Oaxaca Development sample data set,
particularly the variables from the Mexican census at three different points in time: 2000,
2010 and 2020.
• Scatter Plot
– time-wise autoregressive scatter plot
• Map
– moving a choropleth map through time using custom categories
• Averages Chart
– difference in means test
– difference in difference (DID) analysis
– saving the dummy variables
Toolbar Icons
for the group is initially derived from the variable name entered but also typically requires
some further editing.
Individual variables can be moved up or down in the sequence (the sequence determines
the order of the comparative statics graphs and maps) by selecting the variable and using
the buttons at the bottom of the panel (Move Up and Move Down). In addition, any
variable can be moved out of the grouping, back to the ungrouped variables list, by selecting
it and clicking on the left arrow buttons (<).
When all the variables are included under New Group Details, in the correct order and
with the correct Time label, another arrow key (to the right of the variable list) adds
the group under the Grouped Variables list with the name as specified in the central
panel. At this point, the process can be repeated for additional variables. However, for each
grouped variable, the same number of original variables must be combined. In other words,
the grouping always pertains to a fixed number of time periods, set when the first grouped
variable is created.
The example in Figure 9.2 is for the Oaxaca data set. It shows the situation after the variable
p_P614NS (percentage of the population between 6 and 14 years of age that received
no schooling) has been created by grouping p_P614NS00 (for 2000), p_P614NS10 (for
2010) and p_P614NS20 (for 2020). The Time labels have been set to 2000, 2010 and
2020. A new grouped variable is in the process of being created using p_PJOB00 for
2000 (percentage of the population over 12 years of age with a job). Note how the entry for
name is not very useful at this point, and should be edited to p_PJOB, for example. The
next variable to be selected should be p_PJOB10, for 2010.
To support a range of illustrations, two more groups are created: p_PHA (percentage of
the population with access to health services), and p_CAR (percentage private inhabited
dwellings with a car or van).
The grouped variables are shown in the data table with their new name, followed by the first time period in parentheses. For example, in Figure 9.3, the entries are shown for p_PJOB
and p_PHA.
Note that the grouped variables have not been dropped as individual entries from the table; they are just no longer displayed as such. The new label in parentheses indicates that they have been grouped.
Finally, as mentioned, GeoDa is agnostic to what variables are grouped and in which order
they are arranged, so this procedure can also be used to group variables that do not necessarily
correspond to different time periods. For example, this may be useful when constructing a
box plot graph that shows the box plots for multiple variables in one window.
cross-sectional observation, the time period and time period dummy variables.2 The time
grouped variables are stacked as cross-sections for each successive time period.
The space-time data file allows for pooled cross-section and time series analyses, but does
not work with a corresponding map. In other words, the data can be loaded into a table for
analysis (and, if appropriate, spatial weights included in the weights manager), but no map
is available.
In addition, if a spatial weights file is active, a corresponding space-time weights GAL file is
created as well, as detailed in Section 11.5.4.
2 When a spatial weights file is active, its ID variable is used as the cross-sectional identifier. In the
coefficient, i.e., how the variable in one period is related to its values in a previous period.
More formally, this is the slope coefficient in a linear regression of yt on yt−1 .
Since the time periods for the variables are available from a drop-down list in the variable
selection dialog, it is easy to change the time selection for both X and Y variables. In Figure
9.8, this is illustrated for p_PJOB. The Independent Var X is set to Time 2000, and
the drop-down list for Dependent Var Y is checked for Time 2010.
This results in the scatter plot shown in the left-hand panel of Figure 9.9, including both
a linear fit and a LOWESS smoother (with the bandwidth changed to 0.4). The slope of
0.288, while highly significant, only results in an overall fit of 10%. The LOWESS smoother
illustrates why this is the case. There is a clear break in the slope, with a positive slope
(steeper than the overall 0.288) for values of p_PJOB in 2000 up to roughly 52%, but
beyond that it turns to a negative slope. So, for municipalities that already had a high percentage of employed people, the relationship between 2000 and 2010 is negative, suggesting
a loss of employment for those municipalities. In an actual application, this structural break
should be further investigated by means of a Chow test. Also, its spatial footprint can be
assessed by means of linking and brushing with a map (see Sections 7.3.1.1 and 7.5.2 in
Chapter 7).
After adjusting the Time for the X-variable to 2010 and for the Y-variable to 2020, the
scatter plot in the right-hand panel of Figure 9.9 results. Now, the slope is much steeper,
and the fit quite a bit better (R² of 24.9%). However, there is still considerable evidence of
structural instability, as emphasized by the S-shaped LOWESS fit (again with bandwidth at
0.4). In an actual application, this would need to be investigated further.
Unlike the case for univariate graphs (and maps), the Time Player does not
necessarily have the desired effect. It is essentially a univariate tool, so it only changes the
Time period for the dependent variable (Y). The X-variable remains unchanged, rather
than updating both axes to move the time lag forward or backward.
each map legend, the maps represent very different realities. In 2000, the percentage car
ownership ranged from 0% to 46.8%, whereas in 2020, it varied between 1.1% and 65.8%, a difference of nearly 20 percentage points for the maximum.
In order to better represent these absolute changes, a set of custom categories must be
created by means of the Category Editor (see Section 4.6). In the right-hand panel of
Figures 9.10–9.12, a six-category equal interval classification is employed, going from 0 to 66
(the map legend lists car_map as the classification method).
The result is very different. The paucity of car ownership in the first period is emphasized
(there are only five observations over 33%) and the maps show a gradual increase over time,
with growing prevalence of the darker colors (from the center outward and to the east).
Figure 9.14: Difference in means, selected and unselected, static, in 2000 and 2020
This allows for two different types of applications. One is the same Difference-in-Means
Test as considered before, applied statically to the variable at different points in time.
A second is a more complete treatment effect analysis, implemented through the Run
Diff-in-Diff Test option. Each is considered in turn.
test, the evolution of the target variable for a control group is taken into account as well.
This is the basis for a difference in difference analysis. The problem is that the counterfactual
is not actually observed. Its behavior is inferred from what happens to the control group.
A critical assumption is that in the absence of the policy, the target variable follows parallel
paths over time for the treatment and control group. In other words, the difference between
the treatment and control group at period 1 would be the same in period 2, in the absence
of a treatment effect.
The counterfactual is thus a simple trend extrapolation for the treated group. In the absence of a treatment effect, the value of the target variable for the treated group in period 2 should equal its value in period 1 plus the change observed for the control group between period 1 and period 2, i.e., the trend extrapolation. The difference between that extrapolated value for the counterfactual and the actual value of the target variable at period 2 is then the estimate of the treatment effect.
Formally, this can be expressed as a simple linear regression of the target variable stacked
over the two periods, using a dummy variable for the treatment-control dichotomy (the space
dummy, S, equal to 1 for the Selected), a second one for the before and after dichotomy (the
time dummy, T , equal to 1 for Period 2) and a third dummy for the interaction between
the two (i.e., treatment in the second period, S × T ):3
$$y_t = \beta_0 + \beta_1 S + \beta_2 T + \beta_3 (S \times T) + \epsilon,$$
with the β as the estimated coefficients and ε as a random error term. The treatment effect is the
coefficient β3 . The coefficient β0 corresponds to the mean in the control group (Unselected)
in 2000 (14.74 in Figure 9.14). β1 is the pure space effect, the difference in means between
Selected and Unselected in 2000 (-4.29 in Figure 9.14). Finally, β2 is the time trend in
3 A full discussion of the econometrics of difference in difference analysis is beyond the scope of this
chapter and can be found in Angrist and Pischke (2015), among others.
the control group, which has not yet been computed in the cross-sectional analysis (but see
Figure 9.17).
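A minimal numpy sketch of this dummy-variable regression (illustrative only; in GeoDa the Run Diff-in-Diff Test option, discussed next, produces the full regression output):

import numpy as np

def diff_in_diff(y1_ctrl, y1_treat, y2_ctrl, y2_treat):
    # stack the four groups: periods 1 and 2, control and treated (numpy arrays)
    y = np.concatenate([y1_ctrl, y1_treat, y2_ctrl, y2_treat])
    S = np.concatenate([np.zeros_like(y1_ctrl), np.ones_like(y1_treat),
                        np.zeros_like(y2_ctrl), np.ones_like(y2_treat)])
    T = np.concatenate([np.zeros_like(y1_ctrl), np.zeros_like(y1_treat),
                        np.ones_like(y2_ctrl), np.ones_like(y2_treat)])
    # y = b0 + b1*S + b2*T + b3*(S*T) + e
    X = np.column_stack([np.ones_like(y), S, T, S * T])
    return np.linalg.lstsq(X, y, rcond=None)[0]   # [b0, b1, b2, b3]

The last coefficient returned is the treatment effect β3; the first three map onto the control-group mean, the pure space effect and the time trend just described.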
9.4.2.1 Implementation
The difference in difference implementation uses the same Averages Chart interface as
before, but with settings that disable the difference in means test. As shown in Figure 9.15,
the two groups are Selected and Unselected, but Period 1 is 2000, with Period 2 as
2020. Even though the two means are listed and the graph is shown in the right-hand panel,
the results of the difference in means test are listed as 0.
The analysis is carried out by selecting the Run Diff-in-Diff Test button. This yields the
standard GeoDa regression output, shown in Figure 9.16.4 The overall fit is quite good, with
an R² of 83%, and all coefficients are highly significant.
4 The regression functionality of GeoDa is not considered in this book. For a detailed treatment, see
treatment effects analysis, see, among others, Kolak and Anselin (2020), Reich et al. (2021) and Akbari et al.
(2023).
Part III
Spatial Weights
10
Contiguity-Based Spatial Weights
but this is only relevant in a discussion of global spatial autocorrelation statistics, which is
postponed until Part IV.
There are different criteria that can be used to reduce the number of neighbors for each
observation. In this chapter, approaches based on the notion of contiguity are considered, i.e.,
sharing a common border. In spatial analysis, this is typically based on geographic borders,
although as discussed in Chapter 11, the concept is perfectly general.
In the remainder of this section, the concept is introduced more formally first, followed
by a brief discussion of the two most common forms, i.e., rook and queen contiguity. This
is followed by a treatment of the concept of higher order contiguity and some practical
considerations.
The spatial weights wij are non-zero when i and j are neighbors, and zero otherwise. By
convention, the self-neighbor relation is excluded, so that the diagonal elements of W are
zero, wii = 0.3
To make the concept of a spatial weights matrix more concrete, a spatial layout is used with
six polygons, shown in Figure 10.2. Two spatial units are defined as neighbors when they
share a common border, i.e., when they are contiguous. For example, in the figure, unit 1
shares a border with units 2, 4 and 5, and hence they are neighbors.
3 An exception to this rule are the diagonal elements in kernel-based weights, which are considered in
Chapter 12.
The same neighbor structure can also be represented in the form of a network or graph, as
in Figure 10.3. Here, each spatial unit becomes a node in the network, and the existence of
a neighbor relation is represented by an edge or link connecting the respective nodes. Again,
node 1 is connected with a link to nodes 2, 4 and 5. It is very important to keep in mind
the generality of this network representation, since it applies well beyond purely geographic
concepts of contiguity (see Chapter 11).
In its simplest form, the spatial weights matrix expresses the existence of a neighbor relation
in binary form, with weights 1 and 0. For the layout in Figure 10.2, this yields a 6 × 6 matrix:
$$W = \begin{bmatrix}
0 & 1 & 0 & 1 & 1 & 0 \\
1 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 0 & 0 & 1 & 0 \\
1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix} \tag{10.1}$$
In the first row of this matrix, matching unit 1, the non-zero elements correspond to columns
(units) 2, 4 and 5.
In sum, the polygon boundaries in the map layout, the presence of edges in the network
and the non-zero weights in the matrix are all equivalent representations of the topology or
spatial arrangement of the data.
While naturally associated with areal units represented as polygons, the notion of contiguity
can also be applied in situations where the observations are represented as points, see Section
11.5.1.
Finally, it is important to note that even though it is referred to as a spatial weights matrix, no full matrix is actually used in software operations. Spatial weights are typically very sparse matrices, and this sparsity is exploited by using specialized data structures (there is no point in storing lots and lots of zeros).
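For example, the contiguity structure of Figure 10.2 reduces to a small set of neighbor lists, which is in essence how a sparse weights object stores it; a sketch using the open-source libpysal package for illustration:

from libpysal.weights import W

# neighbor lists for the six units in Figure 10.2: only the non-zero
# entries of matrix (10.1) are stored
neighbors = {1: [2, 4, 5], 2: [1, 4, 5], 3: [5, 6],
             4: [1, 2, 5], 5: [1, 2, 3, 4], 6: [3]}
w = W(neighbors)
print(w.n, w.cardinalities)   # 6 units; number of neighbors per unit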
As a result, each row sum of the new matrix equals one. Also, the sum of all weights, $S_0 = \sum_i \sum_j w_{ij}$, equals n, the total number of observations.4
For the layout in Figure 10.2, the matching row-standardized weights matrix is:
$$W_s = \begin{bmatrix}
0 & 1/3 & 0 & 1/3 & 1/3 & 0 \\
1/3 & 0 & 0 & 1/3 & 1/3 & 0 \\
0 & 0 & 0 & 0 & 1/2 & 1/2 \\
1/3 & 1/3 & 0 & 0 & 1/3 & 0 \\
1/4 & 1/4 & 1/4 & 1/4 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0
\end{bmatrix} \tag{10.2}$$
Importantly, whereas the original binary weights matrix was symmetric, this is no longer the
case with the row-standardized weights. For example, the element in row 1-column 5 equals
1/3, whereas the element in row 5-column 1 equals 1/4. This has important consequences for
some operations in spatial regression analysis, but less so in an exploratory context (but see
Section 12.3.1). Also, the weights are no longer equal. In fact, they are inversely related to
the number of neighbors. With more neighbors (say, k), the weight given to each neighboring
observation (1/k) becomes smaller. More importantly, when there is only one neighbor (as in
row 6, column 3 in the example), the weight remains 1. This may lead to counterintuitive results when computing spatial transformations, as in Chapter 12.
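Row-standardization of such a neighbor-list representation is a one-liner; a sketch (in libpysal, setting w.transform = "r" on a weights object accomplishes the same thing):

# each of unit i's k_i neighbors receives weight 1/k_i; unit 6, with a
# single neighbor, keeps a weight of 1
neighbors = {1: [2, 4, 5], 2: [1, 4, 5], 3: [5, 6],
             4: [1, 2, 5], 5: [1, 2, 3, 4], 6: [3]}
row_std = {i: [1.0 / len(nbrs)] * len(nbrs) for i, nbrs in neighbors.items()}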
matrix:
$$W = \begin{bmatrix}
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0
\end{bmatrix} \tag{10.3}$$
The structure of the weights matrix is characterized by a band of ones. In addition, there is
a strong effect of boundary units in this small data set. Internal units have four neighbors,
but corner units only have two, and the other boundary units only have three.
A second criterion to define contiguity, less frequently used, is based on common vertices
(corners) as the convention to identify neighbors. This is referred to as bishop contiguity.
Due to its lack of adoption in practice, it is not further considered here.
Finally, the queen criterion combines rook and bishop. As a result, neighbors share either
a common edge or a common vertex. This yields eight neighbors for the non-boundary
units. For example, as shown in Figure 10.6, the central cell 5 has all the other cells being
neighbors.
The matching 9 × 9 binary contiguity matrix is much denser than before:
$$W = \begin{bmatrix}
0 & 1 & 0 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 \\
1 & 1 & 1 & 1 & 0 & 1 & 1 & 1 & 1 \\
0 & 1 & 1 & 0 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0
\end{bmatrix}$$
Now, the corner cells have three neighbors, and the other boundary cells have five.
The concepts of rook and queen contiguity can easily be extended to other regular tessellations,
such as hexagons.5 In addition, they apply to irregular lattice layouts as well. In this context,
rook neighbors are those spatial units that share a common edge, and queen neighbors
those that share either a common edge or a common vertex. As a result, spatial weights
based on the queen criterion will always have at least as many neighbors as a corresponding
rook-based weights matrix.
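For readers who want to experiment outside GeoDa, the open-source libpysal library (part of PySAL) includes a lattice constructor for exactly this case; a small sketch, assuming libpysal is installed:

```python
from libpysal.weights import lat2W

# Rook and queen contiguity for a 3 x 3 regular grid
w_rook = lat2W(3, 3, rook=True)
w_queen = lat2W(3, 3, rook=False)

# The central cell (id 4 with zero-based, row-major ids) has
# four rook neighbors but eight queen neighbors
print(sorted(w_rook.neighbors[4]))   # [1, 3, 5, 7]
print(sorted(w_queen.neighbors[4]))  # [0, 1, 2, 3, 5, 6, 7, 8]
```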
In practice, whether a common segment is categorized as a point (vertex) or line segment
(edge) depends on the precision of the geocoding in a GIS. Due to generalization (a process
of simplifying the detail in a spatial layer), some small line segments may be represented
as points in one GIS and not in another. In addition, similar quirks in the representation
of polygon boundaries may result in (very) small empty spaces between them. In a naive
approach to defining common boundaries, such instances would fail to identify neighbors.
Therefore, some tolerance may be required to make the weights creation robust to such
characteristics.
5 The rook and queen neighborhood structures correspond to the so-called von Neumann and Moore neighborhoods in cellular automata. The main difference is that here, the central cell is not
included in the neighborhood concept, whereas it is in cellular automata.
Using this logic, candidates to be a second-order neighbor would be any first-order neighbor
to another observation that is already a first-order neighbor. However, this would be limited
to only those locations that are not already first-order neighbors. Parenthetically, it should
be noted that the concept of higher order contiguity is similar to the notion of reachability
in social networks (e.g., Newman, 2018).
In the layout shown in Figure 10.7, cells 4 and 3 are highlighted. Cell 4 has three first-order
neighbors, immediately surrounding it: 1, 2 and 5. Cell 3 is a first-order neighbor to 5, hence
3 is a second-order neighbor to 4.
In the graph representation of the contiguity structure in Figure 10.8, second-order contiguity
corresponds to a path of length 2 from node 4 to node 3, illustrated by the dashed lines.
This path requires the traversal of two edges in the graph, [4-5] and [5-3], and is thus of
length 2. As it turns out, the number of edges separating a pair of nodes corresponds to the
order of contiguity.
As such, identifying paths of length 2 between nodes is not sufficient to find the correct
second-order neighbors, since it does not preclude redundant or circular paths. For example,
a closer examination of Figure 10.8 reveals that 1 and 5 can be reached in two steps from 4
as well (e.g., 4-1-2, 4-5-1). In fact, multiple paths of length 2 can exist between two nodes
(e.g., 4-1-2 and 4-5-2).
These paths illustrate the complexity of the problem of finding higher-order neighbors. For
example, nodes 2 and 5 cannot be both first-order and second-order neighbors of node 4, so
a strict application of the graph-theoretical approach is not sufficient in the spatial weights
case. The redundant paths must be eliminated in order to yield the proper
second-order neighbor, node 3.
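The elimination of redundant paths can be expressed compactly with boolean matrix operations. The sketch below, assuming only numpy, applies this to the six-unit layout of Figure 10.2 and recovers node 3 as the only proper second-order neighbor of node 4 (libpysal's higher_order function implements the same logic for its weights objects):

```python
import numpy as np

# Binary first-order contiguity for the six-unit layout of Figure 10.2
W = np.array([
    [0, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
], dtype=int)

first = W > 0
paths2 = (W @ W) > 0                    # some path of length 2 exists
off_diag = ~np.eye(len(W), dtype=bool)  # exclude circular paths back to i
second = paths2 & ~first & off_diag     # drop pairs that are already first-order

# Proper second-order neighbors of unit 4 (index 3): only unit 3
print(np.flatnonzero(second[3]) + 1)    # [3]
```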
If such problems are detected, it may be necessary to go back to the GIS and fix or clean the topology of the data set. Editing of spatial layers is not
implemented in GeoDa, but this is a routine operation in most GIS software.
10.3.1.1 ID variable
The first item to specify in the dialog is the ID Variable. This variable has to be unique
for each observation. It is a critical element to make sure that the weights are connected to
the correct observations in the data table. In other words, the ID variable is a so-called key
that links the data to the weights.
In GeoDa, it is best for the ID Variable to be an integer. In practice, this is often not the case:
even though the identifier may look like an integer value, it may be stored as a string.
For example, the standard identifiers that come with many open data sets (such as the US
census data) are typically character variables. One way to deal with this problem is to use
the Edit Variable Properties functionality in the table to turn a string into an integer,
as shown in Section 2.4.1.1.
In some instances, there is no easy way to identify an ID variable. In that case, the Add ID
Variable button provides a solution: the added ID variable is simply an integer sequence
number that is inserted into the data table (as always, there must be a Save operation to
make the addition permanent).
For the Ceará Zika data set, code7 is used as the ID variable, as indicated in Figure 10.10.
For example, an _r added to the name of the data would suggest rook weights. However, as
outlined in Section 10.3.5, if a Project File is saved, several of the characteristics of the
weights (i.e., its metadata) are stored in that file for later reuse.
The weights are immediately computed and written to the GAL file. At the end of this
operation, a success message appears (or an Error message if something went wrong).
A useful option in the weights file creation dialog is the specification of a Precision
threshold (situated right below the Rook contiguity radio button). In most cases, this is
not needed, but in some instances the precision of the underlying shape file is insufficient to
allow for an exact match of coordinates (to determine which polygons are neighbors). When
this happens, GeoDa suggests a default error band to allow for a fuzzy comparison. In most
cases, this will be sufficient to fix the problem.
After the weights are created, the weights manager becomes populated, as shown in Figure
10.11. The name for the file that was just created is now included under the Weights Name.
In addition, under the item Property, several summary characteristics are listed. This is
further examined in Section 10.4.1.
This is useful in order to trick the cross-sectional data structure of GeoDa to allow for simple
pooled time series/cross section analyses.7 The pooled aspect means that the same coefficients
or model parameters hold for all time periods, which is not necessarily a realistic assumption.
However, it is often the point of departure in an analysis of space-time dynamics.
7 The most common application is to carry out a regression analysis, but this approach can also be used
The critical element in this endeavor is to create a unique space-time ID variable (STID) so
that the stacked cross-sections (one for each time period) can be handled as if they were one
single cross-sectional data set, with a unique ID for each space-time observation. Analogously,
the spatial weights are block-diagonal, with the same weights structure repeated for each
stacked time period.
This is illustrated for the Oaxaca Development data set from Chapter 9. Note that it is
necessary to have a spatial weights file active in the weights manager for this to work
properly. In the example, this is a first-order queen contiguity weights file using mun as the
ID. A snapshot of the corresponding space-time data table, saved as a csv file, is shown in
Figure 10.18. In addition to the cross-sectional identifier mun, there is a unique space-time
identifier STID, as well as an indicator variable for each time period: T_2000, T_2010
and T_2020. The remaining entries consist of the time-enabled (grouped) variables.
The contents of the corresponding GAL file are illustrated in Figure 10.19. The header line
lists 1710 as the number of space-time observations (570 × 3), stoaxaca as the name of the
source file with the data (i.e., the file shown in Figure 10.18), and STID as the key. The
remaining entries follow the same format as before.
The overall pattern is quite symmetric, with a mode of 5 (i.e., most municipalities have 5
rook neighbors), but quite a long tail to the right (i.e., a few municipalities have a much
larger number of neighbors). In addition to the visual inspection, the usual statistics of the
distribution can be added to the bottom of the table by means of the View > Display
Statistics option. The descriptive statistics are the same as listed in the weights manager
summary.
Figure 10.21 illustrates the minor differences between rook and queen contiguity. In the
histogram on the right, for queen contiguity, the municipalities with six neighbors are selected.
In the linked contiguity histogram for rook weights on the left, the matching distribution is
shown. Several of the observations with six neighbors under the queen criterion have only
five or even four neighbors under the rook criterion. This highlights the more inclusive nature of
queen contiguity.
In addition, the contiguity histogram can be linked to any other window, such as the
themeless map in the left-hand panel of Figure 10.22. In the contiguity histogram on the
right (queen contiguity), the two observations with 13 neighbors are selected. A closer
examination of the map identifies these as two very large areal units. Consequently, they
have more neighbors than the average-sized unit.
It is good practice to check the connectivity histogram for any strange patterns, such as
observations with only one neighbor, or neighborless observations (isolates, see Section
11.4.2).
Ideally, the distribution of the cardinalities should be smooth (i.e., no major gaps) and
symmetric, with a limited range. Deviations from symmetry, such as bi-modal distributions
(some observations have few neighbors and some have many) require further investigation.
Any such characteristics will have an immediate effect on the corresponding weights, with
implications for all the computations that use these weights.
The Connectivity Map provides an interactive view of the neighbor structure of each observation. The window header lists the weights matrix to which the connectivity structure
pertains.
In Figure 10.23, this is illustrated using queen contiguity for the Ceará municipalities. As
soon as the pointer is moved over one of the observations, it is selected and its neighbors
are highlighted. Note that this behavior is slightly different from the standard selection in
a map, since it does not require a click to select. Instead, the pointer is in so-called hover
mode. As soon as the pointer moves outside the main map, the complete layout is shown.
In the example, the selection is over the municipality of Santa Quitéria (code7 2312205),
one of the two observations identified as having 13 neighbors. This outcome is listed in the
status bar below the map, as well as the identifying codes for the neighbors. Moving the
pointer to another observation immediately updates the selection and its neighbors.
The Connectivity Map in the weights manager is actually a special case of the Connectivity option in any map. This is considered in more detail in Section 10.4.5.1.
The most dramatic of the options is the Hide Map feature. By checking this, the background
map is removed and only the pure graph structure remains, as in Figure 10.25. This clearly
brings out some interesting features of the graph, such as the observation with a single
neighbor (a single edge in the graph) in the extreme north-east (highlighted in red).
This option supports operations similar to those in a Connectivity Map, but applied
to any current map. The functionality is invoked by selecting Connectivity > Show
Selection and Neighbors from the options menu, in the usual fashion (right click).
With the option checked, the selection feature works the same as in the Connectivity
Map. The main difference between the functionality in the Connectivity Map and the
Connectivity option in any thematic map is that the latter is updated to the current active
spatial weights file. In other words, as the active weights change, the connectivity structure
is adjusted. In contrast, when selecting the Connectivity Map from the weights manager,
there is only one such map, tied directly to the weights file from which it was invoked.
With the Show Selection and Neighbors option checked, it becomes possible to Change
Outline Color of Neighbors, to Change Fill Color of Neighbors, and implement
other cosmetic adjustments to the look of the selected polygons.
In this second chapter devoted to spatial weights, the focus is on weights that are derived
from a notion of distance. Intrinsically, this is most appropriate for point layers, but it can
easily be generalized to polygons as well, through the use of the polygon centroids.
The chapter starts with a brief overview of distance metrics and the fundamental difference
between points expressed in projected coordinates and points in latitude-longitude decimal
degrees. Whereas for the former the familiar concept of Euclidean distance can be applied,
the latter requires the computation of great circle distance or arc distance. In addition, a
concept of distance in multivariate attribute space is introduced.
This is followed by a review of spatial weights that are based on distance-bands, k-nearest
neighbor weights and the use of generalized concepts of distance.
Next is a discussion of the implementation of the weights functionality in GeoDa, including a
broadening of the strict concept of contiguity to apply to points.
The chapter closes with a discussion of set operations on weights, such as intersection, union
and making symmetric.
The methods are illustrated with a point data set that contains performance measures for
261 Italian community banks for 2011–2017. This data set was used in an analysis of spatial
spillovers in technical efficiency by Algeri et al. (2022). The data are contained in the Italy
Community Banks sample data set. The space-time weights are illustrated with the Oaxaca
Development data.
As mentioned in the previous chapter, to work most effectively with the spatial weights files,
the data sets should first be copied (Save As) to a working directory.
• Space-time weights
• Create a general dissimilarity matrix based on multi-attribute distance
• Investigate commonalities among spatial weights (intersection and union)
GeoDa Functions
• Weight File Creation dialog
– distance-band weights
– k-nearest neighbor weights
– great circle distance option
– block weights option
• Weights Manager
– Connectivity Histogram, Map and Graph
– Intersection, Union
– Make symmetric
• Time Editor
– Save Space-Time Weights
with p as a general exponent. The Minkowski metric itself is not often used in spatial analysis,
but there are two special cases of great interest.
Arguably the most familiar case is for p = 2, i.e., Euclidean or straight-line distance, $d_{ij}$, as
the crow flies:1
$$d_{ij} = \left( |x_i - x_j|^2 + |y_i - y_j|^2 \right)^{1/2},$$
or, in its more familiar form:
$$d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}.$$
where the subscripts d and r refer respectively to decimal degrees and radians, and π =
3.14159 . . .. With $\Delta Lon = Lon_r^{(j)} - Lon_r^{(i)}$ and $\Delta Lat = Lat_r^{(j)} - Lat_r^{(i)}$, the expression for the arc distance is:
$$d_{ij} = R \arccos\left[ \cos(\Delta Lon) \cos Lat_r^{(i)} \cos Lat_r^{(j)} + \sin Lat_r^{(i)} \sin Lat_r^{(j)} \right],$$
or, equivalently:
$$d_{ij} = 2R \arcsin \sqrt{ \sin^2(\Delta Lat / 2) + \cos Lat_r^{(i)} \cos Lat_r^{(j)} \sin^2(\Delta Lon / 2) },$$
where R is the radius of the earth. In GeoDa, the arc distance is obtained in miles with R =
3959, and in kilometers with R = 6371.
However, it should be noted that these calculated distance values are only approximate,
since the radius of the earth is taken at the equator. A more precise measure would take
into account the actual latitude at which the distance is measured. In addition, the earth’s
shape is much more complex than a sphere. In most instances in practice, the approximation
works fine.
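As an illustration, a small Python sketch of the haversine form of the great circle distance, using the same radius constants as GeoDa (the city coordinates below are approximate):

```python
import math

def arc_distance_km(lat1, lon1, lat2, lon2, R=6371.0):
    """Great circle (haversine) distance between two points given in
    decimal degrees; R = 6371 for kilometers (use R = 3959 for miles)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + \
        math.cos(phi1) * math.cos(phi2) * math.sin(dlon / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

# Rome to Milan, roughly 477 km as the crow flies
print(arc_distance_km(41.9028, 12.4964, 45.4642, 9.1900))
```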
Note that the pair (lat, lon) actually pertains to the coordinates as (y,x) and not (x,y).
variable space (considered in Part II of Volume 2). These artificial locations can then be
used to derive spatial weights based on the distance between them.
In the empirical literature, such weights are often referred to as economic weights, based
on a notion of economic distance. Simply put, the distance between observations i and j
expressed in terms of an economic (or social) variable z is:
$$d_{ij} = |z_i - z_j|^{\alpha}.$$
The default cut-off distance conforms to a max-min criterion, i.e., it is the largest of the nearest-neighbor
distances.3
In practice, the max-min criterion often leads to too many neighbors for locations that are
somewhat clustered, since the critical distance is determined by the points that are farthest
apart. This problem frequently occurs when the density of the points is uneven across the
data set, such as when some of the points are clustered and others are more spread out.
This is examined more closely in the illustrations.
For example, in the much cited study by Case et al. (1993), two separate weights matrices
were constructed using economic weights. One was based on the difference in per capita
income between states and the other on the difference in the proportion of their Black
population. In another nice example, Greenbaum (2002) uses the concept of a similarity
matrix to characterize school districts that are similar in per-student income. Instead of
using those weights as such, they are combined with a nearest neighbor distance cut-off,
so that the similarity weight is only computed when the school districts are also within a
given order of geographic nearest neighbors (see also the discussion of weights intersection
in Section 11.6.1).
Weights created as functions of distance are further considered in Chapter 12.
4 W′ is the transpose of the weights matrix W, such that rows of the original matrix become columns in
the transpose.
11.4 Implementation
As mentioned in the chapter introduction, the distance-based weights functionality is
illustrated with a point layer of 261 Italian community banks, shown in Figure 11.2.5
Distance weights are invoked in the same way as contiguity weights, through the Weights
Manager, as outlined in Section 10.3.1. The Weights File Creation dialog with the
Distance Weight button selected is shown in Figure 11.3. The ID Variable is set to idd.
The dialog controls the various options available to create distance-based weights. These
include the choice between geographic distance – the Geometric centroids button under
Variables – or generalized distance – the Variables button (see Section 11.4.4).
The default for geographic distances is to have the X-coordinate variable as the internally
obtained X-Centroids, and the Y-coordinate variable as the internally obtained Y-
Centroids. However, any other pair of variables can be selected, including non-geographic
ones. The resulting concept of general distance is limited to two dimensions, whereas the
Variables tab allows for higher dimensions as well.
5 The map is customized by selecting Fill Color for Category and Outline Color for Category as
red (instead of the default green), changing the Point Radius to 1, and using Stamen > Toner Lite as
the Basemap (with Transparency set to 30).
The default Distance metric is Euclidean Distance. Other options, appropriate for
unprojected coordinates, are Arc Distance (mi) and Arc Distance (km), for great circle
distance expressed as either miles or kilometers.
The type of weight is set by the Method tab. The Distance band criterion is the default.
Other options are K-Nearest neighbors and Kernel weights.6 The default bandwidth
for distance band weights is the max-min distance. In the current example, this is listed as
about 125 km (the units are meters). The options for inverse distance and Power are
considered in Chapter 12.
For sparsely distributed points, the nearest-neighbor distance (and hence the default cut-off
will be large), whereas dense clusters of locations will encompass many neighbors using this
large cut-off distance.
The properties of the distance band weights can be further investigated by means of the
Connectivity Graph. As before, this is invoked through the right-most button at the
bottom of the weights manager.
The pattern shown in Figure 11.7 highlights how the connectivity is highly uneven, with
very dense areas alternating with much sparser distributions. The connection highlighted in
blue corresponds to the max-min distance. Clearly, this distance is much larger than the
average distance among locations, leading to a highly unbalanced weights matrix.
11.4.2 Isolates
The default distance band ensures that each observation has at least one neighbor, but this
has several undesirable side effects. To create a more balanced weights matrix, one can lower
the distance threshold. However, this will result in observations that do not have neighbors,
so-called isolates or islands.
The Weights File Creation dialog is flexible enough that a specific distance cut-off can
be entered in the box, or the movable button can be dragged to any value larger than the
minimum distance, but smaller than the max-min distance. Sometimes, theoretical or policy
considerations suggest a specific value for the cut-off that may be smaller than the max-min
distance.
When such a threshold is chosen, a warning will appear, pointing out that the specified
cut-off value is smaller than the distance needed to ensure that each observation has at least
one neighbor.
For example, using a cut-off distance of 73 km instead of the default 125 km yields a much
sparser weights matrix (italy_banks_te_d73.gwt), as evidenced by the summary in Figure
11.8. The maximum number of neighbors has been reduced from 95 to 59 (which is still
quite large), but, more importantly, the min neighbors is listed as 0. Compared to the
default, the density has decreased substantially, going from 15.16% non-zero elements to
7.87% non-zero.
The connectivity graph shown in Figure 11.9 illustrates the greater sparsity. As it turns out,
the distance band of 73 km results in only two isolates, highlighted in the figure: one near
Aosta, in the north-west, and a second on the island of Sicily. All other observations are
connected. However, disconnected entities begin to form, i.e., separate components in
the graph that are connected internally, but not to each other.
Further lowering the cut-off distance may result in more meaningful groupings, but at the
expense of a growing number of isolates.
In contrast, the connectivity graph, shown in Figure 11.11, clearly demonstrates how each
point is connected to six other points. In the example, this yields a fully connected graph.
11.4.3.2 Caveat
One drawback of the k-nearest neighbor approach is that it ignores the distances involved. The
first k neighbors are selected, irrespective of how near or how far they may be. This suggests
a notion of distance decay that is not absolute, but relative, in the sense of intervening
opportunities (e.g., one considers the two closest grocery stores, irrespective of how far they
may be).
In addition, the requirement to have exactly k neighbors may create artificial connections,
spanning large distances or bridging across natural and other barriers. For example, in the
graph in Figure 11.11, the blue lines highlight some instances where this is the case. In the
case of the northern isolate, the Aosta location becomes connected to observations located
near Milan. For an observation on the north coast of Sicily, two long connections are included
across the Strait of Messina to the boot of Italy. A similar distant neighbor is identified
for a location in Calabria to a faraway one in the region of Basilicata. It is unlikely that
these connections correspond to meaningful interactions among the community banks.
The relative distance effect should be kept in mind before mechanically applying a k-nearest
neighbor criterion.
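The asymmetry of the k-nearest neighbor relation is easy to verify with a small sketch, here using scipy's k-d tree on randomly generated stand-in coordinates (the variable names are illustrative only):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(12345)
pts = rng.uniform(size=(261, 2))  # stand-in for the 261 bank locations

k = 6
tree = cKDTree(pts)
# query k+1 neighbors because the nearest point to each observation is itself
dist, idx = tree.query(pts, k=k + 1)
neighbors = idx[:, 1:]  # drop self

# k-nearest-neighbor relations need not be symmetric:
# j may be among i's six nearest without i being among j's
i, j = 0, neighbors[0, 0]
print(i in neighbors[j])  # may well be False
```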
With the Variables button checked, the distance computation is carried out in multi-attribute space. Instead of selections for the X and Y coordinates, a drop-down list
with all the variables is available. From this, any number of variables can be selected, not
just two, as in the standard distance case.
In addition, there is a Transformation option that allows the usual standardizations (see
Section 2.4.2.3). The distance metric can be either Euclidean Distance or Manhattan
Distance (great circle distance is not meaningful in this context).
Typically, the variables are first standardized, such that their mean is zero and variance one,
which is the default setting.
All the standard options are available, such as specifying a distance band (in multi-attribute
distance units) or k-nearest neighbors. Such a general distance matrix or dissimilarity matrix
is a required input in the multivariate clustering methods considered in Volume 2.
Figure 11.13: Weights summary properties for point contiguity (Thiessen polygons)
These operations can be carried out explicitly, by actually creating a separate Thiessen
polygon layer or centroid point layer, and subsequently applying the weights operations.
However, in GeoDa, this is not necessary, since the computations happen under the hood. In
this way, it is possible to create contiguity weights for points or distance weights for polygons
directly in the Weights File Creation dialog.
The layout demonstrates how the Thiessen polygon boundaries are perpendicular to the dashed lines that connect the points. The
latter correspond to a queen contiguity relation.
Contiguity weights for a point layer are created in the same way as such weights for polygon
layers. In the Weights File Creation dialog, one selects the Contiguity Weight option
and specifies the type of contiguity. For the Italian community bank example, the summary
properties of a queen contiguity (italy_banks_te_q.gal) are shown in Figure 11.13.7
Compared to the other weights for this point layer, the mean neighbors is 5.78 and the
median neighbors is 6, similar to the k-nearest neighbor results in Figure 11.10. In addition,
the resulting weights are still a bit less dense (i.e., sparser), with a % non-zero of 2.21%
(compared to 2.30% for k-nearest neighbors).
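Outside GeoDa, the same point-contiguity logic is available in libpysal, which constructs the Thiessen polygons under the hood as well; a brief sketch, assuming libpysal is installed and using random stand-in coordinates:

```python
import numpy as np
from libpysal.weights import Voronoi

rng = np.random.default_rng(12345)
pts = rng.uniform(size=(261, 2))  # stand-in for the bank coordinates

# Thiessen-polygon (Voronoi) contiguity computed directly from the points
w = Voronoi(pts)
print(w.mean_neighbors)  # typically close to six for random points
```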
In that application, data on farmers were grouped into districts, and all the farms within the same district were
considered to be neighbors. However, the neighbor relation did not extend across districts.
In other words, farms that may be physically adjoining but in different districts would not
be considered neighbors, in effect turning each district into an island.
The result of this approach is a block-diagonal spatial weights structure. Within each region
or district, all observations are neighbors (row elements of 1, except for the diagonal), but
there is no neighbor relation between the regions, hence the block-diagonal structure.
The block weights structure has important consequences for the properties of estimators
in spatial regression models, but is less useful in an exploratory context, unless used in
combination with other criteria.10
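The block-diagonal structure is straightforward to construct from a vector of region labels; a minimal numpy sketch (libpysal provides an equivalent block_weights function):

```python
import numpy as np

def block_weights(labels):
    """Block-diagonal binary weights: i and j are neighbors iff they
    share the same region label (diagonal excluded)."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    return (same & ~np.eye(len(labels), dtype=bool)).astype(float)

W = block_weights(["north", "north", "south", "south", "south"])
print(W)
# within-region rows are all ones (except the diagonal);
# there are no links across regions
```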
For example, in the left-hand panel, the first cross-sectional unit has six neighbors, with ID values
362, 298, 264, 218, 178 and 177. On the right-hand side, observation 571, which corresponds
to the first cross-sectional unit in the second time period, has the same six neighbors, but
all ID values are incremented by 570. For example, the first neighbor, with STID 932, is
obtained as 362 + 570.
The resulting space-time weights take on a block-diagonal structure, in the sense that the
same cross-sectional block repeats itself for each time period. Formally, this corresponds to
the matrix expression $I \otimes W$, where $I$ is a T-dimensional identity matrix and $W$ is the
cross-sectional spatial weights matrix.12
The specialized weights structure can be used in a pooled analysis, where the space-time
aspect of the data is hidden. The observations are represented as one large cross-section,
with variables stacked by time period, and spatial weights block-diagonal by time period.
Consequently, all time periods are assumed to have the same parameters, such as the same
spatial autocorrelation coefficient.
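The $I \otimes W$ construction can be verified with numpy's Kronecker product; a minimal sketch with an illustrative 3 × 3 cross-sectional matrix:

```python
import numpy as np

# Small cross-sectional weights matrix (n = 3) for illustration
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
T = 3  # number of stacked time periods

# I (x) W: the same cross-sectional block repeats for each time period
W_st = np.kron(np.eye(T), W)
print(W_st.shape)  # (9, 9), i.e., (n*T) x (n*T), block-diagonal
```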
12 In the Kronecker product, each element of the first matrix is multiplied by the second matrix. In other words, since the first matrix is a T × T identity matrix, the result
is a block-diagonal matrix with a block of W for each time period.
13 This is the operational implementation used by Algeri et al. (2022), with the exception that the two
In this third chapter devoted to spatial weights, I consider some more specialized topics,
in the sense that they are less frequently encountered in empirical practice. There are two
main subjects.
First, I discuss two situations where the actual values for the spatial weights take on a
special meaning. So far, only the presence or absence of a neighbor relation has been taken
into account. In this chapter, this is generalized to weights that are transformations of the
pairwise distances between neighbors, i.e., inverse distance functions and kernel weights.
The resulting weights files primarily provide the basis for creating new spatially explicit
variables for use in further analyses, such as in spatial regression specifications.1 The actual
value of the weights themselves is not used in measures of spatial autocorrelation or other
exploratory analyses in GeoDa. As mentioned before, only the existence of a neighbor relation
is taken into account.
There are two important applications for these spatial transformations. One pertains to the
construction of so-called spatially lagged variables. Such variables are used in the exploration
of spatial autocorrelation and in spatial regression analysis. In a second set of applications,
the spatial weights are included as part of rate smoothing operations, as an extension of the
methods covered in Section 6.4.
To illustrate these techniques, I continue to use the point layer from the Italy Community
Banks sample data set and the Oaxaca Development polygon layer.
1 The distance functions in GeoDa provide an alternative and more user-friendly way to calculate the
weights included in PySAL and GeoDaSpace (see Anselin and Rey, 2014, for details).
GeoDa Functions
• Weight File Creation dialog
– inverse distance weights
– kernel weights
– bandwidth options
– diagonal element options
• Table > Calculator > Spatial Lag
– select spatial weights
– row-standardized weights or not
– include diagonal or not
• Map > Rates-Calculated Map
– Spatial Rate
– Spatial Empirical Bayes
• Table > Calculator > Rates
– Spatial Rate
– Spatial Empirical Bayes
Commonly used distance functions are the inverse, with $w_{ij} = 1/d_{ij}^{\alpha}$ (and $\alpha$ as a parameter),
and the negative exponential, with $w_{ij} = e^{-\beta d_{ij}}$ (and $\beta$ as a parameter). The functions are
typically combined with a distance cut-off criterion, such that for $d_{ij} > \delta$, it follows that
$w_{ij} = 0$.
In practice, the parameters are often not estimated, but instead set to a fixed value, such
as $\alpha = 1$ for inverse distance weights ($1/d_{ij}$), and $\alpha = 2$ for gravity weights ($1/d_{ij}^2$). By
convention, the diagonal elements of the spatial weights are set to zero and not computed.
In fact, if the actual value of $d_{ii} = 0$ were used in the inverse distance function, this would
result in a division by zero.
An often overlooked feature in the computation of distance functions is that the resulting
values depend not only on the parameter value and functional form but also on the metric
and scale used for the distance measure. Since the weights are inversely related to distance,
large values for the latter will yield small values for the former, and vice versa. This may be
a problem in practice when the distances are so large (i.e., measured in small units) that the
corresponding inverse distance weights become close to zero, possibly resulting in an all zero
spatial weights matrix. This issue is directly related to the scale of the coordinates. It is a
common concern in practice when a projection yields distances expressed in feet or meters.
In addition, a potential problem may occur when the distance metric is such that distances
take on values less than one. As a consequence, some inverse distance values may be larger
than one, which is typically not a desired result.
A simple rescaling of the coordinates on which the distance computation is based will fix
both problems.
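The scale effect is easy to verify numerically. Using the pairwise distance from Figure 11.5, a short sketch shows how expressing distances in kilometers instead of meters multiplies the inverse distance weights by 1,000:

```python
# Pairwise distance in meters (cf. Figure 11.5)
d_m = 49_615.6221
w_m = 1.0 / d_m            # inverse distance weight: ~2.0155e-05, near zero
w_km = 1.0 / (d_m / 1000)  # same distance in km: ~0.02016

# Rescaling the coordinates (and hence all distances) by 1,000
# multiplies every inverse distance weight by 1,000
print(w_m, w_km, w_km / w_m)
```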
12.2.1.1 Implementation
Inverse distance weights are specified through the Weights File Creation interface, with
the Distance Weight tab selected, as in Figure 11.3. The check box next to Use inverse
distance? enables this functionality. This also activates the Power drop-down list, where a
coefficient can be specified, the default being 1.
The remainder of the interface works in the same fashion as before, with a query for a file
name for the new weights.
Using the Italy Community Banks sample data with the default settings yields a spatial
weights file like italy_banks_te_id1, the summary properties of which are shown in Figure
12.1. The descriptive statistics are identical to those for the distance band weights in Figure
11.4. The only differences are that inverse distance takes on the value of true, and the
power is listed as 1.
By design, GeoDa only takes into account the existence of a neighbor relation. The actual
values of the weights are ignored. Consequently, the properties of any distance function
(irrespective of the power) will be identical to those of the corresponding simple weights for
the same distance band. Also, the connectivity histogram, connectivity map and connectivity
graph will be the same as before. They do not provide any new information.
Inverse distance weights can be calculated for any bandwidth specified. While the default is
the max-min bandwidth, in some applications this is not useful. For example, the bandwidth
can be set to the maximum inter-point distance. As a result, the calculations will be for a
full n × n distance matrix. This is not recommended for larger data sets, but it can provide
a useful point of departure to compute accessibility indices formulated as spatially lagged
variables (see Section 12.3.1).
In addition, the coordinate option is perfectly general, and any two variables contained
in the data set can be specified as X and Y coordinates in the distance band setup. For
example, this allows for the computation of so-called socio-economic weights, where the
inverse operation is applied to the Euclidean distance computed in the attribute domain
between two locations, using any two (but only two) variables.
In contrast, with the Variables tab checked in the Weights File Creation interface, fully
multivariate inverse distance weights can be computed as well, for as many variables as
required. However, for such distances to be meaningful, one has to be mindful of the scale in
which the variables are expressed. Standardization is highly recommended.
Figure 12.2: GWT files with distance for original and rescaled coordinates
To illustrate this, consider the two GWT files just created, using 6 nearest neighbors as the
bandwidth. The left-hand panel of Figure 12.2 shows the outcome when using the default
coordinates (italy_banks_te_idk6.gwt). Note how the value of 2.01549423e-05 for the pair
1-95 is exactly the inverse of the distance of 49,615.6221 (meters), listed on the matching
line of Figure 11.5.
This illustrates the problem just mentioned. Since the projected coordinates for the Italian
bank point file are expressed in meters, the distances are fairly large (e.g., almost 50 km
in the example). As a consequence, the values obtained for inverse distance are very small.
This has undesirable effects for the computation of spatial transformations like a spatially
lagged variable.
The right-hand panel in Figure 12.2 shows the result after the coordinates were rescaled to
kilometers.2 The outcome is simply multiplied by 1,000.
2 Since the projection expresses the coordinates in meters, a new set of variables XKM and YKM is computed by dividing these values
by 1,000. The new XKM and YKM then need to be entered as respectively X_coordinate variable and
Y_coordinate variable in the Weights File Creation interface.
3 An application in spatial econometrics is the heteroskedasticity and autocorrelation consistent (HAC)
estimation of the regression error variance, due to Kelejian and Prucha (2007). See Anselin and Rey (2014)
for implementation details in GeoDaSpace and PySAL.
The kernel weights in GeoDa are based on the following kernel functions:
• Uniform: $K(z) = 1/2$ for $|z| < 1$,
• Triangular: $K(z) = 1 - |z|$ for $|z| < 1$,
• Quadratic or Epanechnikov: $K(z) = (3/4)(1 - z^2)$ for $|z| < 1$,4
• Quartic: $K(z) = (15/16)(1 - z^2)^2$ for $|z| < 1$, and
• Gaussian: $K(z) = (2\pi)^{-1/2} \exp(-z^2/2)$.5
Typically, the value for the diagonal elements of the weights is set to 1, although sometimes
the actual kernel value is used. For example, for a quadratic or Epanechnikov kernel function,
the weight for $d_{ij} = 0$ would in principle be 3/4, although in practice a value of 1 is often
used as well.
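For reference, the kernel functions listed above are easily implemented directly; a sketch assuming only numpy, with z standing for the distance scaled by the bandwidth (the function and argument names are illustrative):

```python
import numpy as np

def kernel(z, kind="epanechnikov"):
    """Kernel functions listed above, with z = d_ij / h_i (bandwidth h_i)."""
    z = np.asarray(z, dtype=float)
    inside = np.abs(z) < 1  # bounded support, except for the Gaussian
    if kind == "uniform":
        return 0.5 * inside
    if kind == "triangular":
        return (1 - np.abs(z)) * inside
    if kind == "epanechnikov":
        return 0.75 * (1 - z**2) * inside
    if kind == "quartic":
        return (15 / 16) * (1 - z**2) ** 2 * inside
    if kind == "gaussian":
        return (2 * np.pi) ** -0.5 * np.exp(-z**2 / 2)
    raise ValueError(kind)

# Diagonal element (d_ii = 0, so z = 0): the kernel value is 3/4 for
# Epanechnikov, although the diagonal is often set to 1 instead
print(kernel(0.0), kernel(0.0, "uniform"))
```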
Many careful decisions must be made in selecting a kernel function. Apart from the choice
of a functional form for K( ), a crucial aspect is the selection of the bandwidth. In the
literature, the latter is found to be more important than the functional form. Several rules
have been developed to determine an optimal bandwidth (e.g., see Silverman, 1986), although
in practice, trial and error may be more appropriate.
An important choice is whether the bandwidth should be the same for all observations, a
so-called fixed bandwidth, or instead allowed to vary, as a variable bandwidth.
A drawback of fixed bandwidth kernel weights is that the number of non-zero weights can
differ considerably from one observation to the next, especially when the density of the point
locations is not uniform throughout space. This is the same problem encountered for the
distance band spatial weights (Section 11.3.1).
GeoDa includes two different implementations of the concept of a fixed bandwidth. One is the
max-min distance used earlier (the largest of the nearest-neighbor distances). The other is
the maximum distance for a given specification of k-nearest neighbors, i.e., the largest
distance among all k-nearest-neighbor pairs. In other words, this is an extension of the
max-min concept from nearest-neighbor pairs to the specific context of k neighbors.
To correct for the issues associated with a fixed bandwidth, a variable bandwidth approach can
be taken. It adjusts the bandwidth for each location to ensure equal or near-equal coverage.
One common approach is to take the k-nearest neighbors, and to adjust the bandwidth
for each location such that exactly k neighbors are included in the kernel function. The
bandwidth specific to each location is then any distance larger than its k-nearest-neighbor
distance, but smaller than its (k+1)-nearest-neighbor distance.
When the kernel weights are implemented for the nearest neighbor concept, a final decision
pertains to the value of k. In GeoDa, the default value for k equals the cube root of the
number of observations.6 In general, a wider bandwidth gives smoother and more robust
results, so the bandwidth should always be set at least as large as the recommended default.
Some experimentation may be needed to find a suitable combination of bandwidth and/or
number of nearest neighbors in each particular application.
4 Note that the Epanechnikov kernel is sometimes referred to without the (3/4) scaling factor. GeoDa
Most of the summary properties are the same as for the corresponding k-nearest neighbor weights. The differences are in the first six items. The type of weights is given as kernel,
the kernel method is identified as Epanechnikov, with the bandwidth definition (knn
7) and adaptive kernel set to true. It is also indicated that the kernel is applied to the
diagonal elements, since kernel to diagonal is true. Also, as for the k-nearest neighbor
weights, the resulting weights are asymmetric.
Since the connectivity histogram, map and graph ignore the actual weights values and are
solely based on the implied connectivity structure, they are identical to those obtained for
the corresponding k-nearest neighbor weights. Consequently, they are not informative with
respect to the properties of the weights themselves.
or,
$$[Wy]_i = \sum_{j=1}^{n} w_{ij} y_j,$$
where the weights $w_{ij}$ consist of the elements of the i-th row of the matrix W, matched up
with the corresponding elements of the vector y.
In other words, the spatial lag is a weighted sum of the values observed at neighboring
locations, since the non-neighbors (those j for which $w_{ij} = 0$) are not included. Typically,
the weights matrix is very sparse, so that only a small number of neighbors contribute to the
weighted sum. For row-standardized weights, with $\sum_j w_{ij} = 1$, the spatially lagged variable
becomes a weighted average of the values at neighboring observations.
In matrix notation, the spatial lag expression corresponds to the matrix product of the n × n
spatial weights matrix W with the n × 1 vector of observations y, or W × y. The matrix
W can therefore be considered to be the spatial lag operator on the vector y.
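A minimal sketch, assuming only numpy, computes the spatial lag for the row-standardized weights of Equation (10.2) and an illustrative variable y:

```python
import numpy as np

# Binary weights from Equation (10.1), row-standardized as in (10.2)
W = np.array([
    [0, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
], dtype=float)
Ws = W / W.sum(axis=1, keepdims=True)

y = np.array([2.0, 4.0, 1.0, 3.0, 5.0, 6.0])
lag = Ws @ y  # [Wy]_i: the average of y at i's neighbors
print(lag)    # e.g., first element = (4 + 3 + 5) / 3 = 4.0
```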
In a number of applied situations, it may be useful to include the observation at location i
itself in the weights computation. This implies that the diagonal elements of the weights
matrix must be non-zero, i.e., $w_{ii} \neq 0$. Depending on the context, the diagonal elements
may take on the value of one or equal a specific value, such as for kernel weights where the
kernel function is applied to the diagonal.
An alternative concept of spatial lag is the so-called median spatial lag. Instead of computing
a (weighted) average of the neighboring observations, the median is selected.
For isolates, the spatial lag entry is left empty rather than set to zero, since a zero would distort subsequent computations, such as that
of a spatial autocorrelation coefficient. The outcome of the empty entry is that the isolate
observations are excluded from such computations.
The effect of the spatial lag operation is a form of smoothing of the original data. The mean
is roughly the same (the theoretical expected values are identical), e.g., 0.6363 for the original
variable vs. 0.6378 for the spatial lag, but the standard deviation is much smaller, i.e., 0.0943
for the original vs. 0.0562 for the spatial lag. Similarly, a form of shrinkage has occurred:
the original variable has a minimum of 0.3496 and a maximum of 0.9143, whereas for the
spatial lag this has narrowed to 0.5124 to 0.7478.7
A measure of spatial correlation can be computed as the correlation between the original
and the spatially lagged variable. Note that this is NOT the accepted notion of spatial
autocorrelation considered in the literature (see Chapter 13). Nevertheless, as long as it is
not interpreted in the context of a spatial autoregressive model, it remains a useful measure
of the linear association between a variable and the average of its neighbors. In the example,
this value is 0.350.8
In practice, this calculation is typically combined with the application of a bandwidth. Such
a bandwidth can be expressed in the form of a distance-band weight, with $w_{ij} = 0$ for
$d_{ij} > \delta$, with $\delta$ as the bandwidth. Formally:
$$[Wy]_i = \sum_j \frac{w_{ij} y_j}{d_{ij}^{\alpha}}.$$
As mentioned, α is either 1 or 2. This expression for the spatial lag corresponds to the
concept of potential, familiar in geo-marketing analyses.9 The potential is a measure of how
accessible location i is to opportunities located in the neighboring locations (as defined by
the weights).
(1947). See also Harris (1954) and Isard (1960) for extensive early discussions.
Figure 12.8: Variables computed with inverse distance weights added to table
Using the Italian bank example, computing the spatial lag with the inverse distance weights based on the
original coordinates (in meters) and a 6-nearest-neighbor bandwidth requires the file
italy_banks_te_idk6.gwt. The result is shown in the column ID_SERV16 of Figure 12.8.
The small values illustrate the problem associated with the original weights.
Instead, with the inverse distances computed from coordinates in kilometers (using the weights file italy_banks_te_idk6km.gwt), the result
in column IDK_SERV16 follows, essentially a rescaled version of ID_SERV16.
For example, for observation with idd=1, the inverse distance weights are given in the
right-hand panel of Figure 12.2. Combining these weights with the observation values for
the neighbors listed in Section 12.3.1.1 yields the value 0.093700 given in the first row of
the IDK_SERV16 column.
When the diagonal is included in the spatial lag computation, the value at i is either added
to the neighboring observations, or averaged with them. In the inverse distance case, the
neighbors are weighted, but it is not obvious what the weight should be for the diagonal,
since a distance of zero does not result in a valid weight.
The convention taken in GeoDa is to keep the diagonal value unweighted. For example,
without row-standardization, this amounts to:
$$[Wy]_i = y_i + \sum_j \frac{y_j}{d_{ij}^{\alpha}}.$$
Again, this can be implemented in combination with a specific bandwidth for the distance.
In some contexts, this may be the desired result, but it is by no means the most intuitive
concept. It should therefore be used sparingly and only when there is a strong substantive
motivation. The corresponding result is listed in the first row of the IDS_SERV16 column.
It is simply the sum of the value for SERV (2016) and IDK_SERV16.
The same problem occurs when using row-standardized weights. Again, in GeoDa the diagonal
is unweighted, which yields:
$$[Wy]_i = y_i + \sum_j w_{ij} y_j.$$
When kernel weights are used, the concept of a bandwidth is already embedded in the
calculation of the kernel.
For the example using Epanechnikov weights with 6 nearest neighbors
(italy_banks_te_epa.kwt), the result is listed in the first row of the column EPA_SERV16.
The value of 1.508157 can be obtained by combining the weights in the right-hand panel
of Figure 12.5 with the neighboring observations from Section 12.3.1.1.
Creating spatially lagged variables and their use in spatial rate smoothing are the only
operations in GeoDa where kernel weights can be applied. They are not available for the
analysis of spatial autocorrelation.
The smoothing has a clear impact on the actual values: the range is reduced from 0.0149–0.3161 for the raw rate to
a much narrower 0.0174–0.2164 for the smoothed rate. The correlation between the two
rates is 0.984.
In general, the spatially smoothed rate for location i takes the form of a ratio of window sums, $\pi_i = \sum_j w_{ij} O_j / \sum_j w_{ij} P_j$, where $O_j$ is the event count in location j, $P_j$ is the population at risk and $w_{ij}$ are the spatial
weights. In contrast to most procedures covered so far, here the weights typically include
the diagonal element as well, i.e., $w_{ii} \neq 0$.
Different smoothing functions are obtained for different spatial definitions of neighbors
and/or different weights applied to those neighbors, such as contiguity weights, inverse
distance weights, or kernel weights.
An early application of this principle was the spatial rate smoother outlined in Kafadar
(1996), based on the notion of a spatial moving average or window average (see also Kafadar,
1997). The window average is not applied to the rate itself, but it is computed separately for
the numerator and denominator.
The simplest case boils down to applying the idea of a spatial window sum (Section 12.3.1.4)
to the numerator and denominator (i.e., with binary spatial weights in both, and including
the diagonal term):
$$\pi_i = \frac{O_i + \sum_{j=1}^{J_i} w_{ij} O_j}{P_i + \sum_{j=1}^{J_i} w_{ij} P_j},$$
with $J_i$ as the number of neighbors of observation i.
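A minimal sketch of this window-sum smoother, assuming only numpy and illustrative event and population counts:

```python
import numpy as np

def spatial_rate(O, P, W):
    """Window-average rate: window sums of events and populations,
    including the diagonal term (binary weights W with a zero diagonal)."""
    O = np.asarray(O, dtype=float)
    P = np.asarray(P, dtype=float)
    num = O + W @ O  # O_i + sum_j w_ij O_j
    den = P + W @ P  # P_i + sum_j w_ij P_j
    return num / den

# Toy example with three mutually contiguous units
W = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
print(spatial_rate([2, 10, 4], [100, 200, 150], W))
```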
A map of spatially smoothed rates tends to emphasize broad spatial trends and is useful for
identifying general features of the data. However, it is not suitable for the analysis of spatial
autocorrelation, since the smoothed rates are autocorrelated by construction. It is also not
very insightful for identifying outlying observations, since the values portrayed are regional
averages and not specific to an individual location. By construction, the values shown for
individual locations are determined by both the events and the population sizes of adjoining
spatial units, which can lead to misleading impressions. More specifically, the counts and
populations of small areas are dwarfed by those of larger surrounding areas. Therefore,
inverse distance weights are sometimes applied to both the numerator and denominator, e.g.,
as in the early discussion by Kafadar (1996). In GeoDa this is not implemented in the rates
calculation, but can be carried out manually in the data table (see Section 12.4.1.2).
Figure 12.10: Spatial rate map: first- and second-order queen contiguity
The map construction follows the same approach as for other rate maps (see Section 6.2.2).
The numerator in the ratio is set as the Event Variable, here DIS20, and the denominator
as the Base Variable, here PTOT20. At the bottom of the variable selection dialog are two
drop-down lists. One contains different Map Themes and the other the spatial Weights.
In the example, two box maps are created, using first- and second-order queen contiguity,
oaxaca_q and oaxaca_q2. The left panel of Figure 12.10 illustrates the former, the right
panel the latter.
Compared to the maps in Figure 12.9, a lot of detail is lost, and the emphasis is on broader
regional patterns, which tend to be dominated by the data for the larger municipalities.
Increasing the range of the contiguity to second-order yields somewhat larger regions of
similar values (in the sense of falling in the same quartile).
The spatial rate smoothing yields an even greater reduction in the variable range than
the non-spatial EB. For first-order contiguity, the range becomes 0.0390 to 0.1370, and
for second-order contiguity 0.0468 to 0.1164. In contrast to EB rates, which remain highly
correlated with the original raw rate, this is no longer the case for spatially smoothed rates.
The averaging over spatial units breaks the connection with the original values somewhat.
In the current example, the correlation between raw rate and the two spatial rates is,
respectively, 0.504 for first-order and 0.468 for second-order contiguity. This should be kept
in mind when interpreting the results.
The spatially smoothed rate map retains all the standard options of a map in GeoDa, as
well as the special options of other rate maps (see Section 6.2.2). The default variable name
for the Save Rates option is R_SPAT_RT. Currently, there are no metadata that keep
track of the spatial weights used in the smoothing operation.
Figure 12.11: Spatial rate map: k-nearest neighbor and inverse distance weights
Figure 12.12: Spatial rate map: inverse distance weights from Calculator
The map on the left is based on k-nearest neighbor weights, with k = 6 (oaxaca_k6),10 the map on the right on inverse distance weights using
the 6 nearest neighbors as a variable bandwidth (oaxaca_idk6). The maps are identical
because only the contiguity information is used in the weights operations.
The proper way to compute spatially smoothed rates for inverse distance or kernel weights
is through the Calculator. A spatially lagged variable must be computed separately for
the numerator and the denominator, using either inverse distance or kernel weights. These new
variables are then used to compute a ratio, as a division operation in the calculator. Figure
12.12 shows a box map with the result (the computed variable is ID_RATE). The pattern
clearly differs from that in the maps in Figure 12.11. The effect of large neighbors is lessened
because of the distance decay effect embedded in the inverse distance weights.
10 The weights were constructed using the Arc Distance (km) option in the Weights File Creation dialog.
Figure 12.13: Spatial Empirical Bayes smoothed rate map using region block weights
In the spatial Empirical Bayes (EB) approach, the reference rate is computed for a spatial window, specific to each individual observation,
rather than taking the same overall reference rate for all observations. This approach only
works when the number of observations in the reference window, as defined by the spatial
weights, is large enough to allow for effective smoothing. This is a context where block
weights become very useful (Section 11.5.3), since they avoid the irregularity in the number
of neighbors on which the reference rate is based. All the observations in the same block
have an identical number of reference points from which the reference rate is computed.11
Similar to the standard EB principle, a reference rate (or prior) is computed from the data.
It is estimated from the spatial window surrounding a given observation.
Formally, the reference mean for location i is then:
$$\mu_i = \frac{\sum_j w_{ij} O_j}{\sum_j w_{ij} P_j}.$$
Note that the average population in the second term pertains to all locations within the
window; therefore, it is divided by $k_i + 1$ (with $k_i$ as the number of neighbors of i). As
in the case of the standard EB rate, it is quite possible (and quite common) to obtain a
negative estimate for the local variance, in which case it is set to zero.
The spatial EB smoothed rate is computed as a weighted average of the crude rate and the
prior, in the same manner as for the standard EB rate (Section 6.4.2.2).
The spatial EB smoothed rate map is invoked as Map > Rates-Calculated Map > Spatial Empirical Bayes. The map is created in
exactly the same way as for the spatial rate map.
In the example, shown in Figure 12.13, block weights based on the region to which a
municipality belongs (oaxaca_reg) are used to define the reference window for each
observation. These weights have on average 95 neighbors, ranging from a minimum of 19 to
154. The left panel of the figure illustrates the regional division, whereas the smoothed rates
are mapped in the right panel.
While there are many similarities with the pattern for the standard EB rate, shown in Figure
12.9, there are also a few differences, such as an additional upper outlier and a number of
observations that shift from the first quartile to the second quartile, as well as from the
fourth quartile to the third. The correlation with the original EB rate is very high, at 0.996,
suggesting overall a great similarity between the two results.
All the usual options are in effect. The rate can be saved with a default variable name of
R_SPAT_EBS.
In this chapter, I begin the discussion of spatial autocorrelation, both in general as well
as focused on the special case of the Moran’s I statistic. I start off with some definitions,
more specifically the notion of spatial randomness as the null hypothesis, and positive and
negative spatial autocorrelation as the alternative hypotheses.
This is followed by an overview of the general concept of a spatial autocorrelation statistic,
with a brief discussion of several specific implementations. Then attention shifts to arguably
the best known such statistic, Moran’s I (Moran, 1948). The formal structure, inference and
interpretation of the statistic are covered, followed by its visualization through the so-called
Moran scatter plot (Anselin, 1996). The chapter closes with an outline of the implementation
of the Moran scatter plot in GeoDa.
To illustrate these concepts, I use the Chicago Community Areas sample data set with socio-
economic variables for the 77 community areas in the city of Chicago (from the American
Community Survey).
Toolbar Icons
Figure 13.3: Spatially random observations – true and randomized per capita income
13.2.2 Randomization
The randomization approach does not require the selection of a specific parametric distribu-
tion for the data. Instead, it takes advantage of the property that under the null hypothesis,
each observation is equally likely to be at any location. By randomly permuting or reshuffling
the data across locations, the concept of spatial randomness can be simulated.
For example, the left hand panel of Figure 13.3 shows the spatial distribution of the income
per capita for the 77 Chicago community areas (from the 2015–2019 ACS) as a quantile map
with eight categories. On the right, the same 77 values are randomly allocated to community
areas, resulting in a totally different map. The true map illustrates the well-known pattern
of higher incomes in the northern part of the city, contrasting with low income in the south
and west. By contrast, the randomized map shows no such pattern. High income areas (dark
brown) are found all over the city, without any apparent structure to their locations.
Parenthetically, the contrast between the two maps illustrates the value of a spatial per-
spective. Taken as such, the statistical distributions of the original and randomized variables
are identical. In Figure 13.4, an eight-category histogram highlights this property. In other
words, the a-spatial perspective on the data distribution offered by the histogram for the
two variables cannot account for the important distinction between their spatial patterns.
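The randomization itself is a one-liner; a sketch assuming only numpy, with a synthetic stand-in for the income variable:

```python
import numpy as np

rng = np.random.default_rng(12345)
income = rng.lognormal(mean=3.0, sigma=0.5, size=77)  # stand-in for per capita income

# One draw from the null of spatial randomness: reshuffle the 77 values
# across the 77 community areas; the histogram is unchanged, only the
# assignment of values to locations differs
randomized = rng.permutation(income)
assert np.allclose(np.sort(income), np.sort(randomized))
```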
In general, a spatial autocorrelation statistic takes on the form $\sum_i \sum_j f(x_i, x_j)\, l(i, j)$, where the sum is over all pairs of observations, $x_i$ and $x_j$ are observations on the variable x
at locations i and j, $f(x_i, x_j)$ is a measure of attribute similarity and $l(i, j)$ is a measure of
locational similarity between i and j.
Different choices for the functions f and l lead to specific spatial autocorrelation statistics.
A few common options are briefly reviewed next.
A cross-product, $x_i x_j$, is a measure of attribute similarity, with extreme large or extreme small values indicating greater similarity (i.e., the
product of adjoining large values, or the product of adjoining small values). In contrast, the
squared and absolute differences, $(x_i - x_j)^2$ and $|x_i - x_j|$, are measures of dissimilarity, with smaller values indicating
more alike neighbors.
When locational similarity is expressed by means of spatial weights, with l(i, j) = w_{ij}, the statistic becomes:

\sum_i \sum_j w_{ij} f(x_i, x_j),

i.e., the sum over all neighbors (i.e., pairs where w_{ij} \neq 0) of a measure of attribute similarity between them. This approach is most appropriate when the observations pertain to a set of discrete locations, a so-called lattice data structure.
A different perspective is taken when the observations are viewed as a sample from a
continuous surface, although this approach can also be applied to discrete locations. Instead
of subsuming the neighbor relations in the spatial autocorrelation statistic, a measure
of attribute similarity is expressed as a function of the distance that separates pairs of
observations:
f(x_i, x_j) = g(d_{ij}),
where g is a function of distance. The function can take on a specific form, as in a variogram
or semi-variogram, or be left unspecified, in a non-parametric approach. This form of
modeling is representative of geostatistical analysis, which concerns itself with modeling
and interpolation on spatial surfaces (Isaaks and Srivastava, 1989; Cressie, 1993; Chilès and
Delfiner, 1999; Stein, 1999). The discussion of this approach is deferred to Chapter 15.
Some commonly used statistics can be viewed as special cases of the general cross-product statistic, using a particular expression for the attribute similarity measure A_{ij}.
13.4.3.1 Cross-product
Arguably the most commonly used spatial autocorrelation statistic is Moran’s I (Moran,
1948), based on a pairwise cross-product as the measure of similarity. Moran’s I is typically
expressed in deviations from the mean, using zi = xi − x̄, with x̄ as the mean. The full
expression is:
I = \frac{\sum_i \sum_j w_{ij} z_i z_j / S_0}{\sum_i z_i^2 / n}, \qquad (13.1)

with S_0 = \sum_i \sum_j w_{ij} as the sum of the weights, and the other notation as before.
This statistic is examined more closely in Section 13.5.1.
13.5 Moran’s I
Moran’s I is arguably the most used spatial autocorrelation statistic in practice, especially
after its discussion in the classic works by Cliff and Ord (Cliff and Ord, 1973, 1981). In
this section, I briefly review its statistical properties and introduce an approach toward its
visualization by means of the Moran scatter plot (Anselin, 1996). The implementation of
the Moran scatter plot in GeoDa is further detailed in Section 13.6.
13.5.1 Statistic
Moran’s I statistic, given in Equation (13.1), is a cross-product statistic similar to a Pearson
correlation coefficient. However, the latter is a bivariate statistic, whereas Moran’s I is a
univariate statistic. Also, the magnitude of Moran's I critically depends on the characteristics of the spatial weights matrix W. This has the perhaps unfamiliar consequence that values of Moran's I are not directly comparable unless they are constructed with the same spatial weights. This contrasts with the interpretation of the Pearson correlation coefficient, where a larger value implies stronger correlation. For Moran's I, this is not necessarily the case, depending on the structure of the spatial weights involved.
The scaling factor for the denominator of the statistic is the familiar n, i.e., the total number
of observations (sample size), but the scaling factor for the numerator is unusual. The latter
is S_0, the sum of all the (non-zero) spatial weights. However, since the cross-product in the numerator is only counted when w_{ij} \neq 0, this turns out to be the proper adjustment. For example, dividing by n^2 would also account for all the zero cross-products, which would not be appropriate.
For row-standardized weights, in the absence of unconnected observations (isolates), the
sum of weights equals the number of observations, S0 = n. As a result, the scaling factors in
numerator and denominator cancel out. Therefore, the statistic can be written concisely in
matrix form as:
I = \frac{z'Wz}{z'z},

with z as an n × 1 vector of observations as deviations from the mean.3
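To make the formulas concrete, the following numpy sketch computes the statistic both from Equation (13.1) and, implicitly, in the matrix form just given, for a small hypothetical example (this illustrates the algebra; it is not GeoDa's internal code):

```python
import numpy as np

def morans_i(x, w):
    """Moran's I: (z'Wz / S0) / (z'z / n); equals z'Wz / z'z when
    w is row-standardized without isolates (S0 = n)."""
    z = x - x.mean()          # deviations from the mean
    s0 = w.sum()              # sum of all spatial weights
    n = len(x)
    return (z @ w @ z / s0) / (z @ z / n)

# four locations on a line, row-standardized contiguity weights
w = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 1.0, 0.0]])
x = np.array([1.0, 2.0, 3.0, 4.0])
print(morans_i(x, w))         # 0.4: similar values are neighbors
```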
13.5.2 Inference
Inference is carried out by assessing whether the observed value for Moran’s I is compatible
with its distribution under the null hypothesis of spatial randomness. This can be approached
from either an analytical or a computational perspective.
This property no longer holds under the randomization assumption. Now, the second-order
moment is obtained as:
E[I^2] = \frac{n\,[(n^2 - 3n + 3)S_1 - nS_2 + 3S_0^2] - b_2\,[(n^2 - n)S_1 - 2nS_2 + 6S_0^2]}{(n - 1)(n - 2)(n - 3)\,S_0^2},

with b_2 = m_4 / m_2^2, the ratio of the fourth moment of the variable over the second moment squared.
In other words, under randomization, the variance of Moran’s I does depend on the moments
of the variable under consideration, which is a more standard result.
Actual inference is then based on a normal approximation of the distribution under the null.
Operationally, this is accomplished by transforming Moran’s I to a standardized z-variable,
by subtracting the theoretical mean and dividing by the theoretical standard deviation:
z(I) = \frac{I - E[I]}{SD[I]},
with SD[I] as the theoretical standard deviation of Moran’s I. The probability of the result
is then obtained from a standard normal distribution.
Note that the approximation is asymptotic, i.e., in the limit, for an imaginary infinitely large
number of observations. It may therefore not work well in small samples.4
Figure 13.7 illustrates a reference distribution based on 999 random permutations for the
Chicago community area income per capita, using queen contiguity for the spatial weights.
The green bar corresponds with the Moran’s I statistic of 0.6011. The dark bars form a
histogram for the statistics computed from the simulated data sets. The theoretical expected
value for Moran’s I depends only on the number of observations. In this example, n = 77,
such that E[I] = −1/76 = −0.0132. This compares to the mean of the reference distribution
of -0.0125. The standard deviation from the reference distribution is 0.0696, which yields
the z-value as:
z(I) = \frac{0.6011 + 0.0125}{0.0696} = 8.82,
as listed at the bottom of the graph. Since none of the simulated values exceed 0.6011, the
pseudo p-value is 0.001.
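The permutation logic can be sketched numerically, building on the morans_i function given earlier (illustrative code, not GeoDa's implementation; the +1 in numerator and denominator yields the 0.001 floor for 999 permutations described above):

```python
import numpy as np

def moran_permutation(x, w, permutations=999, seed=123456):
    rng = np.random.default_rng(seed)
    observed = morans_i(x, w)              # from the earlier sketch
    reference = np.array([morans_i(rng.permutation(x), w)
                          for _ in range(permutations)])
    # one-sided count of simulated statistics at least as large as observed
    m = int((reference >= observed).sum())
    p_sim = (m + 1) / (permutations + 1)   # pseudo p-value
    # z-value based on the mean and standard deviation of the reference
    # distribution, as reported at the bottom of the graph
    z_sim = (observed - reference.mean()) / reference.std()
    return observed, z_sim, p_sim
```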
In contrast, Figure 13.8 shows the reference distribution for Moran’s I computed from the
randomized per capita incomes from the right-hand panel in Figure 13.3. Moran’s I is -0.0470,
with a corresponding z-value of −0.4361, and a pseudo p-value of 0.357. In the figure, the
green bar for the observed Moran’s I is slightly to the left of the center of the graph. In other
words, it is indistinguishable from a statistic that was computed for a simulated spatially
random data set with the same attribute values.
13.5.3 Interpretation
The interpretation of Moran’s I is based on a combination of its sign and its significance.
More precisely, the statistic must be compared to its theoretical mean of −1/(n − 1). Values
larger than E[I] indicate positive spatial autocorrelation, values smaller suggest negative
spatial autocorrelation.
However, this interpretation is only meaningful when the statistic is also significant. Otherwise,
one cannot really distinguish the result from that of a spatially random pattern, and thus
the sign is meaningless.
Figure 13.8: Reference distribution, Moran’s I for randomized per capita income
It should be noted that Moran’s I is a statistic for global spatial autocorrelation. More
precisely, it is a single value that characterizes the complete spatial pattern. Therefore,
Moran’s I may indicate clustering, but it does not provide the location of the clusters. This
requires a local statistic.
Also, the indication of clustering is the property of the pattern, but it does not suggest
what process may have generated the pattern. In fact, different processes may result in the
same pattern. Without further information, it is impossible to infer what process may have
generated the pattern.
This is an example of the inverse problem (also called the identification problem in econometrics). In the spatial context, it boils down to the failure to distinguish between true
contagion and apparent contagion. True contagion yields a clustered pattern as a result
of spatial interaction, such as peer effects and diffusion. In contrast, apparent contagion
provides clustering as the result of spatial heterogeneity, in the sense that different spatial
structures create local similarity.
This is an important limitation of cross-sectional spatial analysis. It can only be remedied by
exploiting external information, or sometimes by using both temporal and spatial dimensions.
Figure 13.9: Moran scatter plot, community area per capita income
For example, for the Chicago community area income per capita, the z-value with queen contiguity weights is 8.82, whereas for k-nearest neighbor weights, it is 9.46, instead suggesting the latter shows stronger spatial correlation.
Comparison of Moran’s I among different variables is only valid when the same weights are
used. For example, percent population change between 2020 and 2010 yields a Moran’s I of
0.410 using queen contiguity, with a corresponding z-value of 6.30. In this instance, there is
consistent evidence that income per capita shows a stronger pattern of spatial clustering.
While the slope of the linear fit in the Moran scatter plot provides a way to visualize the magnitude of Moran's I, inference should still be based on a permutation approach, and not on a traditional t-test to assess the significance of the regression line.
Figure 13.9 illustrates the Moran scatter plot for the income per capita data for Chicago
community areas (based on queen contiguity). The value for income per capita is shown
in standard deviational units on the horizontal axis, with mean at zero. The vertical axis
corresponds to the spatial lag of these standard deviational units. While its mean is close to
zero, it is not actually zero. Hence the horizontal dashed line is typically not exactly at zero.
The slope of the linear fit, in this case 0.611, is listed at the top of the graph.
In the map, the selected observation is surrounded by neighbors that are mostly of a higher quartile. One neighbor in particular belongs to the upper outliers, which will pull up the value for the spatial lag. The corresponding selection in the scatter plot on the left is indeed in the low-high quadrant.
One issue to keep in mind when assessing the relative position of the value at a location relative to its neighbors is that a box map reduces the distribution to six discrete categories. These are based on quartiles, but the range between the lowest and highest observation in a given category can be quite large. In other words, it is not always intuitive to assess the high-low or low-high relationship from the categories in the map.
13.6.2.1 Randomization
As such, the slope of the linear fit only provides an estimate of Moran’s I, but it does not
reveal any information about the significance of the test statistic. The latter is obtained
by means of the Randomization option, which implements a permutation approach. Four
hard-coded choices are provided for the number of permutations: 99, 199, 499 and 999.
In addition, there is an Other option, which allows for any value up to 99,999. As a
consequence, the smallest result that can be obtained for the pseudo p-value is 0.00001.
This implies that out of 99,999 randomly generated data sets, none yielded a computed
Moran’s I that equals or exceeds the one obtained from the actual data.
The result for a choice of 999 is shown in Figure 13.7, with an associated pseudo p-value of
0.001 for the spatial distribution of the Chicago community area per capita income data.
The graph provides strong evidence of spatial clustering of similar values.
As detailed in Section 13.5.2.2, the reference distribution is accompanied by several descriptive
statistics, such as the theoretical (E[I]) and empirical means, the standard deviation and the
associated z-value.
The Run button allows for repeated simulations in order to assess the sensitivity of the result
to the particular run of random numbers that is used. For a large number of permutations
such as 999, one run is typically sufficient.
The results of the permutations may differ slightly when a different number of CPU cores is used.6 In order to control for any such discrepancies, it is possible to Set the number of CPU cores manually in the Preferences.
As a result, the slopes of the Moran scatter plots on the left and the right are as if the data set consisted only of the selected/unselected points. This is a visual implementation
of a regionalized Moran’s I, where the indicator of spatial autocorrelation is calculated for
a subset of the observations (Munasinghe and Morris, 1996). As the brush is moved over
the graph, the selected and unselected scatter plots are updated in real time, showing the
corresponding regional Moran’s I statistics in the panels.
An alternative perspective is to approach the brushing starting with the map, as in Figure
13.14. In the left panel, 22 community areas are selected, with the associated observations
highlighted in the Moran scatter plot to the right. The Moran’s I corresponding to the
selection is 0.006, with 0.583 for the complement, again suggesting somewhat of a structural
break.
14
Advanced Global Spatial Autocorrelation
In this chapter, I continue with the treatment of global spatial autocorrelation. First, I
consider some special applications of the Moran scatter plot principle to time differenced
variables and to proportions. These cases provide some short cuts to the computations
needed to construct the graph, but otherwise are identical to the standard plot covered in
Chapter 13.
Next, I cover the topic of bivariate spatial correlation, a common source of confusion in
practice. After presenting the principle behind a bivariate Moran scatter plot, attention
focuses on the special bivariate situation where the observation on a variable at one point
in time is related to its neighbors in a different time period, i.e., space-time correlation. A
critical review is provided of different perspectives and their relative merits and drawbacks.
To illustrate these methods, I go back to the Oaxaca Development sample data set.
In the first example, the data for the same variable are available at different points in time.
Rather than having to explicitly compute change values between two points in time, the
differential Moran scatter plot provides a short cut, in the sense that the differences are
computed under the hood, as part of the scatter plot construction.
The second situation is the familiar case where the variable of interest is a rate or proportion.
As discussed at length in Chapter 6, the inherent variance instability of such data is
something that needs to be accounted for. This is especially important in the context of
spatial autocorrelation, since an important requirement for statistics such as Moran's I to be valid is a constant variance. The Moran's I with EB Rate functionality implements the necessary standardization.
A differential Moran's I is then the slope in a regression of the spatial lag of the difference, \sum_j w_{ij} (y_{j,t} - y_{j,t-1}), on the difference itself, (y_{i,t} - y_{i,t-1}).
In GeoDa, the slope computation is applied to the standardized value of the difference (i.e.,
the standardization of yi,t − yi,t−1 ), and not to the difference between the standardized
values.
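In script form, this convention amounts to standardizing the difference first and then applying the ordinary statistic (a minimal sketch, reusing the morans_i function from the Chapter 13 sketch):

```python
import numpy as np

def differential_morans_i(y_t, y_t1, w):
    """Moran's I of the standardized difference y_t - y_t1 (the GeoDa
    convention: standardize the difference, not the standardized values)."""
    d = y_t - y_t1
    z = (d - d.mean()) / d.std()   # standardize the difference itself
    return morans_i(z, w)          # from the earlier sketch
```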
14.2.1.1 Implementation
The differential Moran scatter plot functionality is the third item in the drop-down list
activated by the Moran scatter plot toolbar icon (Figure 13.1). Alternatively, it can be
started from the main menu as Space > Differential Moran’s I.
Figure 14.2: Change in access to health care between 2010 and 2020
Note that this option is only available for data sets with grouped variables (see Section 9.2.1).
Without grouped variables, a warning message is generated.
The functionality is illustrated with a time enabled version of the Oaxaca Development
sample data set (see Chapter 9). The relevant variables include p_P614NS, p_JOB and
p_PHA, among others.
Before proceeding with the actual differential Moran graph, Figure 14.1 illustrates the
spatial pattern of access to health care in 2010 and 2020, p_PHA, as box maps.1 The
classification is relative and therefore does not reflect that the values for access to health
care have improved rather dramatically between these two years: for example, the mean
changed from 53.4% to 75.8%. The spatial distribution shows some differences, especially in
the center of the map, where several municipalities moved from a higher quartile (brown) to
a lower quartile (blue), including 14 lower outliers in 2020.
1 Note that the item in the variable selection list is initially given as p_PHA (all times). After selection, this becomes, respectively, p_PHA (2010) and p_PHA (2020).
Figure 14.3: Moran scatter plot for access to health care in 2010 and 2020
A box map for the first difference between 2020 and 2010 (computed in the table) is shown
as Figure 14.2. In contrast to the map for the individual years, the first differences include
both lower (2) and upper outliers (8), i.e., areas that improved much less or much more than
others, relatively speaking. It is for this spatial pattern that the differential Moran’s I is
computed.
To provide some context, the individual Moran scatter plots for p_PHA in 2010 and 2020
are shown in Figure 14.3, computed using queen contiguity weights (Oaxaca_q.gal). Both
Moran’s I statistics are positive and highly significant, suggesting strong clustering. For
2010, the statistic is 0.221, with an associated z-value of 8.371, for 2020, it is 0.286, with an
associated z-value of 11.058. Using 999 permutations, both cases yield a pseudo p-value of
0.001.
After starting the differential Moran’s I functionality, a Differential Moran Variable
Settings interface provides the means to select the variable and time interval. The drop-
down list by Select variable contains only the grouped variables in the data set, such as
p_PHA for the example. Next, two time periods need to be selected. The values for the
time periods are determined by their definition in the Time Editor (see Section 9.2.1). For
the illustration, the years 2020 and 2010 are selected (in this order). Finally, the Weights
drop-down list contains the available spatial weights, here using oaxaca_q.
The resulting differential Moran scatter plot is shown in Figure 14.4. Moran’s I is 0.279, with
an associated z-value of 10.611 (from a randomization), again suggesting a strong pattern of
clustering. In other words, not only is there clustering in the values of the percent health
access, there is also clustering in their change over the ten year period. However, the global
Moran’s I provides no information as to where the clustering may happen, nor whether it is
driven by high or by low values. This is a common misinterpretation of the global Moran’s I
statistic.
The differential Moran scatter plot has the same options as the regular Moran scatter
plot, with one exception. The Save Results option provides a way to not only save the
Standardized Data (as MORAN_STD) and the Spatial Lag (as MORAN_LAG),
as in the regular case, but also Diff Values (as DIFF_VAL), the value of the difference
before standardization. Clearly, applying a regular Moran scatter plot to this new variable
provides the same result as the differential Moran scatter plot. The latter simply saves the
effort of first having to construct the differences.
Figure 14.4: Differential Moran scatter plot for access to health care 2020–2010
with n as the total number of observations (in other words, P/n is the average population).
One problem with the method of moments estimator is that the expression for α could yield
a negative result. In that case, its value is typically set to zero, i.e., α = 0. However, in
Assunção and Reis (1999), the value for α is only set to zero when the resulting estimate for
the variance is negative, that is, when α + β/Pi < 0. Slight differences in the standardized
variates may result depending on the convention used. In GeoDa, when the variance estimate
is negative, the original crude rate is used.
Also, after the EB adjustment, the rates are further standardized to have a mean of zero
and a variance of one, in the usual fashion for a Moran scatter plot.
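The computation can be sketched as follows, using the notation of the text, with β as the overall rate and α from the method of moments (an illustrative numpy version of the Assunção-Reis standardization, with the conventions described above; treat it as a sketch rather than a replica of GeoDa's code):

```python
import numpy as np

def eb_standardized_rate(events, population):
    r = events / population                      # crude rates
    beta = events.sum() / population.sum()       # overall rate
    p_bar = population.mean()                    # average population, P/n
    s2 = (population * (r - beta) ** 2).sum() / population.sum()
    alpha = s2 - beta / p_bar                    # may be negative
    v = alpha + beta / population                # variance estimate per area
    # GeoDa convention: keep the crude rate when the variance estimate
    # is negative
    z = np.where(v > 0, (r - beta) / np.sqrt(np.maximum(v, 1e-12)), r)
    # finally, standardize to mean zero and variance one, in the usual
    # fashion for a Moran scatter plot
    return (z - z.mean()) / z.std()
```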
14.2.2.1 Implementation
The Moran scatter plot for standardized rates is invoked from the main menu as Space >
Moran’s I with EB Rate, or as the fourth item in the drop-down list from the Moran
scatter plot toolbar icon, Figure 13.1.
To illustrate this procedure, the same disability rate example for Oaxaca municipalities is
used as in Section 12.4.0.1, i.e., the ratio of the number of disabled persons (DIS20) over
the municipal population (PTOT20) in 2020. The crude rate and EB smoothed rate are
mapped in Figure 12.9.
The left panel of Figure 14.5 shows the Moran scatter plot obtained for the crude rate,
again using queen contiguity for the spatial weights. The Moran statistic is 0.2649, with an
associated z-value (from 999 permutations) of 10.354.
The right hand panel is for the EB Moran scatter plot. This is obtained through the same
variable selection interface as for spatially smoothed rates in Section 12.4.1.1, with a variable
for the numerator (DIS20) and for the denominator (PTOT20), as well as the spatial
weights (oaxaca_q).
The result for the EB Moran’s I is only slightly different from the crude rate, yielding a
statistic of 0.2680, with associated z-value of 10.452. However, the values along the horizontal
axis show the effect of the standardization, with a much smaller range than for the crude
rate.
In practice, the differences between Moran’s I for the crude rate and EB standardized rate
tend to be minor.
The EB standardized Moran scatter plot has all the same options as the conventional Moran
scatter plot, except again for the Save Results feature. In addition to the Standardized
Data (the values used in the calculation of the Moran scatter plot), and their Spatial Lag,
the EB Rates (i.e., before the standardization used in the Moran computation) can be
saved, with default variable name MORAN_EB.
Figure 14.5: Moran scatter plot for raw rate and EB Moran scatter plot
between the correlative aspect and the spatial aspect. This is not pursued further.
As before, all variables are expressed in standardized form, such that their means are zero
and their variance one. In addition, the spatial weights are row-standardized.
Note that, unlike in the univariate autocorrelation case, the regression of x on W y also yields
an unbiased estimate of the slope, providing an alternative perspective on bivariate spatial
correlation (see Section 14.4). In the case of the regression of x on W x, the explanatory
variable W x is endogenous, so that the ordinary least squares estimation of the linear fit
is biased. However, with W y referring to a different variable, the so-called simultaneous
equation bias becomes a non-issue, and OLS in a regression of x on W y has all the standard
properties (such as unbiasedness).
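In scripted form, with standardized variables and row-standardized weights, the slope of the regression of Wy on x is x'Wy / x'x. The sketch below also reshuffles the whole y vector while x stays fixed, a simple approximation to the permutation scheme discussed later in this chapter (hypothetical inputs; illustrative only):

```python
import numpy as np

def bivariate_morans_i(x, y, w, permutations=999, seed=123456):
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    observed = zx @ w @ zy / (zx @ zx)     # slope of Wy on x
    rng = np.random.default_rng(seed)
    # hold x fixed, randomly reallocate y across locations
    reference = np.array([zx @ w @ rng.permutation(zy) / (zx @ zx)
                          for _ in range(permutations)])
    m = int((reference >= observed).sum())
    return observed, (m + 1) / (permutations + 1)
```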
A special case of bivariate spatial autocorrelation is when the variable is measured at two
points in time, say zi,t and zi,t−1 , as in Section 14.2.1. The statistic then pertains to the
extent to which the value observed at a location at a given time is correlated with its value
at neighboring locations at a different point in time.
The natural interpretation of this concept is to relate z_{i,t} to \sum_j w_{ij} z_{j,t-1}, i.e., the correlation between a value at t and its neighbors in a previous time period:

I_T = \frac{\sum_i \left( \sum_j w_{ij} z_{j,t-1} \right) z_{i,t}}{\sum_i z_{i,t}^2},
which would have the future predicting the past. This contrasts with the linear regression
specification used (purely formally) to estimate the bivariate Moran’s I, for example:
\sum_j w_{ij} z_{j,t-1} = \beta_3 z_{i,t} + u_i.
Figure 14.6: Bivariate Moran scatter plot for access to health care 2020 and its spatial lag in
2010
In terms of the interpretation of a dynamic process, only β1 has intuitive appeal. However,
in terms of measuring the degree of spatial correlation between past neighbors and a current
value, as measured by the linear fit in a Moran scatter plot, β3 is the correct interpretation.
In the univariate case, only the specification with the spatially lagged variable on the left
hand side yields a valid estimate. As a result, for the univariate Moran’s I, there is no
ambiguity about which variables should be on the x-axis and y-axis. In contrast, in the
bivariate case, both options are valid, although with a different interpretation.
Inference is again based on a permutation approach, but with an important difference. Since
the interest focuses on the bivariate spatial association, the values for x and y are fixed at
their locations, and only the remaining values for y are randomly permuted. In the usual
manner, this yields a reference distribution for the statistic under the null hypothesis that
the spatial arrangement of the remaining y values is random. It is important to keep in mind
that since the focus is on the correlation between the x value at i and the y values at the
neighboring locations, the correlation between x and y at location i is ignored.
To illustrate this functionality, the same variables are used as in Section 14.2.1, again with
queen contiguity spatial weights (oaxaca_q). For the bivariate Moran’s I, the x-variable is
then p_PHA (2020) and the y-variable its spatial lag in 2010, W_pPHA (2010).
The resulting graph is shown in Figure 14.6. The linear fit is almost horizontal, reflected
by a Moran’s I of −0.008. Randomization, using 999 permutations, yields a pseudo p-value
of 0.363, clearly not significant. This result may seem surprising, given the strong spatial
autocorrelation found for p_PHA in each of the individual years (Figure 14.3). One possible
explanation for this finding is that whereas there is strong evidence of clustering in each
year, the location of the clusters may be different. Therefore, the relationship between the
value at a given location and its neighbors in a different year may be weak, as is the case in
this example.
All the same options as for the univariate Moran scatter plot apply here as well. Also, the
standardized value of the x variable and the spatial lag of the y-variable (applied to its
standardized value) can be saved to the table, in the same way as for the univariate Moran
scatter plot.
plot, or by using the saved standardized values and their spatial lags from the Moran scatter plot for each
individual year (by means of the Save Results option).
Figure 14.7: Correlation between access to health care in 2020 and 2010
The result is shown in Figure 14.8. The regression slope of 0.073 is not significant, with a
p-value of 0.132. Also, the overall fit is very poor, given the R2 of 0.004. This is no surprise,
given the ball-like shape of the scatter plot. In other words, while there is a relationship
between the health access for two different time periods, this does not hold for the neighbors
(spatial lags). This finding provides additional support for the notion that the clusters in the
two years may be in different locations. If they were in the same locations, then the spatial
lags for the two time periods would tend to be correlated.
Figure 14.8: Correlation between the spatial lag of access to health care in 2020 and the
spatial lag in 2010
Figure 14.9: Correlation between access to health care in 2020 and its spatial lag in 2010
Figure 14.10: OLS regression of pPHA 2020 on pPHA 2010 and W PHA 2010
4 See Anselin and Rey (2014) for specifics on both the methods and their implementation in GeoDa.
15
Nonparametric Spatial Autocorrelation
In this chapter, the spatial weights matrix is abandoned in the formulation of a spatial
autocorrelation statistic. Instead, a measure of attribute (dis)similarity is related directly
to the distance that separates pairs of observations, in a non-parametric fashion. Two
approaches covered are the spatial correlogram and the smoothed distance scatter plot.
To illustrate these methods, I will use the Italy Community Banks point layer sample data
set.
Toolbar Icons
Figure 15.2: Spatial distribution of loan loss provisions – Italian banks 2016
The main interest in the spatial correlogram is to determine the range of interaction, i.e.,
the distance within which the spatial correlation is positive. In addition, following Tobler’s
law (Tobler, 1970), the strength of the correlation should decrease with distance.
Beyond the range of interaction, the correlation should become negligible, although in
practice it can still show fluctuations.1
This brings up a Correlogram Parameters dialog with the default parameter settings
and a graph in the background. However, this initial graph is almost never informative at
this point, since it shows a correlogram for the first variable, which is usually an ID value.
The initial layout of the interface is shown in the left-hand panel of Figure 15.3. The items
at the top of the dialog are the Variable (available from a drop-down list), the Distance
metric (default is Euclidean Distance, but Arc Distance is also available, expressed
either as miles or as kilometers). In the example, since the data set is time enabled, the
Variable also has an associated Time drop-down list. For the initial default setting, LLP
(2016) and 2016 are used with Euclidean Distance (given the projection used, this
distance is in meters).
Next follows the Number of Bins. The non-parametric correlogram is computed by means
of a local regression on the spatial correlation estimated from the pairs that fall within each
distance bin. The number of bins determines the distance range of each bin. This range is
the maximum distance divided by the number of bins. The more bins are chosen, the more
fine-grained the correlogram will be. However, this also potentially can lead to too few pairs
in some bins (the rule of thumb is to have at least 30).
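The essence of this computation is easy to sketch with numpy and scipy: standardize the variable, compute all pairwise products and distances, and average the products within each distance bin. GeoDa then fits a local regression through these bin estimates; the code below is an illustration, not the GeoDa algorithm:

```python
import numpy as np
from scipy.spatial.distance import pdist

def binned_correlogram(coords, x, n_bins=10, max_distance=None):
    z = (x - x.mean()) / x.std()
    d = pdist(coords)                      # pairwise distances, (n^2 - n)/2
    i, j = np.triu_indices(len(x), k=1)    # same pair ordering as pdist
    prod = z[i] * z[j]                     # pairwise attribute products
    if max_distance is not None:           # optional distance cut-off
        keep = d <= max_distance
        d, prod = d[keep], prod[keep]
    edges = np.linspace(0.0, d.max(), n_bins + 1)
    which = np.digitize(d, edges[1:-1])    # bin index for each pair
    corr = np.array([prod[which == b].mean() if np.any(which == b)
                     else np.nan for b in range(n_bins)])
    counts = np.array([(which == b).sum() for b in range(n_bins)])
    return edges, corr, counts             # check counts >= 30 per bin
```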
The number of pairs contained in each bin is a result of the interplay between the number of
bins and the maximum distance. As the default, All Pairs are used, with Max Distance
unchecked. In the example, with 261 observations, this yields (261² − 261)/2 = 33,930 pairs,
as listed in the Estimated Pairs box.
In many instances in practice, using all pairs may not be a good choice. For example, when
there are many observations, the number of pairs quickly becomes unwieldy and the program
will run out of memory. Also, the correlations computed for pairs that are far apart are not
that meaningful, since they should be zero due to Tobler’s law. Therefore it is often practical
to truncate the observations to only those pairs that are within a reasonable distance range.
The bottom half of the dialog provides options for fine-tuning these choices. These are
considered in Sections 15.3.2.1 and 15.3.2.2.
With the default set as in the left-hand panel of Figure 15.3, the spatial correlogram shown
in Figure 15.4 is created.
15.3.1.1 Interpretation
The top half of the graph in Figure 15.4 is the actual correlogram that depicts how the
spatial autocorrelation changes with distance. Hovering the pointer over each blue dot gives
the spatial autocorrelation associated with that distance band in the status bar. For example,
the second dot corresponds with an autocorrelation of −0.024 for observations within the
distance band from 119.4 km to 238.9 km, as shown in the status bar (the distances in the
graph are expressed in meters). The first dot corresponds with an autocorrelation of 0.081.
The intersection between the correlogram and the dashed zero axis, which determines the
range of spatial autocorrelation, happens in the midpoint of the second range, or roughly
around 179 km. Beyond that range, the autocorrelation fluctuates around the zero line.
The bottom half of the graph consists of a histogram that shows the number of pairs of
observations in each bin. Hovering the pointer over a given bin shows in the status bar
how many pairs are contained in the bin. In the example, each bin has more than sufficient
observation pairs. Even the last bin, which seems small (a function of the vertical scale),
contains 102 pairs to compute the average autocorrelation.
The View > Display Statistics option generates a list of descriptive statistics for each
bin. The computed autocorrelation is provided, as well as the distance range for the bin
(lower and upper bound), and the number of pairs used to compute the statistic. In addition,
there is a summary with the minimum and maximum distance, the total number of pairs
and an estimate for the range, i.e., the distance at which the estimated autocorrelation first
becomes zero (see the discussion of Figures 15.5 and 15.6 for an illustration). The other
View options are the familiar Set Display Precision, Set Display Precision on Axes
and Show Status Bar. The latter is active by default.
Finally, Save Results provides a record of the descriptive statistics of the correlogram in a
text file (with file extension csv). The file contains the information listed at the bottom of
the graph when View > Display Statistics has been selected, with a column matching
each bin. The file includes the estimates of the spatial autocorrelation, bin ranges, number
of observations in each bin, as well as the summary.
means of the slider, it is only an approximation. The actual number of pairs used in the calculation of the
pairwise correlations is given in the status bar of the correlogram.
To restrict the range, the Max Distance box is checked and a cut-off of 150,000 (150 km) is entered. Combined with the default Number Bins of 10, and with View > Display Statistics activated, this yields the correlogram in Figure 15.5.
The graph header lists the cut-off, but also indicates that all pairs have been used. This should be interpreted in the sense that all pairs that meet the distance cut-off are used in the computations, as opposed to a random sample (Section 15.3.2.2).
The distance range for each bin is now 15 km (15,000 in the graph) and, as a result, the
range is estimated to be between 45 km and 60 km, contrasted with more than double the
figure obtained in Figure 15.4. Also, the initial spatial autocorrelation, for pairs within 15 km of each other, is 0.506, compared to 0.081 for the first range in the default setting. The descriptive statistics depict the change of the estimated autocorrelation coefficient with distance at a much more fine-grained level than before. The correlogram shows the desired shape, decreasing with distance until the range is reached and more or less flat afterward.
The computations are based on 6,579 pairs, less than a fifth of the original total.
A further refinement can be obtained by increasing the number of bins for the same maximum
distance. In Figure 15.6, the correlogram is depicted for 15 bins with the 150 km cut-off (the
settings shown in the right-hand panel of Figure 15.3), resulting in a range of 10 km for each
bin. The zero autocorrelation is estimated to occur between 50 and 60 km, an even narrower
band than for Figure 15.5. The autocorrelation in the first 10 km range is 0.503, basically
the same as in the first 15 km range shown before. The overall shape of the correlogram also
remains the same.
Overall, this suggests quite a strong pattern of positive spatial autocorrelation at short
distances, but negligible association beyond distances of 60 km. This highlights a different
aspect of spatial autocorrelation than can be provided by a statistic like Moran’s I, which
considers all pairs that meet the spatial weights criterion.
In the example used here, this is not really necessary, since the default sample size of
1,000,000 that is used to generate the random sample exceeds the current total number of
pairs in the data set. To implement this option, the Random Sample radio button needs
to be checked and a sample size specified if the default of 1 million is not desired. Also, to
allow exact replication, the Use Specified Seed option should be checked (the seed can be
adjusted by means of the Change button).
In practice, the sampling approximation is quite good, as long as the selected sample size is
not too small relative to the original data size.
with i and j as the observation locations. Note that this is not the squared difference, as in the semi-variogram, but its square root, as a direct measure of distance.
The geographic Euclidean distance between the observation points is then the familiar:

d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}.
Whereas the geographic distance has an immediate interpretation in terms of distance units
(e.g., kilometers), this is not the case for the attribute distance matrix. In practice, it is
therefore advisable to first standardize the variable, so that attribute distance is expressed
in standard deviational units.
An example of the resulting scatter plot is shown in Figure 15.7 for the variable LLP (2016)
from the Italy Community Banks data set. The graph shows a slow increase up to about
120 km, after which it is mostly flat (but see Section 15.4.3 for more detailed options and
interpretation).
The idea of combining a measure of distance in attribute space with geographic distance
goes back at least to an example in Oden and Sokal (1986). They implemented a Mantel
test (Section 13.4.3) to assess the similarity between the elements of a geographic distance
matrix and a variable dissimilarity matrix. However, their approach is less flexible than the
smoothed distance scatter plot and does not lend itself readily to an extension to multiple
variables (Section 15.4.1).
One may be tempted to apply a linear fit to the scatter plot as an intuitive measure of
the association between the two variables. However, this would imply a linear relationship
between the two, whereas Tobler’s law suggests a non-linear distance decay, with a range
beyond which there is no association, in the same fashion as shown for the spatial correlogram.
However, in contrast to the correlogram, the smoothed distance scatter plot increases with
distance until the range is reached, since it involves a measure of dissimilarity.
A non-parametric local regression fit is applied to the smoothed distance scatter plot to
extract the overall pattern. Specifically, the implementation in GeoDa is based on the loess
approach (Cleveland et al., 1992).
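A rough scripted analogue uses statsmodels' lowess, which fits local linear (degree 1) rather than quadratic polynomials, so it approximates but does not replicate the loess fit used by GeoDa. Coordinates, attribute values and the seed below are hypothetical stand-ins for the Italian banks example:

```python
import numpy as np
from scipy.spatial.distance import pdist
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(123456)
coords = rng.uniform(0, 1_200_000, size=(261, 2))  # hypothetical points (m)
x = rng.normal(size=261)                           # hypothetical attribute

z = (x - x.mean()) / x.std()          # attribute distance in std. dev. units
geo_d = pdist(coords)                 # geographic distance for each pair
i, j = np.triu_indices(len(x), k=1)
attr_d = np.abs(z[i] - z[j])          # |z_i - z_j|, not its square

keep = geo_d <= 150_000               # 150 km cut-off, as in the text
# smoothed fit through the (distance, dissimilarity) cloud; frac plays
# the role of the span (0.75 matches the GeoDa default)
fit = lowess(attr_d[keep], geo_d[keep], frac=0.75)  # columns: x, fitted y
```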
As for the spatial correlogram, the main interest in this graph is to obtain an estimate of
the range of interaction, i.e., the point where the curve ceases to increase with distance and
begins to flatten out.
Figure 15.8: Smoothed distance scatter plot with 150 km distance cut-off
of some 1,200 km). As in the spatial correlogram, this yields 17,095 pairs to be considered,
compared to the 33,930 pairs in the default case. A different option is to use 1/2 of the
diagonal of the bounding box, another rule of thumb often used for empirical variograms.
The All Pairs button is checked by default. The alternative is to use Random Sample,
which works in the same way as for the spatial correlogram.
These default settings yield the plot in Figure 15.7, but without the actual points, which are
not shown by default. Using a more reasonable cut-off distance of 150 km yields the graph
in Figure 15.8, based on the computation for 6,579 observation pairs. In this plot, the range
seems to be between 70 and 80 km, slightly higher than for the spatial correlogram in Figure
15.6.
Further options to refine the visualization are described in Sections 15.4.3 and 15.4.3.1.
Figure 15.9: Smoothed distance scatter plot with 150 km distance cut-off and customized
Y-Axis
The default View setting is to show the status bar. View > Show Data Points adds
all the points to the scatter plot. As mentioned, this is turned off by default. The Color
options allow the color to be specified for the regression line, the scatter plot points and the
background.
An important option for effective visualization is to adjust the range on the Y-axis. This
is implemented through Axis Option > Adjust Value Range of Y-Axis. It allows the
minimum and maximum values for the Y-Axis to be customized, resulting in a clearer view
of the shape of the graph. For example, setting the maximum at 1.5 in Figure 15.8, yields
the plot shown in Figure 15.9.
In addition, the Axis Option can be used to specify the display precision, in the usual way.
Finally, Save Results will create a comma-separated value (csv) file that contains all the
scatter plot points with their X and Y coordinates. This file can then be used as input to
more advanced statistical analyses.
Figure 15.10: Smoothed distance scatter plot with 150 km distance cut-off and span 0.5
Figure 15.11: Smoothed distance scatter plot with 150 km distance cut-off, span 0.5, linear
fit
The Span option sets the degree of smoothing, i.e., the share of the observations used in each local fit. The netlib default is 0.75, which is also the default value in GeoDa. This corresponds to the formal smoothing parameter α that is used in some other software implementations. The observations within the maximum range from the fit point are weighted as in a kernel regression.6
The Degree of the polynomials is 2 by default, for a quadratic model, but 1 (linear)
is another (less often used) option. Finally, the Family setting selects the type of fitting
procedure. The default is gaussian, which boils down to a least squares fit. The alternative
is a sometimes more robust symmetric option.
To illustrate the sensitivity of the graph to some of these settings, Figure 15.10 shows the
effect of setting the span to 0.50, for a more local fit.
Figure 15.11 illustrates the effect of a linear fit vs a quadratic fit (with the same span of 0.5).
In practice, some experimentation may be necessary to yield the most effective graph.
Finally, Figure 15.12 illustrates a smoothed distance scatter plot for two variables, technical
input efficiency in 2016, TE_IN (2016) and loan loss provision LLP (2016). The span is
set to 0.5, with a quadratic function and the vertical axis truncated at 2. The graph shows a
gradually increasing function up to about 70 km, after which it is more or less horizontal.
This suggests the presence of a spatial relationship of the tuples of the two variables observed
within this range.
6 The netlib implementation uses a tricubic weighting scheme, i.e., (1 − (distance/maximum distance)^3)^3.
Part V
16
LISA and Local Moran

This chapter is the first in a series where attention shifts to the detection of the location
of clusters and spatial outliers. These are collections of observations that are surrounded
by neighbors that are more similar to them or more different from them than would be
expected under spatial randomness. The general principle by which this is approached is
based on the concept of Local Indicators of Spatial Association, or LISA (Anselin, 1995).
I begin by outlining the basic idea underlying the concept of LISA. This is followed by an
in-depth coverage of its most common application, in the form of the Local Moran statistic.
This statistic becomes a powerful tool to detect hot spots, cold spots, as well as spatial
outliers when combined with the classification of spatial autocorrelation in the Moran scatter
plot. The ultimate result is a local cluster map. An extensive discussion of its properties and
interpretation is provided, with special attention to the notion of significance.
To illustrate these methods, I will employ the Oaxaca Development sample data set.
Toolbar Icons
For each different measure of attribute similarity f , a different statistic for global spatial
autocorrelation results. Consequently, there will be a corresponding LISA for each such
global statistic. First, the local counterpart of Moran’s I is considered.
16.3.1 Formulation
As discussed in Chapter 13, the Moran’s I statistic is obtained as:
I = \frac{\sum_i \sum_j w_{ij} z_i z_j / S_0}{\sum_i z_i^2 / n}, \qquad (16.1)
with the variable of interest (z) expressed as deviations from the mean, and S0 as the sum of
all the weights. In the row-standardized case, the latter equals the number of observations,
n. As a result, as shown in the discussion of the Moran scatter plot, the Moran’s I statistic
simplifies to:
I = \frac{\sum_i \sum_j w_{ij} z_i z_j}{\sum_i z_i^2}.
Using the logic just outlined, a corresponding Local Moran statistic would consist of the
component in the double sum that corresponds to each observation i, or:
I_i = \frac{\sum_j w_{ij} z_i z_j}{\sum_i z_i^2}.

In this expression, the denominator is fixed and can thus further be ignored. To keep the notation simple, it can be replaced by a, so that the Local Moran expression becomes a \sum_j w_{ij} z_i z_j. After some re-arranging, a simple expression is:

I_i = a \times z_i \sum_j w_{ij} z_j,
or, the (scaled) product of the value at location i with its spatial lag, the weighted sum or
the average of the values at neighboring locations.
A special case occurs when the variable z is fully standardized. Then its variance \sum_i z_i^2 / n = 1, so that \sum_i z_i^2 = n. Consequently, the global Moran's I can be written as:

I = \sum_i \left( a \, z_i \sum_j w_{ij} z_j \right) = \frac{1}{n} \sum_i \left( z_i \sum_j w_{ij} z_j \right),
or the average of the local Moran’s I. This case illustrates a direct connection between the
local and the global Moran’s I.
Significance can be based on an analytical approximation, but, as argued in Anselin (1995),
this is not very reliable in practice.1 A preferred approach consists of a conditional per-
mutation method. This is similar to the permutation approach considered in the Moran
scatter plot, except that the value of each zi is held fixed at its location i. The remaining n − 1
z-values are then randomly permuted to yield a reference distribution for the local statistic,
one for each location.
The randomization operates in the same fashion as for the global Moran’s I (see Section
13.5.2.2), except that the permutation is carried out for each observation in turn. The result
is a pseudo p-value for each location, which can then be used to assess significance. Note
that this notion of significance is not the standard one, and should not be interpreted that
way (see the discussion in section 16.5).
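The conditional permutation scheme can be sketched as follows (illustrative numpy code, not GeoDa's implementation; z is assumed standardized and w row-standardized, and the direction of the one-sided count follows the sign of the local statistic):

```python
import numpy as np

def local_moran(z, w, permutations=999, seed=123456):
    rng = np.random.default_rng(seed)
    n = len(z)
    li = z * (w @ z)                        # I_i, scaling constant a omitted
    p_sim = np.empty(n)
    for i in range(n):
        others = np.delete(z, i)            # hold z_i fixed at location i
        wi = np.delete(w[i], i)             # weights from i to the rest
        ref = np.array([z[i] * (wi @ rng.permutation(others))
                        for _ in range(permutations)])
        m = (ref >= li[i]).sum() if li[i] >= 0 else (ref <= li[i]).sum()
        p_sim[i] = (m + 1) / (permutations + 1)
    return li, p_sim
```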
1 Further discussion of analytical results can be found in Sokal et al. (1998a), for an asymptotic approach,
Assessing significance in and of itself is not that useful for the Local Moran. However, when
an indication of significance is combined with the quadrant location of each observation
in the Moran Scatter plot, a very powerful interpretation becomes possible. The combined
information allows for a classification of the significant locations as High-High and Low-Low
spatial clusters, and High-Low and Low-High spatial outliers. It is important to keep in mind
that the reference to high and low is relative to the mean of the variable, and should not be
interpreted in an absolute sense. The notions of clusters and outliers are considered more
in-depth in section 16.4.
16.3.2 Implementation
The Univariate Local Moran’s I is started from the Cluster Maps toolbar icon, the
right-most icon in the spatial correlation group, highlighted in Figure 16.1. This brings up a
drop-down list, with the univariate Local Moran as the top-most item. Alternatively, this
option can be selected from the main menu, as Space > Univariate Local Moran’s I.
Either approach brings up the familiar Variable Settings dialog which lists the available
variables as well as the default weights file, at the bottom (e.g., oaxaca_q in the example).
As in Chapter 14, the variable is p_PHA(2010), the 2010 percentage of the municipal
population with access to health care. In the example, this variable is time enabled, which
results in the selected year (2010) being listed in the dialog as well.
As a reference, a box map reflecting the spatial distribution of this variable is shown in the
left-hand panel of Figure 14.1 (see also the left-hand panels in Figures 16.6 and 16.8 in this
Chapter).
The significance map on the left shows the locations with a significant local statistic, with
the degree of significance reflected in increasingly darker shades of green. The map starts
with p < 0.05 and shows all the categories of significance that are meaningful for the given
number of permutations. In the example, since there were 999 permutations, the smallest
pseudo p-value is 0.001, with nine such locations (the darkest shade of green).
The cluster map augments the significant locations with an indication of the type of spatial
association, based on the location of the value and its spatial lag in the Moran scatter plot
(see section 16.4). In this example, all four categories are represented, with dark red for the
High-High clusters (40 in the example), dark blue for the Low-Low clusters (48 locations),
light blue for the Low-High spatial outliers (12 locations) and light red for the High-Low
spatial outliers (16 locations).
that useful. The clusters are identified by an integer that designates the type of spatial
association: 0 for non-significant (for the current selection of the cut-off p-value, e.g., 0.05 in
the example), 1 for High-High, 2 for Low-Low, 3 for Low-High and 4 for High-Low.
The default variable names would typically be changed, especially when more than one
variable is considered in an analysis (or different spatial weights for the same variable).
16.4.1 Clusters
Clusters are centered around locations with a significant positive Local Moran statistic, i.e.,
locations with similar neighbors. In the terminology of the Moran scatter plot, these are
either High-High or Low-Low locations.
Figure 16.4 illustrates the connection between the Moran scatter plot and the cluster map.
On the left, the 230 observations in the High-High quadrant of the Moran scatter plot for
p_PHA (2010) (using queen contiguity) are selected. Through the process of linking, the
corresponding observations are highlighted in the cluster map on the right. It is clear that
only a fraction of the locations in the High-High quadrant are actually significant (42),
shown as the bright red municipalities in the map. The non-significant ones are identified in
a grey color, as illustrated by the locations within the red rectangle on the map.
The reverse logic works as well, as shown in Figure 16.5. The High-High locations are selected
in the cluster map on the right (by clicking on the red rectangle in the legend) and the
corresponding 42 observations are highlighted in the Moran scatter plot on the left. Again,
this confirms that just belonging to the High-High quadrant in the Moran scatter plot does
not imply significance.
A similar exercise can be carried out for observations in the Low-Low quadrant, which also
correspond with positive local spatial autocorrelation.
The interpretation of clusters is not always that straightforward (see also Section 16.5.3),
especially after inspecting a choropleth map of the variable of interest. In some instances, it
is very clear that an observation is selected as High-High when it is in the top category and
surrounded by neighbors that are also in the top category. For example, in Figure 16.6, the
location selected within the black rectangle on the cluster map (with the arrow pointed at
it) belongs to the top quartile in the box map on the left. In addition, its neighbors all also
belong to the top quartile, hence a natural interpretation as a cluster.
However, in other instances, the identification as a High-High (or similarly, Low-Low) cluster
may seem counter-intuitive. Consider the High-High location identified with the blue arrow
on the cluster map. In the box map to the left, that location is shown as belonging to the
second quartile, i.e., below the median, which would seem to contradict the characterization as high. However, as it turns out, for 2010 the median value of access to health care (56.9) is
above the mean (53.4), so that it is possible for an observation to be below the median, but
still above the mean. Moreover, the simplification of the continuous distribution into a small
number of discrete map categories may be confusing. In this particular example, the spatial
lag is well above the mean, even though the neighbors of the location in question consist of
a mix of municipalities from all four categories on the map. It is important to keep in mind
that the spatial lag is the average of the neighbors, and its value can be easily influenced by
a few extreme values. In addition, the categories on the map may contain observations with
very disparate values (depending on the range in the given category), which may yield less
than intuitive insights into the similarity of neighbors.
Checking the False Discovery Rate radio button will update the significance and cluster
maps accordingly, displaying only 17 significant locations, as shown in Figure 16.13. At this
point, the High-High and Low-Low clusters are again fairly balanced (respectively, 6 and
7), but there are no longer any Low-High spatial outliers (there are 4 remaining High-Low
outliers).
Also, for the Bonferroni and FDR procedures to work properly, it is necessary to have a
large number of permutations, to ensure that the minimum p-value can be less than α/n.
Currently, the largest number of permutations that GeoDa supports is 99,999, and thus the
smallest possible p-value is 0.00001. For the Bonferroni bound to be effective, α/n must be
greater than 0.00001, or n should be less than α/0.00001. This is not due to a characteristic
of the data, but to the lack of sufficient permutations to yield a pseudo p-value that is small
enough.
In practice, this means that with α = 0.01, data sets with n > 1000 cannot have a single
significant location using the Bonferroni criterion. With α = 0.05, this value increases to
5000, and with α = 0.1 to 10,000.
The same limitation applies to the FDR criterion, since the cut-off for the first sorted observation corresponds with the Bonferroni bound (1 × α/n).
Clearly, this drives home the message that a mechanical application of p-values is to be
avoided. Instead, a careful sensitivity analysis should be carried out, comparing the locations
of clusters and spatial outliers identified for different p-values, including the Bonferroni and
FDR criteria, as well as the associated neighbors to suggest interesting locations that may
lead to new hypotheses and help to discover the unexpected.
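The mechanics of such a sensitivity analysis are straightforward to script. The sketch below contrasts the naive, Bonferroni and FDR selections for an array p_sim of pseudo p-values (illustrative code; the FDR rule keeps all sorted p-values up to the largest rank i with p_(i) ≤ iα/n):

```python
import numpy as np

def significance_filters(p_sim, alpha=0.05):
    n = len(p_sim)
    naive = p_sim <= alpha                 # no multiple comparison correction
    bonferroni = p_sim <= alpha / n        # ineffective if alpha/n < 1/(R+1)
    order = np.argsort(p_sim)
    sorted_p = p_sim[order]
    thresholds = np.arange(1, n + 1) * alpha / n
    below = np.nonzero(sorted_p <= thresholds)[0]
    fdr = np.zeros(n, dtype=bool)
    if below.size > 0:
        fdr[order[:below[-1] + 1]] = True  # all up to the largest valid rank
    return naive, bonferroni, fdr
```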
The third option selects both the cores and their neighbors (as defined by the spatial weights).
This is most useful to assess the spatial range of the areas identified as clusters. For example,
in Figure 16.14, the Cores and Neighbors option is applied to the significant locations
obtained using the FDR criterion. The non-significant neighbors are shown in grey on the
cluster map on the right, with all the matching cores and neighbors highlighted in the box
map on the left.
A careful application of this approach provides insight into the spatial range of interaction
that corresponds with the respective cluster cores.
the right (i.e., for more indigenous communities), and the bulk of the High-High clusters on
the left.
It should be noted that this example is purely illustrative of the functionality available
through the conditional cluster map feature, rather than as a substantive interpretation. As
always, the main focus is on whether the micromaps suggest different patterns, which would
imply an interaction effect with the conditioning variable(s).
17
Other Local Spatial Autocorrelation Statistics
This chapter continues the exploration of local spatial autocorrelation statistics. First,
several extensions of the Local Moran are introduced, such as the Median Local Moran, the
Differential Local Moran and a specialized version that deals with the variance instability
in rates or proportions, the EB Local Moran. In addition, a local version of Geary’s c
statistic is discussed. The chapter also takes an in-depth look at an important class of local
statistics introduced by Getis and Ord. These statistics are not a LISA in a strict sense, but
nevertheless are effective tools to discover local hot spots and cold spots.
The chapter closes with a brief discussion of the comparative merits of the local statistics
covered so far.
These methods are again illustrated by means of the Oaxaca Development sample data set.
The Median Local Moran replaces the average of the neighboring values that underlies the spatial lag by their median:

I_i^M = z_i \times \mathrm{med}(z_j, j \in N_i),

where N_i is the neighbor set of location i (i.e., those locations for which w_{ij} \neq 0).
Inference and interpretation are identical to that for the original Local Moran, based on a
conditional permutation approach (see Section 16.5).
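A minimal sketch of the statistic itself (neighbors is assumed to map each observation index to the indices of its neighbors, i.e., the set N_i; inference would proceed by conditional permutation as in the Chapter 16 sketch):

```python
import numpy as np

def median_local_moran(z, neighbors):
    """I_i^M = z_i times the median of the neighboring z values."""
    return np.array([z[i] * np.median(z[neighbors[i]])
                     for i in range(len(z))])
```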
17.2.1.1 Implementation
The Median Local Moran is invoked as the second item in the first group of the Cluster Maps drop-down list from the toolbar, or as Space > Univariate Median Local Moran's I from the menu.
This brings up the usual variable selection dialog. All the options are the same as for the
conventional Local Moran, i.e., the randomization, significance filter and saving of the results
(see Chapter 16).
These features are illustrated using the variable p_PHA(2010) (or, p_PHA10) from the
Oaxaca data set, with queen contiguity. The significance map for 99,999 permutations is
shown in Figure 17.1, with the result for the Median Local Moran in the right-hand panel,
compared to the conventional Local Moran on the left.
Overall, with a p < 0.05 cut-off, there are 18 fewer significant locations for the Median
Local Moran compared to the conventional version. These are distributed over the different
categories as 10 fewer for p < 0.05, 5 fewer for p < 0.01, 2 fewer for p < 0.001 and one
fewer at p < 0.00001, with no change at p < 0.0001. However, the changes work in both directions,
with some locations becoming significant for the Median Local Moran, that were not for
the conventional Local Moran, and the other way around. In addition, there are changes in
the category of significance in both directions (both less and more significant). For example,
the blue highlight in Figure 17.1 points to a municipality that was not significant for the
conventional Local Moran, but becomes so at p < 0.01 for the Median Local Moran. In
contrast, the red highlighted location was significant for the conventional Local Moran and
becomes insignificant for the median version.
(Such a selection can be replicated in the data table by first selecting the observations that meet the criterion for the variable, followed by a Select From Current Selection for the median spatial lag.)
This is illustrated in Figure 17.3. In this case, since the median is above the mean for both axes, the selection
includes observations that fall in different quadrants for the conventional Local Moran.
The same principle is applied to the spatial outliers, illustrated in Figures 17.4 and 17.5.
Again, the classification into the respective categories differs from what the conventional
scatter plot would yield. The significant observations correspond with locations classified as
spatial outliers in the cluster map on the right.
$$I_i^D = a\,(y_{i,t} - y_{i,t-1}) \sum_j w_{ij} (y_{j,t} - y_{j,t-1}).$$
The scaling constant a can be ignored. In essence, this is the same as the conventional
Local Moran applied to the difference, but the implementation is based on selecting the two
variables and computing the difference under the hood, rather than computing the difference
separately.
As before, inference is based on conditional permutation. All the usual caveats hold about
multiple comparisons and the choice of a p-value. In all respects, the interpretation is the
same as for the conventional Local Moran statistic.
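The logic is easy to emulate outside GeoDa. The sketch below (a dense weights matrix is used purely for exposition, and the function name is mine) standardizes the change variable and applies the usual conditional permutation strategy:

```python
import numpy as np

def differential_local_moran(y_t, y_t1, w, permutations=999, seed=12345):
    """Sketch of the Differential Local Moran: a conventional Local Moran
    applied to the standardized difference y_t - y_{t-1}; `w` is a
    row-standardized (n, n) numpy array."""
    rng = np.random.default_rng(seed)
    d = np.asarray(y_t, float) - np.asarray(y_t1, float)
    z = (d - d.mean()) / d.std()
    n = len(z)
    stat = z * (w @ z)                            # spatial lag of the change
    p_sim = np.empty(n)
    for i in range(n):
        k = np.count_nonzero(w[i])
        others = np.delete(np.arange(n), i)
        ref = np.array([z[i] * np.mean(z[rng.choice(others, k, replace=False)])
                        for _ in range(permutations)])
        p_sim[i] = (np.sum(np.abs(ref) >= abs(stat[i])) + 1) / (permutations + 1)
    return stat, p_sim
```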
17.2.2.1 Implementation
The Differential Local Moran is invoked as the fourth item in the first group of the Cluster
Maps drop-down list from the toolbar, or as Space > Differential Local Moran's I
from the menu.
As in the global case, the variables under consideration must be time-enabled (grouped) in
the data table. The variable selection dialog is slightly different from the standard interface,
but the same as for the differential Moran scatter plot.
Continuing to use the (time-enabled) Oaxaca data set, first, the variable of interest is selected
(here, p_PHA), and then the two time periods are chosen (here, 2020 and 2010). Note
that the system is agnostic about the actual time periods, so that any combination can be
selected. The statistic is computed for the difference between the time period specified as
the first item and that given as the second item. In the example, the spatial weights are
oaxaca_q.
To provide some context, Figure 17.7 shows the cluster maps for the conventional Local
Moran for p_PHA in 2010 and 2020 (using 99,999 permutations and p < 0.05). The local
patterns in the two years are very different, with many High-High clusters from 2010 labeled
as Low-Low in 2020, and vice versa. Note that the classification is relative to the mean,
which has improved considerably between the two years (from 53.4% to 75.8%).
The significance map and cluster map for the Differential Local Moran (99,999 permutations
with p < 0.05) are shown in Figure 17.8. Note how the High-High clusters correspond to
locations that were Low-Low in 2010, but High-High in 2020, suggesting a grouping of large
increases. Reversely, the Low-Low clusters for the difference correspond to locations that
were High-High in 2010, but Low-Low in 2020, suggesting a grouping of small increases, or
even decreases.
Figure 17.7: Cluster maps for Local Moran in 2010 and 2020
All the options, such as the randomization setting and the significance filter, are the same as for the
conventional Local Moran and will not be discussed further here. The only slight difference is
in how the results are saved. Similar to the functionality for the differential Moran scatter
plot, the Save Results option includes an item to save the actual change variable (in raw
form, not in standardized form). In the dialog, this corresponds to the Diff Values item
with default variable name DIFF_VAL2. The other options are the same as for all local
spatial autocorrelation statistics, i.e., the value of the statistic (LISA_I), cluster type
(LISA_CL) and p-value (LISA_P).
Once the difference is saved as a separate variable, it can be used in a conventional univariate
Local Moran operation.
17.2.2.2 Interpretation
The significance and cluster maps for a Differential Local Moran identify the locations
where the change in the variable over time is matched by similar/dissimilar changes in the
surrounding locations. It is important to keep in mind that the focus is on change, and there
is no direct connection to whether this change is from high or from low values.
Two situations can be distinguished, depending on whether the change variable takes on
both positive and negative values, or when all the changes are of the same sign (i.e., all
observations either increase or decrease over time).
When both positive and negative change values are present, the High-High locations will
tend to be locations with a large increase (positive change), surrounded by locations with
similar large increases. The Low-Low locations will be observations with a large decrease
(negative change), surrounded by locations with similar large decreases. Spatial outliers will
be locations where an increase is surrounded by a decrease and vice versa.
When all changes are of the same sign, the interpretation of High-High and Low-Low is
relative to the mean. When all the changes are positive, large increases surrounded by other
large increases will be labeled High-High, whereas small(er) increases surrounded by other
small(er) increases will be labeled Low-Low. When all the signs are negative, small(er)
decreases surrounded by other small(er) decreases will be labeled as High-High, whereas
large decreases surrounded by other large decreases will be labeled Low-Low.
$$I_i^{EB} = a\, z_i \sum_j w_{ij} z_j.$$
The standardization of the raw rate $r_i$ is the same as before and is repeated here for
completeness (for a more detailed discussion, see Section 14.2.2):
$$z_i = \frac{r_i - \beta}{\sqrt{\alpha + (\beta / P_i)}},$$
with β as an estimate of the mean and the denominator as an estimate of the standard
error.2
All inference and interpretation are the same as for the conventional case and are not further
pursued here.
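The rate standardization itself is straightforward to reproduce. A minimal sketch follows, assuming (consistent with footnote 2) that the denominator is the square root of the variance estimate and that a negative α is floored at zero; the standardized rates then feed into a conventional Local Moran:

```python
import numpy as np

def eb_standardize(events, population):
    """Sketch of the Empirical Bayes standardization z_i = (r_i - beta) /
    sqrt(alpha + beta / P_i) used by the EB Local Moran."""
    O = np.asarray(events, float)
    P = np.asarray(population, float)
    r = O / P
    beta = O.sum() / P.sum()                      # overall mean rate
    n = len(P)
    alpha = (P * (r - beta) ** 2).sum() / P.sum() - beta / (P.sum() / n)
    alpha = max(alpha, 0.0)                       # negative estimate set to zero
    return (r - beta) / np.sqrt(alpha + beta / P)
```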
17.2.3.1 Implementation
The local Moran functionality for standardized rates is invoked as the last item in the Moran
group on the Cluster Maps toolbar icon. Alternatively, it can be selected from the menu
as Space > Local Moran’s I with EB Rate.
Since the rate standardization is computed as part of the operation, the variable selection
interface is similar to that used for rate maps. To illustrate this feature, the rate is computed
with DIS20 (number of people with disabilities in 2020) as the Event Variable, and
PTOT20 (total population in 2020) as the Base Variable, as done earlier in Section
14.2.2.1. A box map of the raw rate is shown in the left-hand panel of Figure 12.9. The
spatial weights are queen contiguity, oaxaca_q.
2 To recap, $\beta = \sum_i O_i / \sum_i P_i$, where $O_i$ is the number of events at i and $P_i$ is the population at risk.
The estimate of $\alpha = [\sum_i P_i (r_i - \beta)^2]/P - \beta/(P/n)$, with $P = \sum_i P_i$ and n as the total number of observations, such that
P/n is the average population. Note that the estimate of α can be negative, in which case it is set to zero.
The resulting significance map is shown in the right-hand panel of Figure 17.9, next to the
corresponding map for the crude rate (for 99,999 permutations with p < 0.05). Compared
to the map for the crude rate, there is one more significant location. Interestingly, the main
difference is at the high end of the significance, where the EB Local Moran has slightly more
significant locations. For example, the location highlighted in the blue rectangle is significant
at p < 0.00001 for EB Local Moran, but at p < 0.001 for the crude rate. Similarly, the
location highlighted in the red rectangle is significant at p < 0.001 for the EB Local Moran,
but at p < 0.01 for the crude rate. Since the same number of permutations is used in both
cases, the p-values are comparable. Overall, however, the differences are minimal and rather
subtle, confirming the findings for the Moran scatter plot.
The differences between the cluster map for the crude rate and for the EB Local Moran,
shown in Figure 17.10, are equally subtle. The EB Local Moran cluster map has three more
locations in the High-High group, and one less in each of the Low-High and High-Low spatial
outliers. Otherwise, the identified locations and groups match exactly.
The Save Results options are the same as for all local spatial autocorrelation statistics, i.e., the value of the
statistic, cluster type and p-value, with the same default variable names.
or as:
$$c_i = (1/m_2) \sum_j w_{ij} (x_i - x_j)^2,$$
with $m_2 = \sum_i z_i^2 / n$. Again, because of the squared difference, there is no need to standardize x.
The sum over all the local statistics is:
$$\sum_i c_i = n \left[ \sum_i \sum_j w_{ij} (x_i - x_j)^2 / \sum_i z_i^2 \right].$$
This establishes the connection between the local and the global as:
$$\sum_i c_i = \frac{2 n S_0}{n - 1}\, c.$$
Hence, the Local Geary is a LISA statistic in the sense established in Section 16.2.
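The definition and the LISA property are simple to verify numerically. The following sketch (dense weights for exposition only, illustrative function name) computes the Local Geary and checks that the local statistics sum to $2nS_0/(n-1)$ times the global c:

```python
import numpy as np

def local_geary(y, w):
    """Sketch of the Local Geary c_i = (1/m2) * sum_j w_ij (x_i - x_j)^2,
    with m2 = sum_i z_i^2 / n; `w` is a dense (n, n) weights array."""
    z = np.asarray(y, float) - np.mean(y)
    m2 = (z ** 2).sum() / len(z)
    diff2 = (z[:, None] - z[None, :]) ** 2        # squared attribute differences
    return (w * diff2).sum(axis=1) / m2

# numerical check of the LISA property: sum_i c_i = 2 n S0 / (n - 1) * c
rng = np.random.default_rng(0)
n = 25
y = rng.normal(size=n)
w = (rng.random((n, n)) < 0.2).astype(float)
np.fill_diagonal(w, 0.0)
z = y - y.mean()
diff2 = (z[:, None] - z[None, :]) ** 2
c_global = (n - 1) * (w * diff2).sum() / (2 * w.sum() * (z ** 2).sum())
assert np.isclose(local_geary(y, w).sum(), 2 * n * w.sum() / (n - 1) * c_global)
```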
Closer examination reveals that the Local Geary statistic consists of a weighted sum of the
squared distance in attribute space for the geographical neighbors of observation i. Since
there is no cross-product involved, there is no direct relation to linear similarity. In other
words, since the Local Geary uses a different criterion of attribute similarity, it may detect
patterns that escape the Local Moran, and vice versa.
As for the Local Moran, analytical inference is based on an approximation and is generally not
very reliable. Instead, the same conditional permutation procedure as for the Local Moran
is implemented. The results are interpreted in the same way, with the caveat regarding the
p-values and the notion of significance.
17.3.1 Implementation
The Local Geary can be invoked from the Cluster Maps toolbar icon, as the first item in
the fourth block in the drop-down list. Alternatively, it can be started from the main menu,
as Space > Univariate Local Geary.
The subsequent step is the same as before, bringing up the Variable Settings dialog that
contains the names of the available variables as well as the spatial weights. Everything
operates in the same way for all local statistics.
The final dialog offers window options. In the case of the Local Geary, there is no Moran
scatter plot option, but only the Significance Map and the Cluster Map. The default is
that only the latter is checked, as before.
With the variable p_PHA(2010) (or p_PHA10), queen contiguity (oaxaca_q) and
99,999 permutations, the significance map and cluster map for the Local Geary statistic,
using p < 0.05 are shown in Figure 17.11.
The significance map uses the same conventions as before, but the cluster map is different.
There are four different types of associations, three for clusters and one for spatial outliers.
The classification is further elaborated upon in Section 17.3.2. Comparison with the results
for a Local Moran statistic is considered in Section 17.3.3.
Overall, there are 108 locations indicating positive spatial autocorrelation, contrasted with
only 16 locations for negative spatial autocorrelation.
All the options operate the same for all local statistics, including the randomization setting,
the selection of significance levels, the selection of cores and neighbors, the conditional map
option, as well as the standard operations of setting the selection shape and saving the
image.
The third case of significant positive spatial autocorrelation is illustrated in Figure 17.14.
There are six observations for which the Moran scatter plot points appear in the lower right
and upper left quadrants, suggesting a mismatch between the cross-product and squared
difference attribute measures. Those locations are classified as Other Positive in the cluster
map.
Finally, observations with significant negative spatial autocorrelation (large Local Geary)
cannot be classified by type of spatial outlier. They are labeled as Negative in the cluster
map, as shown in Figure 17.15. All but one of these can also be found in the negative spatial
autocorrelation quadrants of the Moran scatter plot, but one is in the lower left quadrant,
again suggesting a mismatch between the cross-product and squared difference.
17.3.3.1 Clusters
Figure 17.17 identifies the matches for the 51 High-High cluster locations in the Local
Geary cluster map (on the right) in the Local Moran cluster map (on the left). Only 16 of
the 42 Local Moran High-High cluster locations belong to the set identified for the Local
Geary, but there are no mismatches in the sense of Low-Low clusters identified. Instead, the
non-matches are all not significant for the Local Moran, indicated by their grey color in the
map. Similarly, the significant High-High locations in the Local Moran cluster map that are
not identified with Local Geary are not significant in the latter (not shown).
As shown in Figure 17.18, only a subset of the 48 Low-Low clusters for the Local Moran
correspond to the 51 such clusters identified by the Local Geary. The other locations are not
significant. However, in the reverse direction, there is one exception. One of the locations given
as a Low-Low cluster by the Local Moran is identified as a spatial outlier (negative) by the
Local Geary. The location in question is highlighted in a red rectangle in Figure 17.16. The
others are again not significant for Local Geary.
Figure 17.19: Local Geary and Local Moran – Other positive clusters
Finally, the six other spatial clusters identified in the Local Geary cluster map correspond
to non-significant locations for the Local Moran, as shown in Figure 17.19.
A careful consideration of the differences and similarities between the identified cluster
locations for the two local statistics sheds light on the extent to which the association may
be nonlinear. This should be followed by an in-depth inspection of the actual statistics and
associated attribute values.
In contrast, the $G_i^*$ statistic includes the value $x_i$ in both numerator and denominator:
$$G_i^* = \frac{\sum_j w_{ij} x_j}{\sum_j x_j},$$
where the sums now include j = i.
Note that in this case, the denominator is constant across all observations and simply consists
of the total sum of all values in the data set. The statistic is the ratio of the average values
in a window centered on an observation to the total sum of observations.
The interpretation of the Getis-Ord statistics is very straightforward: a value larger than the
mean (or, a positive value for a standardized z-value) suggests a High-High cluster or hot
spot, a value smaller than the mean (or, negative for a z-value) indicates a Low-Low cluster
or cold spot. In contrast to the Local Moran and Local Geary statistics, the Getis-Ord
approach does not consider negative spatial autocorrelation (spatial outliers).3
Inference can be derived from an analytical approximation, as given in Getis and Ord (1992)
and Ord and Getis (1995). However, similar to what holds for the Local Moran and Local
Geary, such approximation may not be reliable in practice. Instead, conditional random
permutation can be employed, using an identical procedure as for the other statistics.
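A bare-bones version of both statistics is sketched below for binary weights; the function name is illustrative, not part of GeoDa. Conditional permutation inference proceeds exactly as in the sketches for the Local Moran variants:

```python
import numpy as np

def getis_ord(y, w_binary, star=True):
    """Sketch of the Getis-Ord statistics: G*_i includes the value at i in
    both numerator and denominator, G_i excludes it."""
    y = np.asarray(y, float)
    w = np.array(w_binary, dtype=float)
    if star:
        np.fill_diagonal(w, 1.0)                  # i belongs to its own window
        return (w @ y) / y.sum()                  # denominator: total sum
    np.fill_diagonal(w, 0.0)
    return (w @ y) / (y.sum() - y)                # x_i excluded for G_i
```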
17.4.1 Implementation
The implementation of the Getis-Ord statistics is largely identical to that of the other local
statistics. The Local G and Local G* options can be selected from the second group in
the drop-down menu generated by the Cluster Maps toolbar icon. Alternatively, they can
be invoked from the menu as Space > Local G or Space > Local G*.
The next step brings up the Variable Settings dialog, followed by a choice of windows to
be opened. The latter is again slightly different from the previous cases. The default is to use
row-standardized weights and to generate only the Cluster Map. The Significance
Map option needs to be invoked explicitly by checking the corresponding box. In contrast
to previous cases, it is also possible to compute the Getis-Ord statistics using binary (not
row-standardized) spatial weights, yielding a simple count of the neighboring values.
3 When all observations for a variable are positive, as is the case in our examples, the G statistics
are positive ratios less than one. Large ratios (more precisely, less small values since all ratios are small)
correspond with High-High hot spots, small ratios with Low-Low cold spots.
Continuing with the same example, using p_PHA10 (or p_PHA(2010)) from the Oaxaca
data set, with queen contiguity weights, 99,999 permutations and p < 0.05 yields the
significance and cluster maps for the G∗i statistic shown in Figure 17.22. In this example,
the results are identical to those for the Gi statistic, which is not shown separately.
Overall, 117 locations are identified as significant, the same total as for the Local Moran. In
contrast to the latter, the cluster map for the Getis-Ord statistics only takes on two colors,
red for hot spots (High-High), and blue for cold spots (Low-Low).
The options menu contains the same items as for the other local statistics, such as the
randomization setting, the selection of significance levels, the selection of cores and neighbors,
the conditional map option, as well as the standard operations of setting the selection shape
and saving the image.
classified as High-Low spatial outliers according to the Local Moran. The extent to which
this affects the interpretation of the spatial extent of hot spots or cold spots depends on
the relative importance of spatial outliers in the sample, but it can lead to quite different
conclusions between the two types of statistics.
17.4.3.1 Clusters
Figures 17.25 and 17.26 show the hot spots and cold spots selected in the Getis-Ord cluster
map on the right and identify the corresponding locations in the Local Moran cluster map
on the left. As mentioned, all the significant locations match exactly, but their classification
does not.
In Figure 17.25, the 54 hot spots in the Getis-Ord cluster map match the 42 High-High
clusters in the Local Moran cluster map as well as the 12 Low-High spatial outliers, similar
to what the Moran scatter plot in Figure 17.23 indicated.
332 Other Local Spatial Autocorrelation Statistics
On the other hand, in Figure 17.26, the 63 cold spots in the Getis-Ord cluster map correspond
with the 48 Low-Low clusters in the Local Moran cluster map, in addition to the 15 High-Low
spatial outliers. Again, this confirms the indication from the Moran scatter plot in Figure
17.24.
statistics) are likely interesting locations. On the other hand, observations that move in and
out of significance as the criteria change are likely spurious. The only way to address this
problem confidently is through a careful sensitivity analysis.
18
Multivariate Local Spatial
Autocorrelation
In this chapter, the concept of local spatial autocorrelation is extended to the multivariate
domain. This turns out to be particularly challenging, due to the difficulty in separating
spatial effects from the pure attribute correlation among multiple variables.
Three methods are considered. First, a bivariate version of the Local Moran is introduced,
which, similar to what is the case for its global counterpart, needs to be interpreted with
great caution. Next, an extension of the Local Geary statistic to the multivariate Local
Geary is considered, proposed in Anselin (2019a). The final approach is not based on an
extension of univariate statistics, but uses the concept of distance in attribute space, in the
form of a Local Neighbor Match Test (Anselin and Li, 2020).
To illustrate these methods, the Chicago SDOH sample data set is employed. It contains
observations on socio-economic determinants of health in 2014 for 791 census tracts in
Chicago (for a detailed discussion, see Kolak et al., 2020).
$$d_{ij}^2 = \|x_i - x_j\|^2 = \sum_{h=1}^{k} (x_{ih} - x_{jh})^2,$$
with $x_i$ and $x_j$ as vectors of observations on k variables. In some expressions, the squared distance will be
preferred; in others, the actual distance ($d_{ij}$, its square root) will be used.
In this approach, the overarching objective is to identify observations that are close in both
multi-attribute space and in geographical space, i.e., those pairs of observations where the
two types of distances match.
$$I_i^B = x_i \sum_j w_{ij} y_j.$$
A special case of the Bivariate Local Moran statistic is when the same variable is compared
in neighboring locations at different points in time. The most meaningful application is
where one variable is for time period t, $z_t$, and the other variable is for the neighbors in the
previous time period, $\sum_j w_{ij} z_{t-1}$. This formulation measures the extent to which the value
at a location in a given time period is correlated with the values at neighboring locations in
a previous time period, or an inward influence. An alternative view is to consider $z_{t-1}$ and
the neighbors at the current time, $\sum_j w_{ij} z_t$. This would measure the correlation between a
location and its neighbors at a later time, or an outward influence. The first specification is
more accepted, as it fits within a standard space-time regression framework.
Inference proceeds similarly to the global case, but is now conditional upon the tuple $(x_i, y_i)$
observed at location i. This somewhat controls for the correlation between x and y at i. The
remaining values for y are randomized and the statistic is recomputed to create a reference
distribution. The usual caveats regarding the interpretation of significance apply here as
well.
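A sketch of this conditional permutation logic is given below (dense row-standardized weights for exposition, illustrative function name); note how the value of y at i is never included in the randomized neighbor sets:

```python
import numpy as np

def bivariate_local_moran(x, y, w, permutations=999, seed=12345):
    """Sketch of I_i^B = x_i * sum_j w_ij y_j with conditional permutation:
    the tuple (x_i, y_i) is held fixed at i and the remaining y values are
    randomized; `w` is a row-standardized dense array."""
    rng = np.random.default_rng(seed)
    zx = (np.asarray(x, float) - np.mean(x)) / np.std(x)
    zy = (np.asarray(y, float) - np.mean(y)) / np.std(y)
    n = len(zx)
    stat = zx * (w @ zy)
    p_sim = np.empty(n)
    for i in range(n):
        k = np.count_nonzero(w[i])
        others = np.delete(np.arange(n), i)
        ref = np.array([zx[i] * np.mean(zy[rng.choice(others, k, replace=False)])
                        for _ in range(permutations)])
        p_sim[i] = (np.sum(np.abs(ref) >= abs(stat[i])) + 1) / (permutations + 1)
    return stat, p_sim
```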
18.3.1 Implementation
The Bivariate Local Moran I is invoked as the third item in the drop-down list associated with
the Cluster Maps toolbar icon, or, from the menu, as Space > Bivariate Local Moran’s
I. The next dialog is the customary Variable Settings, which now has two columns, one for
First Variable (X), and one for Second Variable (Y). Since the Bivariate Local Moran
is not symmetric, the order in which the variables are specified matters. At the bottom of
the dialog, the Weights need to be selected.
To illustrate this feature, two variables are used from the Chicago SDOH sample data set:
the percentage children in poverty in 2014 (ChldPvt14), and a crowded housing index
(EP_CROWD). The spatial weights are nearest neighbor, with k = 6 (Chi-SDOH_k6).
Before proceeding with the actual bivariate analysis, the univariate characteristics of the
spatial distribution of each variable are considered more closely.
(The co-location maps below are constructed from the saved indicators: one uses the classification codes with a Box Map classification for the co-location map, the other the cluster codes with a LISA Map classification.)
Figure 18.3: Local Moran cluster map – Child Poverty and Crowded Housing
Figure 18.4: Co-location of Local Moran for Child Poverty and Crowded Housing
Figure 18.5: Bivariate Local Moran cluster map – Child Poverty and Crowded Housing
Arguably, the more interesting information follows from the location of the spatial outliers,
especially the Low-High outliers, of which there are 15 for child poverty-crowded housing
and 38 for crowded housing-child poverty. These are census tracts where a Low (good) value
for one indicator is surrounded by a High (bad) value for the other, or vice versa, more so
than expected under spatial randomness. This is information the univariate cluster maps
cannot provide.
18.3.1.4 Options
The Bivariate Local Moran has all the same options as the conventional Local Moran, as
detailed in Chapter 16. The cluster codes associated with saved results are the same as well.
18.3.2 Interpretation
The interpretation of the Bivariate Local Moran needs to be carried out very carefully. Aside
from the usual caveats about multiple comparisons and p-values, the association between
one variable and a different variable at neighboring locations needs to consider the in-situ
correlation between the two variables as well. As the discussion in the previous sections
illustrates, it is best to combine the bivariate analysis with a univariate analysis for each
variable. In addition, it is important to consider both directions of association. This will
tend to reveal strong local clustering among the two variables as well as instances where
their spatial patterns do not coincide in the form of bivariate spatial outliers.
neighboring locations are smaller or larger than what they would be under spatial randomness.
The former case corresponds to positive spatial autocorrelation, the latter to negative spatial
autocorrelation.
An important aspect of the multivariate statistic is that it is not simply the superposition of
univariate statistics. In other words, even though a location may be identified as a cluster
using the univariate Local Geary for each of the variables separately, this does not mean that
it is also a multivariate cluster, and vice versa. The univariate statistics deal with distances
in attribute space projected onto a single dimension, whereas the multivariate statistics are
based on distances in a higher dimensional space. The multivariate statistic thus provides an
additional perspective to measuring the tension between attribute similarity and locational
similarity.
The Multivariate Local Geary statistic is formally the sum of individual Local Geary statistics
for each of the variables under consideration. For example, with k variables, indexed by h,
the corresponding expression is:
$$c_i^M = \sum_{h=1}^{k} \sum_j w_{ij} (x_{hi} - x_{hj})^2.$$
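Since the statistic is a sum of univariate squared-difference terms, it is easy to sketch; with z-standardized columns, the per-variable scaling factor $m_2$ equals one and drops out (dense weights again for exposition only):

```python
import numpy as np

def multivariate_local_geary(X, w):
    """Sketch of the Multivariate Local Geary as the sum over variables of
    the univariate squared-difference terms; columns of X are z-standardized
    so each per-variable scaling factor m2 equals one."""
    X = np.asarray(X, float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    c = np.zeros(len(Z))
    for h in range(Z.shape[1]):
        diff2 = (Z[:, h][:, None] - Z[:, h][None, :]) ** 2
        c += (w * diff2).sum(axis=1)
    return c
```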
Figure 18.6: Box map and Local Geary cluster map – Uninsured
18.4.1 Implementation
The Multivariate Local Geary is invoked as the second item in the third block in the drop-
down list associated with the Cluster Maps toolbar icon, or, from the menu, as Space
> Multivariate Local Geary. The next dialog is a slightly different Multi-Variable
Settings, which now allows for the selection of several variables (not limited to two). As
usual, the spatial Weights need to be specified at the bottom of the dialog.
To illustrate this functionality, three variables are used from the Chicago SDOH sample
data set. In addition to the same two as for the Bivariate Local Moran, the percentage
children in poverty in 2014 (ChldPvt14), and a crowded housing index (EP_CROWD),
the percentage without health insurance (EP_UNINSUR) is included as well. The latter
has a correlation of 0.486 with child poverty and 0.632 with crowded housing. The spatial
weights are again nearest neighbor, with k = 6 (Chi-SDOH_k6).
As before, the univariate properties of these variables are considered first, but now from the
perspective of the Local Geary.
Figure 18.7: Local Geary cluster maps – Child Poverty, Crowded Housing
18.4.1.3 Options
All the options of the significance and cluster maps considered before remain the same.
Specifically, the Save Results option offers the same choices as for the univariate Local
Geary (Section 17.3.1.1).
Figure 18.9: Multivariate Local Geary cluster map – Child Poverty, Crowded Housing,
Uninsured
Figure 18.10: Multivariate Local Geary cluster map and Univariate Local Geary co-location
18.4.2 Interpretation
To shed further light on where interesting locations can be found, the results for a range
of different p-value cut-offs can be investigated. In Figure 18.10, the 23 observations that
remain significant under the Bonferroni bound (p = 0.0000126) are linked to the univariate
co-location map from Figure 18.8. Only about half (12) of these locations match between
the two, pointing to other dimensions of association beyond the univariate overlap.
Further insight can be gained by focusing closer on those observations that are identified by
the Multivariate Local Geary, but not by the univariate overlap. This then becomes much
more of an exploratory exercise than a clean p-value interpretation. It should therefore be
carried out with caution.
18.5 Local Neighbor Match Test
The probability that the two k-nearest neighbor sets have exactly v neighbors in common follows the hypergeometric distribution:
$$\mathrm{Prob}[v] = \frac{C(k, v)\, C(N-k, k-v)}{C(N, k)},$$
where N = n − 1 (one less than the number of observations), k is the number of nearest
neighbors considered in the connectivity graphs, v is the number of neighbors in common
and C is the combinatorial operator. Alternatively, a pseudo p-value can be computed based
on a permutation approach, although that is not pursued in GeoDa.
The degree of overlap can be visualized by means of a cardinality map, a special case of a
unique values map. In this map, each location indicates how many neighbors the two weights
matrices have in common. In addition, different p-value cut-offs can be employed to select
the significant locations, i.e., where the probability of a given number of common neighbors
falls below the chosen p.
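The entire procedure takes only a few lines with standard tools. The sketch below uses scipy's k-d tree for both neighbor searches and its hypergeometric distribution for the p-values, assuming distinct point locations (so that each point is its own nearest neighbor and can be dropped); the function name is illustrative:

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.stats import hypergeom

def neighbor_match_test(coords, X, k=6):
    """Sketch of the Local Neighbor Match Test: overlap between k-nearest
    neighbors in geographic space and in standardized attribute space,
    with a hypergeometric p-value per location."""
    X = np.asarray(X, float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    n = len(Z)
    geo = cKDTree(coords).query(coords, k=k + 1)[1][:, 1:]   # drop self
    att = cKDTree(Z).query(Z, k=k + 1)[1][:, 1:]
    v = np.array([len(set(g) & set(a)) for g, a in zip(geo, att)])
    # P[at least v common neighbors]: k draws from N = n - 1 candidates,
    # of which k are "successes" (the geographic neighbors of i)
    p = hypergeom.sf(v - 1, n - 1, k, k)
    return v, p
```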
In practice, the value of k may need to be adjusted (increased) in order to find meaningful
results. In addition, the k-nearest neighbor calculation becomes increasingly difficult to
implement in very high attribute dimensions, due to the empty space problem (Section
8.2.1).
The idea of matching neighbors can be extended to distances among variables obtained from
dimension reduction techniques, such as multidimensional scaling, covered in Volume 2 (see
also the more detailed discussion in Anselin and Li, 2020).
In contrast to the approach taken for the Multivariate Local Geary, the local neighbor match
test focuses on the pairwise distances directly, instead of converting these into a weighted
average. Both measures have in common that they focus on squared distances in attribute
space, rather than a cross-product as in the Moran statistics.
18.5.1 Implementation
The Nearest Neighbor Match test is invoked as the last item in the drop-down list associated
with the Cluster Maps toolbar icon, or, from the menu, as Space > Local Neighbor
Match Test. This is followed by a dialog to select the variables.
In addition to the variable names, the dialog includes options for four important parameters.
The Number of Neighbors specifies the range for which the match is explored, i.e.,
the value of k in the k-nearest neighbor weights that are computed under the hood. In
the example, 6 is used, to allow comparison with the other approaches considered in this
chapter. Next is the Variable Distance Function, with Euclidean as the default, but
Manhattan distance as the other option. A third option pertains to the Geographic
Distance Metric. Here again, Euclidean Distance is the default, but Arc Distance is
available as well, for the case where the map layer is unprojected. Finally, Transformation
offers six options to adjust the variables: Raw, Demean, Standardize (Z), Standardize
(MAD), Range Adjust and Range Standardize. The default of Standardize (Z) is
the recommended approach. With this information, the two k-nearest neighbor weights are
constructed and the cardinality of the intersection computed.
In the illustration, the same three variables are employed as for the Multivariate Local Geary.
Before moving to the actual test, the logic of the approach is detailed.
Even though the largest number of matches is only two, such an occurrence is very rare
under spatial randomness, corresponding with a p-value of 0.0007. The 10 such identified
observations are highlighted in Figure 18.15, with the matching links shown in red.
Figure 18.16 provides a better sense of the rarity of the coincidence between geographic and
multi-attribute nearest neighbors. The two sets of neighbors are shown for the 10 highlighted
locations. It is clear that several nearest neighbors in attribute space are not so close in
geographic space, confirming the tension between the two notions of similarity.
four are not. Those four are instances where the average distances in attribute space to the
geographic neighbors are much larger than the individual distances for the nearest neighbors
in attribute space (as opposed to all the neighbors).
This can be assessed by investigating the linkage structure for the affected observations in
the right hand panel of Figure 18.16. For example, for the census tract identified by the
green rectangle in Figure 18.17, several of the nearest attribute neighbors are far away in
geographic space, as illustrated by the link highlighted by the blue arrow in Figure 18.16, as
well as its other links beyond the first two neighbors.
By focusing on a different aspect of the locational-attribute similarity trade-off, the Local
Neighbor Match Test provides yet another avenue to explore multivariate local clusters.
However, as pointed out, this approach does not scale well to a large number of variables,
since the empty space problem creates a large computational burden for the calculation of
k-nearest neighbors in high-dimensional attribute space.
19
LISA for Discrete Variables
So far, the application of local spatial autocorrelation statistics has been to continuous
variables. In this chapter, discrete variables are considered, and, more specifically, binary
variables. To address this context, a univariate Local Join Count statistic (Anselin and Li,
2019) and its extension to a multivariate setting are introduced. The latter allows for a
distinction between situations where the two discrete variables can co-occur (i.e., take the
value of 1 for the same location), and where they cannot (no co-location).
The principle behind the Local Join Count statistic is broadened by applying it to a subset
of observations on a continuous variable that satisfy a given constraint. The most common
application of this idea is to a specific quantile of the observations, leading to the concept of
a Quantile LISA (Anselin, 2019b).
The Chicago SDOH sample data set is again used to illustrate these methods.
For a binary variable, the joins between neighboring values can be of the type 1 − 1 (so-called BB joins, black-black), 0 − 0 (WW joins,
white-white) and 0 − 1 (so-called BW joins). The former two are indicators of positive spatial
autocorrelation, the latter of negative spatial autocorrelation.
The primary interest lies in identifying co-occurrences of uncommon events, i.e., situations
where observations that take on the value of 1 constitute much less than half of the sample.
The definition of what is 1 or 0 can easily be reversed to make sure this condition is met.
Therefore, the focus is on the BB join counts. While this is not an absolute requirement, the
way the inference is obtained requires that the probability of obtaining a large number of
like neighbors is small, and thus can form the basis for rejecting the null hypothesis. When
the proportion of observations with 1 is larger than half, then the probability of a small
number of neighbors with a value of 1 will be small, which is counter to the overall logic.1
With the variable xi at location i taking either the value of 1 or 0, a global BB join count
statistic can be written as:
$$BB = \sum_i \sum_j w_{ij} x_i x_j,$$
where wij are the elements of a binary spatial weights matrix. In other words, a join is
counted when wij = xi = xj = 1. In all other instances, either when there is a mismatch in
x between i and j, or a lack of a neighbor relation (wij = 0), the term on the right-hand
side does not contribute to the double sum.
Following the logic in Anselin (1995), Anselin and Li (2019) recently introduced a local
version of the BB join count statistic as:
$$BB_i = x_i \sum_j w_{ij} x_j,$$
where $x_i$ and $x_j$ can only take on the values of 1 and 0, and, again, $w_{ij}$ are the elements of a
binary spatial weights matrix (i.e., not row-standardized).
The statistic is only meaningful for those observations where xi = 1, since for xi = 0 the
result will always equal zero. When xi = 1, it corresponds to the sum of neighbors for which
the value $x_j = 1$. In this sense, it is similar in spirit to the local second-order analysis for
point patterns outlined in Getis (1984) and Getis and Franklin (1987), where the number of
points within a given distance d of an observed point is counted. The distance cut-off d
could readily form the basis for the construction of the spatial weights wij , which yields the
join count statistic as a count of events (points) within the critical distance from a given
point (xi = 1).
The main difference between the two concepts is the underlying data structure: in the point
pattern perspective, the locations themselves are considered to be random, whereas the
local join count statistic is based on a lattice perspective. The latter considers a finite set of
known locations, for which both events (xi = 1) and non-events (xi = 0) are observed. In
point patterns analysis, one does not know the locations where events might have happened,
but did not.
In addition, the local join count statistic also has the same structure as the numerator in the
local $G_i$ statistic of Getis and Ord (1992), when applied to binary observations (and with
a binary weights matrix). The numerator in this statistic is $\sum_j w_{ij} x_j$, which is identical
to the multiplier in the local join count statistic. However, the difference between the two
statistics is that the local $G_i$ includes the neighbors with $x_j = 1$ for all locations, including
the ones where $x_i = 0$. Such observations are ignored in the computation of the local join
count statistic as outlined above. In a sense, the local join count statistic could thus be
considered a constrained form of the local $G_i$ statistic, limited to observations where $x_i = 1$.
1 In practice, this can easily be detected when the results show locations with 1 like neighbor to be significant.
Inference can be based on a hypergeometric distribution, or, as before, on a permutation
approach. Given a total number of events in the sample of n observations as P , the magnitude
of interest is the number of neighbors of location i for which xi = 1, i.e., conditional upon
observing 1 at this location. The number of neighbors with xj = 1 is represented by pi . The
probability of observing exactly pi = p, conditional upon xi = 1 follows the hypergeometric
distribution for n − 1 data points and P − 1 events:
$$\mathrm{Prob}[p_i = p \,|\, x_i = 1] = \frac{C(P-1, p)\; C(n-P, k_i - p)}{C(n-1, k_i)},$$
with $k_i$ as the number of neighbors of observation i and C as the combinatorial operator.
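The statistic and its hypergeometric p-values can be sketched directly from these expressions (illustrative function name, binary weights as a dense array):

```python
import numpy as np
from scipy.stats import hypergeom

def local_join_count(x, w_binary):
    """Sketch of BB_i = x_i * sum_j w_ij x_j for a 0/1 variable and binary
    weights, with hypergeometric p-values conditional on x_i = 1."""
    x = np.asarray(x, int)
    w = np.asarray(w_binary, float)
    n = len(x)
    P = int(x.sum())                              # number of events
    k = w.sum(axis=1).astype(int)                 # neighbor cardinalities k_i
    bb = x * (w @ x).astype(int)
    # probability of p_i or more like neighbors among k_i draws from the
    # remaining n - 1 locations, which still contain P - 1 events
    p = np.where(x == 1, hypergeom.sf(bb - 1, n - 1, P - 1, k), np.nan)
    return bb, p
```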
19.2.1 Implementation
To illustrate the univariate Local Join Count statistic, two variables are considered that
correspond to census tracts in Chicago with a predominant ethnic make-up. More precisely,
these are the tracts where the Hispanic population makes up more than 50% (Hisp), and
the tracts where the majority population is Black (Blk).2 A unique values map for these two
2 Note that in the original data set, the binary variables HISP50PCT and BLCK50PCT are not
computed correctly. In fact, it turns out these variables are computed in the data set using a 49% cut-off
percentage, which does not preclude co-location. The variables Blk and Hisp use the correct cut-off, so that
co-location (majority population is more than one ethnic group) is precluded by construction.
354 LISA for Discrete Variables
variables is given in Figure 19.1. The pattern reveals the strong degree of racial segregation,
a well-known characteristic of the population distribution in Chicago.
The univariate local join count statistic is invoked from the third group in the Cluster Maps
drop-down list on the toolbar icon, or from the menu, as Space > Univariate Local Join
Count. This is followed by a Binary Variable Settings dialog, where the variable is
selected and the spatial weights matrix is specified. In this illustration, the spatial weights
are queen contiguity (Chi_SDOH_q). This is one of the few instances in GeoDa where
the spatial weights are not row-standardized and instead are used as binary weights.
19.3.1 Implementation
The Bivariate Local Join Count statistic is invoked from the third group in the drop-down
list associated with the Cluster Maps toolbar icon, or, from the menu, as Space >
Bivariate Local Join Count. Next is a Variable Settings dialog from which the two
binary variables are selected. The First Variable (X) (x) is selected from the left-hand
column in the dialog, the Second Variable (Y) (z) is taken from the right-hand column.
Two cases are illustrated in Figure 19.3. In the left-hand panel, the first variable is Hisp
with the second variable as Blk. In the right-hand panel, it is the other way around.
Both significance maps are based on queen contiguity (Chi_SDOH_q) and use 99,999
permutations with a 0.05 significance cut-off. Clearly, the statistic is not symmetric.
The statistic picks up the rare occasions where a tract with an ethnic majority of one type
is surrounded by tracts with a majority of the other type. In the example, this is only the
case in very few instances. Hisp is surrounded by Blk in 12 locations, but 11 of those are
only at p = 0.05. Blk is surrounded by Hisp in only 7 instances, but only one of these is
highly significant at p < 0.00001. As shown in Figure 19.4, where the neighbors are selected
in the unique values map on the left, this is a clear example of a spatial outlier, where a
Black majority tract is surrounded by all Hispanic majority tracts.
All the same options as before are available, including saving the results.
For the two-variable co-location case, where both $x_i = 1$ and $z_i = 1$ can occur at the same location, the statistic takes the form:
$$CLC_i = x_i z_i \sum_j w_{ij} x_j z_j,$$
with $w_{ij}$ as unstandardized (binary) spatial weights. As before, there are P observations
with $x_i = 1$ and Q observations with $z_i = 1$ out of a total of n.
A conditional permutation approach can be constructed for those locations with xi = zi = 1.
The permutation consists of draws of ki pairs of observations (xj , zj ) from the remaining set
of n − 1 tuples, which contain P − 1 observations with xj = 1 and Q − 1 observations with
$z_j = 1$. In a one-sided test, the number of times the statistic equals or exceeds the observed
join count value at i is counted.
The extension to more than two variables is mathematically straightforward. At each location
i, k variables are considered, i.e., $x_{hi}$, for h = 1, . . . , k, with $\prod_{h=1}^{k} x_{hi} = 1$, which enforces
the co-location requirement.
The corresponding statistic is then:
$$CLC_i = \prod_{h=1}^{k} x_{hi} \sum_j w_{ij} \prod_{h=1}^{k} x_{hj}.$$
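A sketch for the general co-location case, including the conditional permutation inference described above, is given below (binary weights as a dense array, illustrative names):

```python
import numpy as np

def colocation_join_count(X_bin, w_binary, permutations=999, seed=12345):
    """Sketch of CLC_i for k binary variables (columns of X_bin): the
    product over the variables at i times the weighted sum of the product
    at the neighbors, with conditional permutation p-values."""
    rng = np.random.default_rng(seed)
    co = np.asarray(X_bin, int).prod(axis=1)      # 1 only where all variables are 1
    w = np.asarray(w_binary, float)
    n = len(co)
    clc = co * (w @ co).astype(int)
    p_sim = np.full(n, np.nan)
    for i in np.nonzero(co)[0]:                   # only meaningful where co_i = 1
        k = int(w[i].sum())
        others = np.delete(co, i)                 # draw tuples from the rest
        ref = np.array([rng.choice(others, k, replace=False).sum()
                        for _ in range(permutations)])
        p_sim[i] = (np.sum(ref >= clc[i]) + 1) / (permutations + 1)
    return clc, p_sim
```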
19.4.1 Implementation
The Co-Location Local Join Count statistic is invoked from the third group in the drop-
down list associated with the Cluster Maps toolbar icon, or, from the menu, as Space >
Co-location Local Join Count. This is followed by a Multi-Variable Settings dialog,
where the variables to be considered can be specified. At the bottom of the dialog is the
customary drop-down list with the spatial weights.
To illustrate this statistic, the two variables are Blk, tracts with majority Black population,
and CAR, tracts where more than 50% of the commutes happen by car. The spatial weights
are again queen contiguity, Chi_SDOH_q.
A co-location map of the two binary variables is shown in the left-hand panel of Figure
19.5. Of the 791 tracts, 167 are both majority Black and majority commute by car. Of the
remainder, 254 are neither and 370 show a mismatch of the majorities. The corresponding
Co-Location Local Join Count significance map is shown in the right-hand panel, using
99,999 permutations and a cut-off of 0.05. At 0.05, there are 90 cores of clusters that overlap,
out of the 167, but at 0.01, only 57 of those remain. They confirm the impression of high
dependence on commuting by car in majority Black neighborhoods.
All the usual options are available.
The binary variable is defined as $x_i = 1$ for $y_{ql} \leq y_i \leq y_{qu}$, with $y_{ql}$ and $y_{qu}$ as the lower and upper bounds for a given quantile.3 For all other observations,
$x_i = 0$.
The new x variable (or set of such variables in a multivariate case) then forms the basis for
analysis by means of a Local Join Count statistic, or one of its multivariate extensions.
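The pre-processing step is no more than a quantile cut. A minimal sketch follows, assuming the selected quantile is defined by inclusive bounds (GeoDa's exact boundary convention may differ):

```python
import numpy as np

def quantile_indicator(y, quantiles=5, select=5):
    """Sketch of the Quantile LISA pre-processing: 1 for observations in
    the selected quantile (default: top quintile), 0 otherwise."""
    y = np.asarray(y, float)
    bounds = np.quantile(y, np.linspace(0, 1, quantiles + 1))
    lo, hi = bounds[select - 1], bounds[select]
    return ((y >= lo) & (y <= hi)).astype(int)

# the binary result then feeds into a (possibly multivariate) Local Join Count
```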
19.5.1 Implementation
The Univariate and Multivariate Quantile LISA are invoked from the next to last block in
the drop-down list associated with Cluster Maps icon on the toolbar, or, from the menu,
as Space > Univariate Quantile LISA and Space > Multivariate Quantile LISA
respectively.
The Quantile LISA Dialog contains several options. In the univariate case, the continuous
variable needs to be specified, together with the spatial weights. Next follow the criteria to
create the binary form of the variable: the Number of Quantiles, Select a Quantile for
LISA and Save Quantile Selection in Field. The default number of quantiles is 5, for
quintiles. The quantile selection consists of a drop-down list with appropriate values, i.e.,
in this example 1 to 5, with 1 for the lowest quantile (in this case, quintile), and 5 for the
highest. The resulting binary variable is added to the data table with a default variable
name of QT.
In the multivariate case, a similar interface allows for the creation of one binary variable
at a time, with for each the number of categories and the order of the selected category,
saved under default variables QTk, with k as the sequence number. Each newly created
variable must be moved to the right-hand panel of the dialog by means of the > button
(and, conversely, can be removed by means of the < button).
3 In general, this may be applied to any interval, not just a given quantile.
A check box determines whether No co-location must be enforced. The default is to allow
co-location. A warning message results in case of no overlap when there should be overlap,
and vice versa.
With the spatial weights selected, the analysis is run and a significance map is created.
This is illustrated with three specific cases: a univariate Quantile LISA, and a bivariate and
multivariate example. All the standard options are available.
The co-location option is the default; it allows the identification of clusters of observations that
belong to specified quantiles for different variables. Typically, this will be the top or bottom
quantiles, but the tool is sufficiently flexible to allow any combination (provided it makes
sense). A critical constraint is that there needs to be co-location of the quantiles for all
the variables considered. For example, if three variables are taken into account, then there
must be locations (at least one), where the respective quantiles coincide. Just as for the
multivariate local join count, this becomes harder to satisfy as more variables are being
considered.
Eight cluster cores are identified, of which three are significant at p = 0.001. The cluster map on the right
shows the significant locations for the corresponding Multivariate Local Geary (using the
Bonferroni bound for 0.01 and 99,999 permutations). The Multivariate Local Geary cluster
map includes 35 cluster cores. Of those, only two are also identified by the Multivariate
Quantile LISA. The six others are not found to be significant in the Multivariate Local
Geary cluster map.
In contrast to the large number of significant locations obtained with the Multivariate Local
Geary, the Quantile version focuses on a much smaller (sub)set of significant observations
and may thus provide clearer insight into the interesting locations.
20
Density-Based Clustering Methods
In this last chapter dealing with local patterns, density-based clustering methods are
considered. These approaches search the data for high density subregions of arbitrary shape,
separated by low-density regions. Alternatively, the elevated regions can be interpreted as
modes in the spatial distribution over the support of the observations.
The methods covered form a transition between the local spatial autocorrelation statistics
and the regionalization methods considered in Volume 2. They pertain primarily to point
patterns, but can also be extended to a full multivariate setting. At first sight, density-based
clustering methods may seem similar to spatially constrained clustering, but they are not
quite the same. In contrast to the regionalization methods considered in Volume 2, the result
of density-based clustering does not necessarily yield a complete partitioning of the data. In
a sense, the density-based cluster methods are thus similar in spirit to the identification of
clusters by means of local spatial autocorrelation statistics, although they are not formulated
as hypothesis tests. Therefore, these methods are included in the discussion of local spatial
autocorrelation, rather than with the regionalization methods considered in Volume 2.
Attempts to discover high density regions in the data distribution go back to the classic
paper on mode analysis by Wishart (1969), and its refinement in Hartigan (1975). In the
literature, these methods are also referred to as bump hunting, i.e., looking for bumps (high
regions) in the data distribution.
In this chapter, the focus is on the application of density-based clustering methods to
the geographic location of points, but the methods can be generalized to locations in
high-dimensional attribute space as well.
Four approaches are considered. First is a simple heat map as a uniform density kernel
centered on each location. The logic behind this graph is similar to that of Openshaw’s
Geographical Analysis Machine (Openshaw et al., 1987) and the approach taken in spatial
scan statistics (Kulldorff, 1997), i.e., a simple count of the points within the given radius.
This is also the main idea behind the Getis-Ord local statistics considered in Chapter 17.
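The uniform kernel count is trivial to reproduce with a spatial index; a minimal sketch using scipy's k-d tree follows (the bandwidth is in the units of the coordinates):

```python
import numpy as np
from scipy.spatial import cKDTree

def uniform_kernel_counts(coords, bandwidth):
    """Sketch of the heat map logic: for each point, count the other points
    that fall within a circle of radius `bandwidth`."""
    tree = cKDTree(coords)
    return np.array([len(ix) - 1                  # exclude the point itself
                     for ix in tree.query_ball_point(coords, r=bandwidth)])
```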
The remaining methods are all related to DBSCAN (Ester et al., 1996), i.e., Density-Based
Spatial Clustering of Applications with Noise. Both the original DBSCAN is outlined, as
well as its improved version, referred to as DBSCAN*, and its Hierarchical version, referred
to as HDBSCAN, or, sometimes, HDBSCAN* (Campello et al., 2013, 2015).
The methods are illustrated with the Italy Community Banks sample data set that contains
the locations of 261 community banks.
20.2.1 Implementation
Unlike the other methods considered in this Chapter, the Heat Map is not invoked from the
cluster menu, but as the Heat Map option on any point map. This brings up a dialog with
various options. The top item is to Display Heat Map, which, when checked, brings up
the default, with the Heat Map on Top item active.
The default Heat Map is constructed for the max-min bandwidth that ensures that each
location has at least one other location within its circle. This is the same default as used to
specify the distance-band spatial weights (see Section 11.3.1). In most applications, this is
not very informative, as shown in Figure 20.2, which depicts the locations contained in the
Italy Community Banks sample data set.1 The Heat Map Bandwidth Setup Dialog in
the example indicates a value of 124,660 (meters), the same as for the distance-band spatial
weights used as an illustration in Chapter 11. The map can be customized using the various
available options, of which the most important is to Specify Bandwidth.
20.2.1.1 Options
In Figure 20.3, a customized version is shown, where advantage is taken of several of the
options. First, the bandwidth is set to 73,000 (meters), the same as in Section 11.4.2 (Figure
11.8). In addition, the Change Fill Color is used to set the color to blue, with the Change
Transparency to 0.95.
These settings provide a much clearer indication of the varying density over the map, with
different gradations of the color corresponding to high and low density regions. In particular,
the area in the North of the country is highlighted as a high density location.
A final option, Specify Core Distance, only applies to the output of an HDBSCAN
clustering routine (see Section 20.5.5.3).
1 The map shown in the figure has the default point colors changed to black, with the point size as 3. In
addition, a second layer consisting of the outlines of the Italian regions is added, of which
Boundary option is selected.
20.3 DBSCAN
The DBSCAN algorithm was originally outlined in Ester et al. (1996) and Sander et al.
(1998) and was more recently elaborated upon in Gan and Tao (2017) and Schubert et al.
(2017). Its logic is similar to that just outlined for the uniform kernel. In essence, the method
again consists of placing circles of a given radius on each point in turn, and identifying those
groupings of points where a lot of locations are within each other's range.
A second critical concept is the number of points that need to be included in the distance
band in order for the spatial distribution of points to be considered as dense. This is the
so-called minimum number of points, or MinPts criterion. In the example, this is set to four.
However, in contrast to the convention used before in defining k-nearest neighbors (Section
11.3.2), the MinPts criterion includes the point itself. So a MinPts = 4 corresponds to a
point having 3 neighbors within the Eps radius.3
In Figure 20.4, all red points with associated red circles have at least four points within the
critical range (including the central point). They are labeled as Core points (the letter C in
the graph) and are included in the cluster (connected by the dashed blue lines).
The points in magenta (labeled B), with associated magenta circles have some points within
the distance range (one, to be precise), but not sufficient to meet the MinPts criterion. They
are potential Border points and may or may not be included in the cluster.
Finally, the blue point (labeled N) with associated blue circle does not have any points
within the critical range and is labeled Noise. Such a point cannot become part of any
cluster.
20.3.1.2 Reachability
A point is directly density reachable from another point if it belongs to the Eps neighborhood
of that point and is one of MinPts neighbors of that point. This is not a symmetric
relationship.
In the example, any red point C is directly density reachable from at least two other red
points. Any such point pairs are within each other's critical range. However, for border points
B, the relationship only holds in one direction, as shown by the arrow. They are directly
density reachable from a neighbor C, but since the B points only have one neighbor, their
range does not meet the minimum criterion. Therefore, the neighbor C is not directly density
reachable from B.
3 This definition of MinPts is from the original paper. In some software implementations, the minimum
points pertain to the number of nearest neighbors, i.e., MinPts − 1. GeoDa follows the definition from the
original papers.
A chain of points in which each point is directly density reachable from the previous one is
called density reachable. In order to be included, each point in the chain has to have at least
MinPts neighbors and could serve as the core of a cluster. All the points labeled C in the
figure are density reachable, highlighted by the dashed connections between them.
In order to decide whether a border point should be included in a cluster, the concept of
density connected is introduced. A point becomes density connected if it is connected to a
density reachable point. For example, the points B are each within range of a point C that
is itself density reachable. As a result, the Border points B become included in the cluster.
different cluster, even though it might actually be closer to the corresponding core point. In
a later version, labeled DBSCAN* (considered in Section 20.4), the notion of border points
is dropped, and only Core points are considered to form clusters.
The algorithm systematically moves through all the points in the data set and assesses their
range. The search for neighbors is facilitated by using an efficient spatial data structure,
such as an R* tree. When two clusters are density connected, they are merged. The process
continues until all points have been evaluated.
Two critical parameters in the DBSCAN algorithm are the distance range and the minimum
number of neighbors. In addition, sometimes a tolerance for noise points is specified as well.
The latter constitute zones of low density that are not deemed to be interesting.
In order to avoid any noise points, the critical distance must be large enough so that every
point is at least density connected. An analogy is the specification of a max-min distance band
in the creation of spatial weights, which ensures that each point has at least one neighbor. In
practice, this is typically not desirable, but in some implementations a maximum percentage
of noise points can be set to avoid too many low density areas.4
Ester et al. (1996) recommend that MinPts be set at 4 and the critical distance adjusted
accordingly to make sure that sufficient observations can be classified as Core. As in other
cluster methods, some trial and error is typically necessary. Having to find a proper value
for Eps is often considered a major drawback of the DBSCAN algorithm.
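A common diagnostic for this choice is the sorted k-distance graph of Ester et al. (1996): plot each point's distance to its (MinPts − 1)-th nearest neighbor in decreasing order and look for a knee. A minimal sketch of this diagnostic (illustrative names, not a GeoDa function):

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import cKDTree

def k_distance_plot(pts, min_pts=4):
    # the query includes the point itself at index 0, so the last column
    # is the distance to the (min_pts - 1)-th nearest neighbor
    dist, _ = cKDTree(pts).query(pts, k=min_pts)
    kdist = np.sort(dist[:, -1])[::-1]
    plt.plot(kdist)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {min_pts - 1}-th nearest neighbor")
    plt.show()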
20.3.2.1 Illustration
To illustrate how DBSCAN proceeds, a toy example of nine points is depicted in Figure
20.5. The inter-point distances are included on the graph. The typical point of departure is to
consider a connectivity structure that ensures that each point has at least one neighbor, i.e.,
the max-min critical distance. In the example, this distance is 29.1548, with the corresponding
number of neighbors in the connectivity graph ranging from 1 to 5.
The concept of Noise points (i.e., unconnected points for a given critical distance) becomes
clear when the Eps distance is set to 20. In Figure 20.5, this results in one unconnected
point, labeled 3, shown by the dashed blue line between 3 and 5. The green lines correspond
with inter-point distances that do not meet the Eps criterion either, but they do not result
in isolates. The effective connectivity graph for Eps = 20 is shown in solid blue.
With MinPts as 4 (i.e., 3 neighbors for a core point), the algorithm proceeds through each
point, one at a time. For example, it could start with point 1, which only has one neighbor
(point 2) and thus does not meet the MinPts criterion. Therefore, it is initially labeled
as Noise. Next, point 2 is considered. It has 3 neighbors, meaning that it meets the Core
criterion, and therefore it is labeled cluster 1. All its neighbors, i.e., 1, 4 and 5 are also
labeled cluster 1 and are no longer considered. Note that this changes the status of point 1
from Noise to Border. Point 3 has no neighbors and is therefore labeled Noise.
Next is point 6, which is a neighbor of point 4 that belongs to cluster 1. Since 6 is therefore
density connected, it is added to cluster 1 as a Border point. The same holds for point 7,
which is similarly added to cluster 1.
Point 8 has two neighbors, which is insufficient to reach the MinPts criterion. Even though
it is connected to point 7, which belongs to the cluster, 7 is not a Core point, so 8
4 GeoDa currently does not support this option.
is not density connected and is labeled Noise. Finally, point 9 has only point 8 as neighbor
and is therefore labeled Noise as well.
This results in one cluster consisting of 6 points, shown as the dark blue points in Figure
20.6. The solid blue lines show the connections between the Core points, the dashed lines
show the Border points. The light blue points for 3, 8 and 9 indicate Noise points.
20.3.3 Implementation
DBSCAN is invoked from the cluster toolbar icon (Figure 20.1) as the first item in the density
cluster group, or from the menu as Clusters > DBScan. This brings up the DBScan
Clustering Settings dialog, which combines two functions. One is the specification of
variables and various cluster options, shown in the left-hand panel of Figure 20.7. The other
is the presentation of Summary characteristics of the resulting clusters, and, for DBSCAN*,
the Dendrogram (see Section 20.4.2.1).
The Select Variables panel includes the variables that must be chosen as X and Y
coordinates. For the Italy Community Banks sample data set, the variables XKM and
YKM are specified, which correspond with the projected coordinates expressed in kilometers
(the original is in meters). The selected Method is DBScan.
The selection of the coordinates immediately populates some of the fields in the Parameters
panel. The default Transformation is Standardize (Z), with the Distance Threshold
(epsilon) as 0.443177. However, in this instance, the original coordinates should be used,
hence the Raw option is specified in Figure 20.7. This yields a critical distance of 124.659981.
As it turns out, a slight correction is needed, and the value is rounded to 124.660 in the
dialog.
The other options consist of the Min Points, which is initially left to the default of 4,
and the Distance Function, set to Euclidean (the other option is Manhattan distance).
The Min Cluster Size option is not available for DBSCAN.
A final item to be specified is the variable that will contain the cluster identifier (Save
Cluster in Field), with CL as the default. In GeoDa, the identifiers are assigned in decreasing
order of the size of the cluster, i.e., with 1 for the largest cluster. Points that remain as
Noise points are assigned a value of 0.
The two main parameters of Distance Threshold and Min Points play a crucial role. As an
illustration, the results of a cluster analysis with a bandwidth of 50 km and the minimum
points set to 10 are given in Figure 20.9. In addition to the cluster points, identified by
different colors, the connectivity graph for a 50 km distance band is shown as well. Seven
clusters are identified, ranging in size from 110 to 10 observations. A total of 70 locations
are labeled Noise, either because they become isolates (not connected to the graph), or they
have insufficient neighbors. The overall fit of the clusters is much better, with a between to
total ratio of 0.957 (not shown), due to a grouping of locations that are relatively close by.
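For readers who want to check such a solution outside of GeoDa, the same specification can be run in scikit-learn. The sketch below uses synthetic stand-in coordinates (in practice, the XKM and YKM columns would be substituted); note that scikit-learn's min_samples, like GeoDa's Min Points, counts the point itself, whereas its noise label is -1 rather than 0:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
coords = rng.uniform(0, 500, size=(261, 2))    # stand-in for the XKM, YKM columns
labels = DBSCAN(eps=50.0, min_samples=10).fit_predict(coords)
n_clusters = labels.max() + 1                  # noise points are labeled -1
n_noise = int((labels == -1).sum())            # GeoDa labels noise as 0 instead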
20.4 DBSCAN*
One of the potentially confusing aspects of DBSCAN is the inclusion of so-called Border
points in a cluster. The DBSCAN* algorithm, outlined in Campello et al. (2013) and further
elaborated upon in Campello et al. (2015), does away with the notion of border points, and
only considers Core and Noise points.
Similar to the approach in DBSCAN, a Core object is defined with respect to a distance
threshold (Eps) as the center of a neighborhood that contains at least Min Points other
observations. All non-core objects are classified as Noise (i.e., they do not have any other
point within the distance range Eps). Two observations are Epsilon reachable if each is part
of the Epsilon-neighborhood of the other. Core points are density-connected when they are
part of a chain of Epsilon-reachable points. A cluster is then any largest (i.e., all eligible
points are included) subset of density connected pairs. In other words, a cluster consists of a
chain of pairs of points that belong to each other's Epsilon-neighborhoods.
The key concept is the mutual reachability distance between two points A and B, defined
as dmr(A, B) = max[dcore(A), dcore(B), dAB], where the core distance dcore is the distance
from a point to its (MinPts − 1)-th nearest neighbor and dAB is the inter-point distance.
This mutual reachability distance replaces the original distance dAB in the connectivity
graph that forms the basis for the determination of clusters (e.g., a graph like Figure 20.5).
Unless A and B are mutual k-nearest neighbors, this effectively replaces the inter-point
distances by a core distance.5
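A minimal numpy sketch of this definition, computed from a full inter-point distance matrix d (illustrative code, not GeoDa's internal implementation):

import numpy as np

def mutual_reachability(d, min_pts=4):
    # core distance: distance to the (min_pts - 1)-th nearest neighbor;
    # column 0 of each sorted row is the point itself (distance 0)
    core = np.sort(d, axis=1)[:, min_pts - 1]
    # d_mr(A, B) = max(d_core(A), d_core(B), d_AB)
    return np.maximum(d, np.maximum(core[:, None], core[None, :]))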
In DBSCAN*, the adjusted connectivity graph is reduced to a minimum spanning tree or
MST (see Section 3.4) to facilitate the derivation of clusters. Rather than finding a single
solution for a given distance range as in DBSCAN, the longest edges in the MST are cut
sequentially to provide a solution for each value of inter-point mutual reachability distance.
This boils down to finding a cut for smaller and smaller values of the core distance. It yields
a hierarchy of clusters for decreasing values of the core distance.
5 The largest k-nearest neighbor distance will always be larger than the inter-point distance, unless A and
B are mutual k-nearest neighbors, in which case dmr(A, B) = dcore(A) = dcore(B) = dAB.
In practice, one starts by cutting the longest edge in the MST, i.e., the edge connecting the
two points that are furthest apart. This either results in a single point splitting off (a Noise
point), or in the single cluster splitting into two (two subtrees of the MST). Subsequent
cuts are applied for smaller and smaller
distances. Decreasing the critical distance is the same as increasing the density of points,
referred to as λ in this context (see Section 20.5).
As Campello et al. (2015) show, applying cuts to the MST with decreasing distance (or
increasing density) is equivalent to applying cuts to a dendrogram obtained from single
linkage hierarchical clustering using the mutual reachability distance as the dissimilarity
matrix (hierarchical clustering methods are considered in Volume 2).
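This equivalence can be sketched with standard scipy routines, applying single linkage to the mutual reachability matrix from the earlier sketch and cutting the hierarchy at a given Eps (noise handling is omitted; illustrative code only):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dbscan_star(d, eps, min_pts=4):
    dmr = mutual_reachability(d, min_pts)   # helper sketched above
    np.fill_diagonal(dmr, 0.0)              # squareform requires a zero diagonal
    Z = linkage(squareform(dmr), method="single")
    return fcluster(Z, t=eps, criterion="distance")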
DBSCAN* derives a set of flat clusters by applying a cut to the dendrogram associated with
a given distance threshold. The main difference with the original DBSCAN is that border
points are no longer considered. In addition, the clustering process is visualized by means of
a dendrogram. However, in all other respects it is similar in spirit, and it still suffers from
the need to select a distance threshold.
20.4.2 Implementation
DBSCAN* is invoked in the same way as DBSCAN, but with DBScan* checked as the
Method in the DBScan Clustering Settings dialog (Figure 20.7). All other settings are
the same, except that now the Min. Cluster Size option becomes available. However, in
practice, this is typically set equal to the Min Points parameter and it is seldom used.
The functionality operates largely the same as for DBSCAN, except that the first result
for a given Min Points is a Dendrogram, shown in the right panel of the dialog. Figure
20.10 depicts the dendrogram for the same settings as in Figure 20.9, i.e., with a threshold
distance of 50 km and 10 minimum points. However, the results are very different. Only two
clusters are identified, compared to 7 clusters before. They are visualized by selecting the
Save/Show Map button at the bottom of the dialog.
The resulting cluster map is shown in Figure 20.11. One cluster is very large, containing 100
observations, the other is small, with only 12 observations. A total of 149 observations are
classified as Noise. Relative to the DBSCAN solution in Figure 20.9, the fit declined to a
ratio of 0.894.
For example, in the right-hand panel of Figure 20.12, the dashed red line corresponding to
the cut-off distance is moved to the left, to a distance of 73 km (also used in Section 11.4.2).
The corresponding cluster map is shown in the left-hand panel. The number of clusters has
doubled to four, ranging in size from 151 to 11 observations, with 53 unclustered Noise
points. The fit ratio decreases to 0.855.
The map may give the impression that seemingly disconnected points are contained in the
same cluster. In the example, the points within the red rectangle contain both observations
that belong to cluster 1 (dark blue) as well as unclustered points (light blue). This seeming
contradiction is due to the fact that the core distance is used in the dendrogram, and not
the actual inter-point distance.
The dendrogram needs to be reconstructed for each different value of Min Points, since
this critically affects the mutual reachability and core distances.
The main disadvantage of DBSCAN*, just as for DBSCAN, remains that the same fixed
threshold distance must be applied to all clusters, yielding a so-called flat cluster. This is
remedied by HDBSCAN.
20.5 HDBSCAN
HDBSCAN was originally proposed by Campello et al. (2013), and more recently elaborated
upon by Campello et al. (2015) and McInnes and Healy (2017).6 As in DBSCAN*, the
algorithm implements the notion of mutual reachability distance. However, rather than
applying a fixed value of the cut-off distance to produce a cluster solution, a hierarchical
process is implemented that finds an optimal cluster combination, using a different critical
cut-off distance for each cluster. In order to accomplish this, the concept of cluster stability
or persistence is introduced. Optimal clusters are selected based on their relative excess of
mass value, which is a measure of their persistence.
6 This method is variously referred to as HDBSCAN or HDBSCAN*. Here, for simplicity, the former will
be used.
HDBSCAN keeps the notion of Min Points from DBSCAN and also uses the concept of core
distance of an object (dcore ) from DBSCAN*. The core distance is the distance between
an object and its k-nearest neighbor, where k = Min Points − 1 (in other words, as for
DBSCAN, the object itself is included in Min Points).
Intuitively, for the same k, the core distance will be much smaller for densely distributed
points than for sparse distributions. Associated with the core distance for each object p
is the concept of density, λp, defined as the inverse of the core distance. As the core distance
becomes smaller, λ increases, indicating a higher density of points.
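As a small illustration of these two quantities, the sketch below computes each point's core distance and density level with a k-d tree (hypothetical names, not GeoDa code):

import numpy as np
from scipy.spatial import cKDTree

def core_distance_and_lambda(pts, min_pts=10):
    # the query includes the point itself at index 0, so the last column is
    # the distance to the (min_pts - 1)-th nearest neighbor, i.e., d_core
    dist, _ = cKDTree(pts).query(pts, k=min_pts)
    d_core = dist[:, -1]
    lam = 1.0 / d_core    # density level: smaller core distance, higher λ
    return d_core, lam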
As in DBSCAN*, the mutual reachability distance is employed to construct a minimum
spanning tree, or, rather, the associated dendrogram from single linkage clustering. From
this, HDBSCAN derives an optimal solution, based on the notion of persistence or stability
of the cluster.
Formally, the stability of a cluster Ci is defined as the sum, over its member points, of the
density differences:
S(Ci) = Σxj∈Ci [λmax(xj, Ci) − λmin(Ci)],
where λmin(Ci) is the minimum density level at which Ci exists, and λmax(xj, Ci) is the
density level after which point xj no longer belongs to the cluster.
The persistence of a cluster becomes the basis to decide whether to accept a cluster split or
to keep the larger entity as the cluster.
The inter-point distances form the basis for a minimum spanning tree (MST), which is
essential for the hierarchical nature of the algorithm. The corresponding connectivity graph
is shown in Figure 20.15, with both the node identifiers and the edge weights listed (the
distance between the two nodes the edge connects). The HDBSCAN algorithm essentially
consists of a succession of cuts in the connectivity graph. These cuts start with the highest
edge weight and move through all the edges in decreasing order of the weight. The process
of selecting cuts continues until all the points are singletons and constitute their own cluster.
Formally, this is equivalent to carrying out single linkage hierarchical clustering based on
the mutual reachability distance matrix.
The stability or persistence of each cluster is defined as the sum over all the points of the
difference between λp and λmin , where λmin is the λ level where the cluster forms.
For example, cluster C3 is formed (i.e., splits off from C1) for λ = 0.055463 and consists of
points 8 and 9. Therefore, the value of λmin for C3 is 0.055463. The value of λp for points 8
and 9 is 0.066667, corresponding with a distance of 15.0 at which they become singletons
(or leaves in the tree). The contribution of each point to the persistence of the cluster C3
is thus 0.066667 – 0.055463 = 0.011204. The total value of the persistence of cluster C3 is
therefore the sum of this contribution for both points, or 0.022408. Similar calculations yield
the persistence of cluster C2 as 0.041400. This is summarized in Figure 20.17. For the sake
of completeness, the results for cluster C1 are included as well, although those are ignored
by the algorithm, since C1 is reset as root of the tree.
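The arithmetic can be verified in a few lines (a worked check of the numbers in the text, not GeoDa output):

lam_min_C3 = 0.055463            # density level at which C3 is formed
lam_p = 0.066667                 # level at which points 8 and 9 become singletons
per_point = lam_p - lam_min_C3   # 0.011204 per point
stability_C3 = 2 * per_point     # 0.022408, summed over both points
print(per_point, stability_C3)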
The condensed tree only includes C1 (as root), C2 and C3. For example, the split of C2 into
C4 and 7 is based on the following rationale. Since the persistence of a singleton is zero,
and C4 includes one less point than C2, but with the same value for λmin , the sum of the
persistence of C4 and 7 is less than the value for C2, hence the condensed tree stops there.
The condensed tree allows for different values of λ to play a role in the clustering mechanism,
potentially corresponding with a separate critical cut-off distance for each cluster.
20.5.4 Outliers
Related to the identification of density clusters is a concern to find points that do not belong
to a cluster and are therefore classified as outliers. Parametric approaches are typically
based on some function of the spread of the underlying distribution, such as the well-known
3-sigma rule for a normal density.7
An alternative approach is based on non-parametric principles. Specifically, the distance of
an observation to the nearest cluster can be considered as a criterion to classify outliers.
Campello et al. (2015) proposed a post-processing of the HDBSCAN results to characterize
the degree to which an observation can be considered an outlier. The method is referred to
as GLOSH, which stands for Global-Local Outlier Scores from Hierarchies.
7 These approaches are sensitive to the influence of the outliers on the estimates of central tendency and
spread. This works in two ways. On the one hand, outliers may influence the parameter estimates such that
their presence could be masked, e.g., when the estimated variance is larger than the true variance. The
reverse effect is called swamping, where observations that are legitimately part of the distribution are made
to look like outliers.
The logic underlying the outlier detection is closely related to the notion of cluster membership
just discussed. In fact, the probability of being an outlier is the complement of the cluster
membership.
The rationale behind the index is to compare the density threshold for which the point is
still attached to a cluster (λp in the previous discussion) and the highest density threshold
λmax for that cluster. The GLOSH index for a point p is then:
GLOSHp = (λmax − λp) / λmax,
or, equivalently, 1 − λp/λmax, the complement of the cluster membership for all but Noise
points. For the latter, since the cluster membership was set to zero, the actual λmin needs
to be used.
Given the inverse relationship between density and distance threshold, an alternative
formulation of the outlier index is:
GLOSHp = 1 − dmin/dp,core,
where dmin is the distance that corresponds to the density threshold λmax and dp,core is the
core distance of point p.
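In code, the score is a one-liner, given arrays with each point's λp and the λmax of its cluster (illustrative names):

import numpy as np

def glosh(lam_p, lam_max):
    # GLOSH_p = (λmax − λp) / λmax = 1 − λp / λmax; values near 1 flag outliers
    return 1.0 - np.asarray(lam_p) / np.asarray(lam_max)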
20.5.5 Implementation
HDBSCAN is invoked from the Menu or the cluster toolbar icon (Figure 20.1) as the second
item in the density clusters subgroup, Clusters > HDBSCAN. The overall setup is
very similar to that for DBSCAN, except that there is no threshold distance. Again, the
coordinates are taken as XKM and YKM.
The two main parameters are Min Cluster Size and Min Points. These are typically
set to the same value, but there may be instances where a larger value for Min Cluster
Size is desired. In the illustration, these values are set to 10. The Min Points parameter
drives the computation of the core distance for each point, which forms the basis for the
construction of the minimum spanning tree. As before, Transformation is set to Raw and
Distance Function to Euclidean.
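For reference, the Python implementation associated with McInnes and Healy (2017) exposes the same two parameters. A hedged sketch with synthetic stand-in coordinates (this is the separate hdbscan package, not part of GeoDa):

import numpy as np
import hdbscan

rng = np.random.default_rng(0)
coords = rng.uniform(0, 500, size=(261, 2))   # stand-in for the projected coordinates
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=10)
labels = clusterer.fit_predict(coords)        # noise points are labeled -1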
The Run button brings up a cluster map and the associated dendrogram. The cluster map,
shown in Figure 20.18, reveals 7 clusters, ranging in size from 67 to 10, with 69 Noise points.
The Summary button reveals an overall fit of 0.970, which is the best value achieved so far.
In and of itself, the raw dendrogram is not of that much interest. One interesting feature of
the dendrogram in GeoDa is that it supports linking and brushing, which makes it possible
to explore which branches in the tree the subsets of points or clusters selected in one of the
maps belong to. The reverse is supported as well, so that branches in the tree can be selected
and their counterparts identified in other maps and graphs.
In Figure 20.19, the 19 observations are selected that form cluster 5, consisting of bank
locations on the island of Sicily. The corresponding branches in the dendrogram are high-
lighted. This includes all the splits that are possible after the formation of the final cluster,
in contrast to the condensed tree, considered next.
Starting from the root (top), with λ = 0, the left-hand axis shows how increasing values of
λ (going down) result in branching of the tree. The shaded areas are not rectangles, but
gradually taper off, as single points are shed from the main cluster.
The optimal clusters are identified by an oval with the same color as the cluster label in the
cluster map. The shading of the tree branches corresponds to the number of points contained
in the branch, with lighter suggesting more points. In Figure 20.20, the same 19 members of
cluster 5 are selected as before. In contrast to the dendrogram, no splits occur below the
formation of the cluster, highlighted in red.
In GeoDa, the condensed tree supports linking and brushing as well as a limited form of zoom
operations. The latter may be necessary for larger data sets, where the detail is difficult
to distinguish.
Note that the units of the coordinates underlying the map and those used
in the HDBSCAN algorithm must be the same. In the current example, this means that COORD_X and
COORD_Y should be used instead of their kilometer equivalents. The cluster solutions are the same, since
it is a simple rescaling of the coordinates.
Displaying the core distances is one of the Heat Map options, invoked by right clicking on the cluster point map. The core
core distance variable must be specified in the Specify Core Distance option. This then
yields a heat map as in Figure 20.22, where the core distances associated with each cluster
are shown in the corresponding cluster color.
The overall pattern in the figure corresponds to gradations in the cluster membership map.
Part VI
Epilogue
21
Postscript – The Limits of Exploration
In this first volume, devoted to exploring spatial data, I have outlined a progression of
techniques to aid in finding interesting patterns of dependence and heterogeneity in geospatial
data, supported by the GeoDa software. The highly interactive process of discovery moves
from simple maps and graphs to a more structured visualization of clusters, outliers and
indications of potential structural breaks.
The main topic in this volume is the investigation of spatial correlation, i.e., the match or
tension between attribute similarity and locational similarity. In that quest, the primary
focus is on the identification of the locations of univariate clusters and spatial outliers by means
of a range of local indicators of spatial association (LISA).
For a single variable, Tobler’s first law of geography suggests the omnipresence of positive
spatial autocorrelation, with a distance decay effect for the dependence. This is indeed
observed for many variables in empirical applications. However, Tobler’s law does not
necessarily generalize to a bivariate or multivariate setting, where the situation is considerably
more complex (Anselin and Li, 2020). As discussed in more detail in Chapter 18, in a
multivariate setting, not only is there the tension between attribute and locational similarity,
but there is an additional dimension of inter-attribute similarity, which is typically not
uniform across space.
The GeoDa software is designed to make this process of discovery easy and intuitive. However,
this may also unintentionally facilitate potentially pointless clicking through the range of
interfaces until one finds an outcome one likes, without fully understanding the limitations
of the techniques involved. The extensive discussion of the methods in the book is aimed at
remedying the second aspect. Nevertheless, some caution is needed.
First and foremost, exploration is not the same as explanation. Interesting patterns may be
discovered, but that does not necessarily identify the process(es) that yielded the patterns.
As stated strongly by Heckman and Singer (2017): “data never speak for themselves.” As
argued in the discussion of global spatial autocorrelation in Section 13.5.3, spatial data are
characterized by the inverse problem, in the sense that the same pattern may be generated by
very different spatial processes. More specifically, there is an important distinction between
true contagion and apparent contagion. In cross-sectional data, both types of processes yield
a clustered pattern. However, the pattern that follows from apparent contagion is generated
by a spatially heterogeneous process and not a dependent process, as is typically assumed.
Without further information, such as a time dimension, it is impossible to distinguish between
the two generating processes.
The interpretation and validation of outcomes of an exploratory (spatial) data analysis has
been the subject of growing discussion in the literature, as reviewed in Section 4.2.1. The
exploratory process is neither inductive nor deductive, but rather abductive, involving a
move back and forth between data analysis, hypothesis generation and reformulation, as
well as the addition of new information, an approach sometimes referred to as the “Sherlock
Holmes method” (Gahegan, 2009; Heckman and Singer, 2017).
How such a discovery process is carried out has become increasingly important with the
advent of big data, and the associated argument for data-driven discovery as the fourth
paradigm in scientific reasoning (see, e.g., Hey et al., 2009; Gahegan, 2020).
In this final chapter, I want to briefly consider three broader issues that run the risk of
getting lost in the excitement of the discovery process driven by interactive software such as
GeoDa: (1) the potential pitfalls of data science and their implications for scientific reasoning;
(2) the limitations intrinsic to spatial analysis; and (3) reproducible research.
A critical aspect of the abductive approach is how to deal with surprising results, or with
results that run counter to pre-conceived ideas. As outlined in detail in Nuzzo (2015), among
others, it is easy to fool oneself by finding patterns where there are none, by focusing on
explanations that fit one’s prior convictions, or by failing to consider potential alternative
hypotheses. This can result in confirmation bias (i.e., finding what one sets out to find), as
well as disconfirmation bias (tendency to reject results that are counter to one’s priors). In
GeoDa, the extensive use of the permutation approach to represent spatial randomness in
the data is one way to partially address this concern, but by itself, it is insufficient.
Even without nefarious motivations (e.g., driven by the pressure to publish), problems
such as p-hacking (searching until significant results are found), HARKing (hypotheses
after the results are known), JARKing (justifying after the results are known), and the like
can unwittingly infiltrate an exploratory spatial data analysis. This has led to extensive
discussions in the literature as to how these issues affect the process of scientific discovery
(among others, Kerr, 1998; Simmons et al., 2011; Gelman and Loken, 2014; Gelman and
Hennig, 2017; Rubin, 2017). Sound scientific reasoning actively looks for ways in which one
might be wrong: a systematic and enduring search for what might be wrong, and an iterative
process that goes back and forth between potential explanations and evidence.1 Ideally, the
process of data exploration should be guided by these principles.
A critical notion in this regard is the researcher degrees of freedom, a reference to the many
decisions one makes with respect to the data that are included, the hypotheses considered
(and not considered), and methods selected, in the so-called garden of forking paths (Gelman
and Loken, 2014). In the context of the methods considered here and implemented in GeoDa,
aspects such as the selection of spatial scale, how to deal with outliers, whether to apply
imputation for missing values, the choice of spatial weights, the treatment of unconnected
observations (islands or isolates), and various tuning parameters are prime examples of
decisions one has to make that may affect the outcome of the data exploration. Careful
attention to these decisions, ideally accompanied by a sensitivity analysis, can partially
remedy the problem.
Finally, as in any data science, spatial data science similarly may suffer from well-known
generic pitfalls, for example as outlined in the book by Smith and Cordes (2019). Their book
specifically lists “using bad data, putting data before theory, worshipping math, worshipping
computers, torturing data, fooling yourself, confusing correlation with causation, regression
toward the mean and doing harm” as examples of such pitfalls. Any serious student of
exploratory spatial data analysis should become familiar with these potential traps and learn
to recognize and avoid them.
1 For examples of how this applies to data science and spatial data science, see, e.g., https://puttingscienceintodatascience.org.
In addition to these aspects associated with data science in general, spatial data science
also faces its own special challenges. These include the ecological fallacy (Robinson, 1950;
Goodman, 1953; 1959), the modifiable areal unit problem, or MAUP (Openshaw and Taylor,
1979), the change of support problem (Gotway and Young, 2002), as well as the more general
issue of the importance of spatial scale (Goodchild, 2011; Oshan et al., 2022). Many of
these are just special cases of well-known challenges associated with any type of statistical
analysis. For example, the ecological fallacy was first raised in sociology and it cautions
against interpreting the results of aggregate (spatial) analysis to infer individual behavior.
The change of support problem concerns the combination of data at various spatial scales,
such as point data (e.g., measurements by environmental sensors) and areal data (e.g., health
outcomes at the census tract level), and associated issues of data aggregation and imputation,
shared with the attention to scale in geographical analysis.
The MAUP stands out as particular to spatial analysis. It includes both aspects of data
aggregation as well as spatial arrangement, or, zonation. In essence, MAUP suggests that
different spatial scales and different areal boundaries will yield different and sometimes
conflicting statistical results. In many instances in the social sciences, the boundaries are
pre-set (e.g., administrative districts) and there is little one can do about it, other than being
careful in phrasing the findings of an analysis. This is particularly important when areal
boundaries do not align with the spatial scale of the processes investigated. For example,
in the U.S., census tracts are typically assumed to correspond with neighborhoods, even
though this is seldom the case. Similarly, counties are identified with labor markets, but this
is clearly invalid in metro areas (consisting of multiple counties) or in the sparsely populated
large counties in the west.
In terms of the spatial exploration, this means that the discovery of clusters needs to be
interpreted with caution. A cluster may be nothing more than an indication of poor spatial
scale (e.g., spatial units of observation much smaller than the process of interest), especially
when the size of the areal units is heterogeneous across the data set. On the other hand,
when individual spatial observations are available, the regionalization methods from Volume
2 may be applied to group them into meaningful spatial aggregates.
MAUP and its associated challenges do not invalidate spatial analysis, but they may limit the
relevance of some of the findings. Where possible, sensitivity analysis should be implemented.
A second type of limit to spatial analysis follows from the enormous advances made in
machine learning and artificial intelligence. For example, deep learning techniques (e.g.,
Goodfellow et al., 2016) are able to identify patterns without much prior (spatial) structure,
as long as they are trained on very large data sets. GeoAI pertains to the application of
these new methods to the analysis of geospatial data (Janowicz et al., 2020). Most current
applications have been in the physical domain, such as land cover classification by means of
remote sensing, landscape feature extraction and way finding. The focus in GeoAI has been
on so-called feature engineering, i.e., identifying those spatial characteristics that should
be included in the training data sets to yield good predictive performance. To date, much
of GeoAI can be viewed as applying data science to spatial data, rather than spatial data
science as conceived of in this book. Nevertheless, this raises an important question about
the relevance of spatial constructs and spatial thinking in AI. Does this imply that spatial
analysis as considered here will become irrelevant?
While it may be tempting to reach this conclusion, the powerful new deep learning methods
are not a panacea for all empirical situations. Specifically, in order to perform well, these
techniques require very large training data sets, consisting of millions (and even billions)
of data points. Such data sets are still rather rare in most contexts of empirical practice,
especially in the social sciences. In addition, the objective in exploratory spatial data analysis
is to discover patterns and generate hypotheses, not prediction, the typical focus in AI. So,
the insights gained from the methods covered in this book will remain relevant for some time.
Nevertheless, the extent to which spatial structure and explicit spatial features contribute to
the performance of deep learning methods, or are, instead, discovered by these methods
(and thus rendered irrelevant), remains to be answered.
A final concern pertaining to the exploration of spatial data is the extent to which it can
be reproducible and replicable. Reproducibility refers to obtaining the same results in a
repetition of the analysis on the same data (by the researchers themselves or by others),
replicability to obtaining the same type of results using different data.
The growing demands for transparency and openness in the larger scientific community
(so-called TOP guidelines, for transparency and openness promotion) have resulted in
requirements of open data, open science and open software code. These concerns have
also been echoed in the context of spatial data science (among others, by Rey, 2009; 2023;
Singleton et al., 2016; Brunsdon and Comber, 2021; Kedron et al., 2021). A common approach
to accomplish this in spatial data analysis is the codification of workflows by means of visual
modeling languages, e.g., as implemented in various GIS software (see also Kruiger et al.,
2021). More formally, data, text and code are combined in notebooks, such as the well-known
Jupyter (originally for Python) and RStudio (originally for R) implementations (see Rowe
et al., 2020).
To what extent does the highly dynamic and interactive visualization in GeoDa meet these
requirements? At one level, it does not and cannot. The thrill of discovery can easily result in
a rapid succession of linked graphs and maps, in combination with various cluster analyses, in
a fairly unstructured manner that does not lend itself to codification in workflows. However,
this intrinsic lack of reproducibility can be remedied to some extent.
For example, it may be possible to embed elementary analyses (e.g., a Local Moran) into
a specific workflow (e.g., select variable, specify weights, identify clusters), although this
will necessarily result in a loss of spontaneity. Alternatively, as outlined in Appendix C,
the libgeoda architecture allows specific analyses to be embedded in notebook workflows
through the new rgeoda and pygeoda wrappers.
In addition, there are some simple ways to obtain a degree of reproducibility for major
aspects of the analysis, such as the specification of spatial weights, custom classifications and
various variable transformations through the project file (see, e.g., Sections 4.6.3, 9.2.1.1 and
10.3.5). Nevertheless, obtaining full reproducibility for dynamic and interactive visualization
remains a challenge.
In spite of these challenges, I believe that the exploratory perspective developed in the book
and implemented in the GeoDa software can be very efficient at generating useful insights. As
long as the approach is applied with caution, such as by including ample sensitivity analyses,
pitfalls can be avoided. Time will tell.
The second volume of this introduction to spatial data science is devoted to what is referred
to in the machine learning literature as unsupervised learning. The objective is to reduce
the complexity in multivariate data by means of dimension reduction, both in the attribute
dimension (e.g., principal components and multidimensional scaling), as well as in the
observational dimension (clustering and regionalization). The distinctive perspective offered
is to emphasize the relevance of spatial aspects, by spatializing the results of classic methods,
either through mapping in combination with linking and brushing, or by imposing spatial
constraints in the clustering algorithms themselves. This builds upon the foundations
developed in the current volume.
A
Appendix A – GeoDa Preference Settings
The Preferences item on the main GeoDa menu includes several options that can be set to
fine tune the appearance and performance of the software. They appear under two different
tabs, one for System and one for Data. The contents are shown in Figures A.1 and A.2. In
a Mac OSX operating system, the preferences are accessed under the GeoDa item on the
menu bar. In Windows and Linux, the item is under File.
The System preferences pertain to detailed settings for the map and plot properties, support
for multiple languages, general appearance and computational customization.
The map and plot appearance settings control the transparency of selected items, font size,
and an option to revert to the legacy method of highlighting selected observations using
cross-hatching (this was the method used in legacy GeoDa version 0.9.5-i).
In addition to the default English, GeoDa currently has language customization in all menus
and dialogs for Simplified Chinese, Russian, Spanish, Portuguese and French, with more
languages slated to be added in the future.
Several minor options determine the appearance of a few dialogs dealing with data entry,
and other options such as automatic crash detection and software updates, as well as Retina
display support (on Mac only). Finally, there are default settings for the random
number seed and for the number of cores and/or the GPU used in computation. Most of these are
best left to their default specification.
Under the Data tab, the most useful setting is the supported formats for dates and time (see
also Section 2.4.1.2 in Chapter 2). The other items pertain to the interaction with remote
data base servers, as well as the numerical stopping criterion used in the auto-weighting
method for spatially constrained clustering (covered in Volume 2). Here as well, the default
should be fine for most situations in practice.
B
Appendix B – GeoDa Menu Structure
This Appendix lists the complete menu structure of GeoDa, with the exception of items
pertaining to spatial regression (covered in Anselin and Rey, 2014), but including the
clustering methods covered in Volume 2. The listing below shows the organization for a
MacOSX operating system. In Windows and Linux, Preferences appear under the File
menu. Otherwise, the structure is identical across operating systems.
• GeoDa
– About GeoDa
– Preferences
– Quit GeoDa
• File
– New from Recent
– New
– Save
– Save As
– Save Selected As
– Open Project
– Save Project
– Project Information
– Close
• Edit
– Time
– Category
• Tools
– Weights Manager
– Shape
∗ Points from Table
∗ Create Grid
– Spatial Join
– Dissolve
• Table
– Aggregate
– Merge
– Selection Tool
– Invert Selection
– Clear Selection
– Save Selection
– Move Selected to Top
– Calculator
– Add Variable
– Delete Variable(s)
• Clusters
– Hierarchical
– DBScan
– HDBScan
– SC K Means
– SCHC
– skater
– redcap
– AZP
– max-p
– Cluster Match Map
– Make Spatial
– Validation
• Space
– Univariate Moran’s I
– Bivariate Moran’s I
– Differential Moran’s I
– Moran’s I with EB Rate
– Univariate Local Moran’s I
– Bivariate Local Moran’s I
– Differential Local Moran’s I
– Local Moran’s I with EB Rate
– Local G
– Local G*
– Univariate Local Join Count
– Bivariate Local Join Count
– Co-location Join Count
– Univariate Local Geary
– Multivariate Local Geary
– Univariate Quantile LISA
– Multivariate Quantile LISA
– Local Neighbor Match Test
– Spatial Correlogram
– Distance Scatter Plot
• Time
– Time Player
– Time Editor
C
Appendix C – Scripting with GeoDa via the libgeoda Library
The architecture of the GeoDa software is based on a very tight integration of the graphical user
interface and the actual computations. While effective in a traditional desktop environment,
this approach becomes less efficient when moving the functionality to a different computing
platform, such as a browser in a web-GIS, or a cyberGIS environment (Anselin et al., 2004;
Wang et al., 2013).
An alternative to the GUI-driven approach toward spatial data science taken in desktop
GeoDa is to leverage open-source software development environments, such as R (Pebesma
and Bivand, 2023) or Python (Rey et al., 2023). Such environments facilitate scripting and
stress reproducibility, which has become of increased relevance and importance in spatial
data science (see the discussion in Chapter 21).
The GeoDa desktop environment does not lend itself to scripting. It is also ill-suited for
repeated execution of the same application, such as in a simulation experiment. Finally, apart
from the limited record in the project file, there is no explicit way to ensure reproducibility.
In light of these limitations of the original design, a major refactoring effort was embarked
upon to separate the user interaction in the software from the core computational functionality,
and to collect the latter in a library, named libgeoda. The library contains the same C++
code as in the computations underlying desktop GeoDa, but has a more limited range of
functionality. The focus has been on methods that are (still) unique to GeoDa, such as some
of the recent LISA statistics. In addition, applications are included where the reliance on
C++ yields large performance improvements in terms of speed and scalability. Examples
include weights creation, and permutation tests from this Volume, as well as regionalization
methods covered in Volume 2.
The libgeoda library has a clearly defined Application Programming Interface (API),
which allows other C++ code to access its functionality directly. In fact, this is what
currently happens under the hood for part of desktop GeoDa, and in the experimental
web-GeoDa (jsgeoda, implemented through javascript). In addition to achieving a more
flexible interaction with different graphical user interface implementations, the API also
allows other software, such as R or Python programs to access the functionality through
well-defined wrapper code. The overall architecture is illustrated in Figure C.1.
The primary focus in this effort so far has been to create an R package, rgeoda, and a Python
module, pygeoda. These provide easy access to the functionality in libgeoda through a
native interface and designated middleware. The interaction between R and Python and the
C++ library is implemented under the hood, so that from a user’s perspective, everything
works natively as in any other R package or Python module.
As shown in Figure C.1, the core of the libgeoda library consists of three broad categories
of functionality: spatial weights, LISA, and spatial clustering (regionalization). In addition,
there are a number of helper functions, such as support for different map classifications (to
facilitate visualization) and variable standardization (for use in the cluster routines). The
functionality in pygeoda and rgeoda is the same, with only minor differences to reflect the
particular characteristics of each software environment. For example, in Python, methods
are attributes of a class (e.g., a spatial weights class) and invoked as such. In contrast, in R,
the typical approach is to apply a function to an object to extract the relevant information
(e.g., spatial weights characteristics).
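As a brief illustration of the Python side, a sketch along the lines of the documented pygeoda examples (the file name is a placeholder and the column accessor may vary slightly across versions):

import pygeoda

gda = pygeoda.open("Guerry.shp")            # placeholder path to a sample data set
w = pygeoda.queen_weights(gda)              # queen contiguity weights
crm = gda["Crm_prp"]                        # a numeric column from the data set
lisa = pygeoda.local_moran(w, crm)          # Local Moran's I with permutation inference
print(lisa.lisa_clusters()[:5])             # cluster category per observation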
Extensive details and specific examples can be found in Anselin et al. (2022) and in the
documentation on the GitHub site. Since the functionality of libgeoda mimics what has
been covered in the book, it is not further considered here. The software development effort
is ongoing.
Bibliography
Akbari, K., Winter, S., and Tomko, M. (2023). Spatial causality: A systematic review on
spatial causal inference. Geographical Analysis, 55: 56–89.
Algeri, C., Anselin, L., Forgione, A. F., and Migliardo, C. (2022). Spatial dependence in the
technical efficiency of local banks. Papers in Regional Science, 101:385–416.
Amaral, P., de Carvalho, L. R., Rocha, T. A. H., da Silva, N. C., and Vissoci, J. R. N. (2019).
Geospatial modeling of microcephaly and zika virus spread patterns in Brazil. PLoS ONE,
14.
Andrienko, G., Andrienko, N., Keim, D., MacEachren, A., and Wrobel, S. (2011). Challenging
problems of geospatial visual analytics. Journal of Visual Languages and Computing,
22:251–256.
Andrienko, N., Lammarsch, T., Andrienko, G., Fuchs, G., Keim, D., Miksch, S., and Rind, A.
(2018). Viewing visual analytics as model building. Computer Graphics Forum, 37:275–299.
Angrist, J. and Pischke, J.-S. (2015). Mastering ’Metrics, The Path from Cause to Effect.
Princeton University Press, Princeton, New Jersey.
Anselin, L. (1988). Spatial Econometrics: Methods and Models. Kluwer Academic Publishers,
Dordrecht, The Netherlands.
Anselin, L. (1990). What is special about spatial data? Alternative perspectives on spatial
data analysis. In Griffith, D. A., editor, Spatial Statistics, Past, Present and Future, pages
66–77. Institute of Mathematical Geography, (IMAGE), Ann Arbor, MI.
Anselin, L. (1992). SpaceStat, a Software Program for Analysis of Spatial Data. National
Center for Geographic Information and Analysis (NCGIA), University of California, Santa
Barbara, CA.
Anselin, L. (1994). Exploratory spatial data analysis and geographic information systems. In
Painho, M., editor, New Tools for Spatial Analysis, pages 45–54. Eurostat, Luxembourg.
Anselin, L. (1995). Local indicators of spatial association — LISA. Geographical Analysis,
27:93–115.
Anselin, L. (1996). The Moran scatterplot as an ESDA tool to assess local instability in
spatial association. In Fischer, M., Scholten, H., and Unwin, D., editors, Spatial Analytical
Perspectives on GIS in Environmental and Socio-Economic Sciences, pages 111–125. Taylor
and Francis, London.
Anselin, L. (1998). Exploratory spatial data analysis in a geocomputational environment. In
Longley, P. A., Brooks, S., Macmillan, B., and McDonnell, R., editors, Geocomputation: A
Primer, pages 77–94. John Wiley, New York, NY.
Anselin, L. (1999). Interactive techniques and exploratory spatial data analysis. In Lon-
gley, P. A., Goodchild, M. F., Maguire, D. J., and Rhind, D. W., editors, Geographical
Information Systems: Principles, Techniques, Management and Applications. John Wiley,
New York, NY.
Assunção, R. M. and Reis, E. A. (1999). A new proposal to adjust Moran’s I for population
density. Statistics in Medicine, 18:2147–2161.
Banerjee, S., Carlin, B. P., and Gelfand, A. E. (2015). Hierarchical Modeling and Analysis
for Spatial Data, 2nd Edition. Chapman & Hall/CRC, Boca Raton.
Bavaud, F. (1998). Models for spatial weights: A systematic look. Geographical Analysis,
30:153–171.
Becker, R. A. and Cleveland, W. (1987). Brushing scatterplots. Technometrics, 29:127–142.
Becker, R. A., Cleveland, W., and Shyu, M.-J. (1996). The visual design and control of
Trellis displays. Journal of Computational and Graphical Statistics, 5:123–155.
Becker, R. A., Cleveland, W., and Wilks, A. (1987). Dynamic graphics for data analysis.
Statistical Science, 2:355–395.
Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press, Princeton,
N.J.
Benjamin, D. and 72 others (2018). Redefine statistical significance. Nature Human Behavior,
2:6–10.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical
and powerful approach to multiple testing. Journal of the Royal Statistical Society B,
57:289–300.
Bjornstad, O. N. and Falck, W. (2001). Nonparametric spatial covariance functions: Estima-
tion and testing. Environmental and Ecological Statistics, 8:53–70.
Blommestein, H. J. (1985). Elimination of circular routes in spatial dynamic regression
equations. Regional Science and Urban Economics, 15:121–130.
Brewer, C. A. (1997). Spectral schemes: controversial color use on maps. Cartography and
Geographic Information Systems, 49:280–294.
Brewer, C. A. (2016). Designing Better Maps. A Guide for GIS Users, 2nd Edition. ESRI
Press, Redlands, CA.
Brewer, C. A., Hatchard, G., and Harrower, M. A. (2003). ColorBrewer in print: a catalog
of color schemes for maps. Cartography and Geographic Information Science, 30:5–32.
Brock, W. A. and Durlauf, S. N. (2001). Discrete choice with social interactions. Review of
Economic Studies, 59:235–260.
Brunsdon, C. and Comber, L. (2015). Geocomputation, a Practical Primer. Sage, Thousand
Oaks, CA.
Brunsdon, C. and Comber, L. (2019). An Introduction to R for Spatial Analysis and Mapping,
2nd Edition. Sage, Thousand Oaks, CA.
Brunsdon, C. and Comber, L. (2021). Opening practice: supporting reproducibility and
critical spatial data science. Journal of Geographical Systems, 23:477–496.
Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E.-K., Swayne, D. F., and Wickham,
H. (2009). Statistical inference for exploratory data analysis and model diagnostics.
Philosophical Transactions of the Royal Society A, 367:4361–4383.
Buja, A., Cook, D., and Swayne, D. (1996). Interactive high dimensional data visualization.
Journal of Computational and Graphical Statistics, 5:78–99.
Campello, R. J., Moulavi, D., and Sander, J. (2013). Density-based clustering based on
hierarchical density estimates. In Pei, J., Tseng, V. S., Cao, L., Motoda, H., and Xu, G.,
editors, Advances in Knowledge Discovery and Data Mining. PAKDD 2013. Lecture Notes
in Computer Science, Vol. 7819, pages 160–172.
Campello, R. J., Moulavi, D., Zimek, A., and Sander, J. (2015). Hierarchical density
estimates for data clustering, visualization, and outlier detection. ACM Transactions on
Knowledge Discovery from Data, 10(1).
Carr, D. B. and Pickle, L. W. (2010). Visualizing Data Patterns with Micromaps. Chapman
& Hall/CRC, Boca Raton, FL.
Case, A. (1991). Spatial patterns in household demand. Econometrica, 59:953–965.
Case, A., Rosen, H. S., and Hines, J. R. (1993). Budget spillovers and fiscal policy interde-
pendence: Evidence from the states. Journal of Public Economics, 52:285–307.
Chen, C., Härdle, W., and Unwin, A. (2008). Handbook of Data Visualization. Springer,
Berlin, Germany.
Chilès, J.-P. and Delfiner, P. (1999). Geostatistics, Modeling Spatial Uncertainty. John Wiley
& Sons, New York, NY.
Chow, G. (1960). Tests of equality between sets of coefficients in two linear regressions.
Econometrica, 28:591–605.
Clayton, D. and Kaldor, J. (1987). Empirical Bayes estimates of age-standardized relative
risks for use in disease mapping. Biometrics, 43:671–681.
Cleveland, W. S. (1979). Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association, 74:829–836.
Cleveland, W. S. (1993). Visualizing Data. Hobart Press, Summit, NJ.
Cleveland, W. S., Grosse, E., and Shyu, W. M. (1992). Local regression models. In Chambers,
J. M. and Hastie, T. J., editors, Statistical Models in S, pages 309–376. Wadsworth and
Brooks/Cole, Pacific Grove, CA.
Cleveland, W. S. and McGill, M. (1988). Dynamic Graphics for Statistics. Wadsworth,
Pacific Grove, CA.
Cliff, A. and Ord, J. K. (1973). Spatial Autocorrelation. Pion, London.
Cliff, A. and Ord, J. K. (1981). Spatial Processes: Models and Applications. Pion, London.
Comber, L. and Brunsdon, C. (2021). Geographical Data Science and Spatial Analysis. An
Introduction in R. Sage, Thousand Oaks, CA.
Cressie, N. (1993). Statistics for Spatial Data. Wiley, New York.
Dasu, T. and Johnson, T. (2003). Exploratory Data Mining and Data Cleaning. John Wiley,
Hoboken, NJ.
de Berg, M., Cheong, O., van Kreveld, M., and Overmars, M. (2008). Computational
Geometry, Algorithms and Applications, 3rd Edition. Springer Verlag, Berlin.
de Castro, M. C. and Singer, B. H. (2006). Controlling the false discovery rate: An
application to account for multiple and dependent tests in local statistics of spatial
association. Geographical Analysis, 38:180–208.
Deutsch, C. and Journel, A. (1998). GSLIB: Geostatistical Software Library and User’s
Guide. Oxford University Press, New York, NY.
Dorling, D. (1996). Area Cartograms: Their Use and Creation. CATMOG 59, Institute of
British Geographers.
Dray, S., Saïd, S., and Débias, F. (2008). Spatial ordination of vegetation data using a
generalization of Wartenberg’s multivariate spatial correlation. Journal of Vegetation
Science, 19:45–56.
Dykes, J. (1997). Exploring spatial data representation with dynamic graphics. Computers
and Geosciences, 23:345–370.
Dykes, J., MacEachren, A. M., and Kraak, M.-J. (2005). Exploring Geovisualization. Elsevier,
Oxford, United Kingdom.
Efron, B. and Hastie, T. (2016). Computer Age Statistical Inference. Algorithms, Evidence,
and Data Science. Cambridge University Press, Cambridge, UK.
Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for
discovering clusters in large spatial databases with noise. In KDD-96 Proceedings, pages
226–231.
Farah Rivadeneyra, I. (2017). A geospatial analysis of Mexico’s distal effects on food
insecurity. Master’s thesis, University of Chicago, Chicago, IL.
Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American
Statistical Association, 53:789–798.
Fotheringham, A. S., Brunsdon, C., and Charlton, M. (2002). Geographically Weighted
Regression. John Wiley, Chichester.
Friendly, M. (2008). A brief history of data visualization. In Chen, C., Härdle,
W., and Unwin, A., editors, Handbook of Data Visualization, pages 15–56. Springer, Berlin,
Germany.
Gahegan, M. (2009). Visual exploration and explanation in geography: analysis with lights.
In Miller, H. and Han, J., editors, Geographic Data Mining and Knowledge Discovery,
pages 291–315. Taylor and Francis, Boca Raton, FL.
Gahegan, M. (2020). Fourth paradigm GIScience? prospects for automated discovery and
explanation from data. International Journal of Geographical Information Science, 34:1–21.
Gan, J. and Tao, Y. (2017). On the hardness and approximation of Euclidean DBSCAN.
ACM Transactions on Database Systems (TODS), 42:14.
Gao, S. (2021). Geospatial artificial intelligence (GeoAI). Oxford Bibliographies, Oxford
University Press.
Geary, R. (1954). The contiguity ratio and statistical mapping. The Incorporated Statistician,
5:115–145.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2014).
Bayesian Data Analysis, 3rd Edition. Chapman & Hall, Boca Raton, FL.
Gelman, A. and Hennig, C. (2017). Beyond subjective and objective in statistics. Journal of
the Royal Statistical Society A, 180:967–1033.
Gelman, A. and Loken, E. (2014). The statistical crisis in science. American Scientist,
102:460–465.
Hubert, L. J., Golledge, R., Costanzo, C. M., and Gale, N. (1985). Measuring association
between spatially defined variables: An alternative procedure. Geographical Analysis,
17:36–46.
Hullman, J. and Gelman, A. (2021). Designing for interactive exploratory data analysis
requires theories of graphical inference. Harvard Data Science Review, 3.
Inselberg, A. (1985). The plane with parallel coordinates. Visual Computer, 1:69–91.
Inselberg, A. and Dimsdale, B. (1990). Parallel coordinates: A tool for visualizing multi-
dimensional geometry. Proceedings of the IEEE Visualization 90, pages 361–378.
Isaaks, E. H. and Srivastava, R. M. (1989). An Introduction to Applied Geostatistics. Oxford
University Press, New York, NY.
Isard, W. (1960). Methods of Regional Analysis. MIT Press, Cambridge, MA.
Isard, W. (1969). General Theory. MIT Press, Cambridge, MA.
James, W. and Stein, C. (1961). Estimation with quadratic loss. Proceedings of the Fourth
Berkeley Symposium on Mathematical Statistics and Probability, 1:361–379.
Janowicz, K., Gao, S., McKenzie, G., Hu, Y., and Bhaduri, B. (2020). GeoAI: spatially
explicit artificial intelligence techniques for geographic knowledge discovery and beyond.
International Journal of Geographical Information Science, 34:625–636.
Jenks, G. (1977). Optimal data classification for choropleth maps. Occasional paper no. 2,
Department of Geography, University of Kansas, Lawrence, KS.
Journel, A. and Huijbregts, C. (1978). Mining Geostatistics. Academic Press, London.
Kafadar, K. (1996). Smoothing geographical data, particularly rates of disease. Statistics in
Medicine, 15:2539–2560.
Kafadar, K. (1997). Geographic trends in prostate cancer mortality: An application of
spatial smoothers and the need for adjustment. Annals of Epidemiology, 7:35–45.
Kaufman, L. and Rousseeuw, P. (2005). Finding Groups in Data: An Introduction to Cluster
Analysis. John Wiley, New York, NY.
Kedron, P., Frazier, A. A., Trgovac, A. B., Nelson, T., and Fotheringham, A. S. (2021).
Reproducibility and replicability in geographical analysis. Geographical Analysis, 53:135–
147.
Kelejian, H. H. and Prucha, I. R. (2007). HAC estimation in a spatial framework. Journal
of Econometrics, 140:131–154.
Kerr, N. (1998). HARKing: hypothesizing after the results are known. Personality and
Social Psychology Review, 2:196–217.
Kessler, F. C. and Battersby, S. E. (2019). Working with Map Projections: A Guide to Their
Selection. CRC Press, Boca Raton, FL.
Kielman, J., Thomas, J., and May, R. (2009). Foundations and frontiers in visual analytics.
Information Visualization, 8:239–246.
Kolak, M. and Anselin, L. (2020). A spatial perspective on the econometrics of program
evaluation. International Regional Science Review, 43:128–153.
412 Bibliography
Kolak, M., Bhatt, J., Park, Y. H., Padrón, N. A., and Molefe, A. (2020). Quantification of
neighborhood-level social determinants of health in the continental united states. JAMA
Network Open, 3(1):e1919928–e1919928.
Kraak, M. and MacEachren, A. (2005). Geovisualization and GIScience. Cartography and
Geographic Information Science, 32:67–68.
Kraak, M. and Ormeling, F. (2020). Cartography. Visualization of GeoSpatial Data, 4th
Edition. CRC Press, Boca Raton, FL.
Kruiger, J., Kasalica, V., Meerlo, R., Lamprecht, A.-L., Nyamsuren, E., and Schneider, S.
(2021). Loose programming of GIS workflows with geo-analytical concepts. Transactions
in GIS, 25:424–449.
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics – Theory and
Methods, 26:1481–1496.
Lancaster, T. (2000). The incidental parameter problem since 1948. Journal of Econometrics,
95:391–413.
Law, M. and Collins, A. (2018). Getting to Know ArcGIS Desktop, 5th Edition. Environmental
Systems Research Institute, Redlands, CA.
Lawson, A. B., Browne, W. J., and Rodeiro, C. L. V. (2003). Disease Mapping with
WinBUGS and MLwiN. John Wiley, Chichester.
Lee, J. A. and Verleysen, M. (2007). Nonlinear Dimensionality Reduction. Springer-Verlag,
New York, NY.
Lee, S.-I. (2001). Developing a bivariate spatial association measure: An integration of
Pearson’s r and Moran’s I. Journal of Geographical Systems, 3:369–385.
Lin, J. (2020). A local model for multivariate analysis: extending Wartenberg’s multivariate
spatial correlation. Geographical Analysis, 52:190–210.
Loader, C. (1999). Local Regression and Likelihood. Springer-Verlag, Heidelberg.
Loader, C. (2004). Smoothing: Local regression techniques. In Gentle, J. E., Härdle, W., and
Mori, Y., editors, Handbook of Computational Statistics: Concepts and Methods, pages
539–563. Springer-Verlag, Berlin.
Lovelace, R., Nowosad, J., and Muenchow, J. (2019). Geocomputation with R. CRC Press,
Boca Raton, FL.
MacEachren, A. M. and Kraak, M.-J. (1997). Exploratory cartographic visualization:
Advancing the agenda. Computers and Geosciences, 23:335–343.
MacEachren, A. M., Wachowicz, M., Edsall, R., Haug, D., and Masters, R. (1999). Con-
structing knowledge from multivariate spatiotemporal data: Integrating geographical
visualization with knowledge discovery in database methods. International Journal of
Geographical Information Science, 13(4):311–334.
Madry, S. (2021). Introduction to QGIS: Open Source Geographic Information System.
Locate Press, Chugiak, AK.
Mantel, N. (1967). The detection of disease clustering and a generalized regression approach.
Cancer Research, 27:209–220.
Marshall, R. J. (1991). Mapping disease and mortality rates using Empirical Bayes estimators.
Applied Statistics, 40:283–294.
Bibliography 413
McCann, P. (2001). Urban and Regional Economics. Oxford University Press, New York,
NY.
McInnes, L. and Healy, J. (2017). Accelerated hierarchical density clustering. In 2017 IEEE
International Conference on Data Mining Workshops (ICDMW), New Orleans, LA.
Monmonier, M. (1989). Geographic brushing: Enhancing exploratory analysis of the scatter-
plot matrix. Geographical Analysis, 21:81–4.
Monmonier, M. (1993). Mapping it Out: Expository Cartography for the Humanities and
Social Sciences. University of Chicago Press, Chicago, IL.
Monmonier, M. (2018). How to Lie with Maps, 3rd Edition. University of Chicago Press,
Chicago, IL.
Moran, P. A. (1948). The interpretation of statistical maps. Journal of the Royal Statistical
Society, B, 10:243–251.
Müller, D. and Sawitzki, G. (1991). Excess mass estimates and tests for multimodality.
Journal of the American Statistical Association, 86:738–746.
Munasinghe, R. L. and Morris, R. D. (1996). Localization of disease clusters using regional
measures of spatial autocorrelation. Statistics in Medicine, 15:893–905.
Newman, M. (2018). Networks. Oxford University Press, Oxford, United Kingdom.
Nuzo, R. (2015). Fooling ourselves. Nature, 526:182–185.
Oden, N. L. and Sokal, R. R. (1986). Directional autocorrelation: an extension of spatial
correlograms to two dimensions. Systematic Zoology, 35:608–617.
Okabe, A., Boots, B., Sugihara, K., and Chiu, S. N. (2000). Spatial Tessellations: Concepts
and Applications of Voronoi Diagrams, 2nd Edition. Wiley, New York, NY.
Openshaw, S., Charlton, M. E., Wymer, C., and Craft, A. (1987). A mark I geographical
analysis machine for the automated analysis of point data sets. International Journal of
Geographical Information Systems, 1:359–377.
Openshaw, S. and Taylor, P. (1979). A million or so correlation coefficients. In Wrigley, N.,
editor, Statistical Methods in the Spatial Sciences, pages 127–144. Pion, London.
Ord, J. K. and Getis, A. (1995). Local spatial autocorrelation statistics: Distributional issues
and an application. Geographical Analysis, 27:286–306.
Ord, J. K. and Getis, A. (2001). Testing for local spatial autocorrelation in the presence of
global autocorrelation. Journal of Regional Science, 41:411–432.
Oshan, T. M., Wolf, L., Sachdeva, M., Bardin, S., and Fotheringham, A. S. (2022). A scoping
review of the multiplicity of scale in spatial analysis. Journal of Geographical Systems,
24:293–324.
Pebesma, E. and Bivand, R. (2023). Spatial Data Science: With Applications in R. Chapman
& Hall/CRC, Boca Raton, FL.
Preston, S. H., Heuveline, P., and Guillot, M. (2001). Demography, Measuring and Modeling
Population Processes. Blackwell Publishers Ltd., Oxford, UK.
Prim, R. (1957). Shortest connection networks and some generalizations. Bell System
Technical Journal, 36:1389–1401.
414 Bibliography
Singleton, A., Spielman, S., and Brunsdon, C. (2016). Establishing a framework for open
geographic information science. International Journal of Geographical Information Science,
30:1507–1521.
Slocum, T. A., McMaster, R. B., Kessler, F. C., and Howard, H. H. (2023). Thematic
Cartography and Geovisualization, 4th Edition. CRC Press, Boca Raton, FL.
Smith, G. and Cordes, J. (2019). The 9 Pitfalls of Data Science. Oxford University Press,
Oxford, UK.
Snyder, J. P. (1993). Flattening the Earth: Two Thousand Years of Map Projections.
University of Chicago Press, Chicago, IL.
Sokal, R. R. (1979). Testing statistical significance of geographic variation patterns. System-
atic Zoology, 28:227–232.
Sokal, R. R., Oden, N. L., and Thompson, B. A. (1998a). Local spatial autocorrelation in a
biological model. Geographical Analysis, 30:331–354.
Sokal, R. R., Oden, N. L., and Thompson, B. A. (1998b). Local spatial autocorrelation in
biological variables. Biological Journal of the Linnean Society, 65:41–62.
Stein, M. L. (1999). Interpolation of Spatal Data, Some Theory for Kriging. Springer-Verlag,
New York.
Stewart, J. Q. (1947). Empirical mathematical rules concerning the distribution and
equilibrium of population. Geographical Review, 37:461–485.
Stuetzle, W. (1987). Plot windows. Journal of the American Statistical Association, 82:466–
475.
Stuetzle, W. and Nugent, R. (2010). A generalized single linkage method for estimating the
cluster tree of a density. Journal of Computational and Graphical Statistics, 19:397–418.
Thomas, J. and Cook, K. (2005). Illuminating the Path: The Research and Development
Agenda for Visual Analytics. IEEE Computer Society Press, Los Alamitos, CA.
Tiefelsdorf, M. (2002). The saddlepoint approximation of Moran’s I and local Moran’s Ii ’s
reference distribution and their numerical evaluation. Geographical Analysis, 34:187–206.
Tiefelsdorf, M. and Boots, B. (1995). The exact distribution of Moran’s I. Environment and
Planning A, 27:985–999.
Tobler, W. (1970). A computer movie simulating urban growth in the Detroit region.
Economic Geography, 46:234–240.
Tobler, W. (1973). Choropleth maps without class intervals? Geographical Analysis, 5:262–
265.
Tobler, W. (2004). Thirty five years of computer cartograms. Annals of the Association of
American Geographers, 94:58–73.
Tomlin, C. D. (1990). Geographic Information Systems and Cartographic Modeling. Prentice-
Hall, Englewood Cliffs, NJ.
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Graphics Press,
Cheshire, CT.
Tufte, E. R. (1997). Visual Explanations: Images and Quantities, Evidence and Narrative.
Graphics Press, Cheshire, CT.
416 Bibliography
Tukey, J. (1962). The future of data analysis. Annals of Mathematical Statistics, 33:1–67.
Tukey, J. (1977). Exploratory Data Analysis. Addison Wesley, Reading, MA.
Tukey, J. and Wilk, M. B. (1966). Data analysis and statistics: an expository overview. In
Proceedings of the November 7-10, 1966, Fall Joint Computer Conference, pages 695–709.
ACM.
Wall, P. and Devine, O. (2000). Interactive analysis of the spatial distribution of disease
using a geographic information system. Journal of Geographical Systems, 2(3):243–256.
Waller, L. A. and Gotway, C. A. (2004). Applied Spatial Statistics for Public Health Data.
John Wiley, Hoboken, NJ.
Wang, S. (2010). A cyberGIS framework for the synthesis of cyberinfrastructure, GIS, and
spatial analysis. Annals of the Association of American Geographers, 100:535–557.
Wang, S., Anselin, L., Badhuri, B., Crosby, C., Goodchild, M., Liu, Y., and Nyerges, T.
(2013). CyberGIS software: a synthetic review and integration roadmap. International
Journal of Geographical Information Science, 27:2122–2145.
Wartenberg, D. (1985). Multivariate spatial correlation: A method for exploratory geograph-
ical analysis. Geographical Analysis, 17:263–283.
Wegman, E. J. (1990). Hyperdimensional data analysis using parallel coordinates. Journal
of the American Statistical Association, 85:664–675.
Wegman, E. J. and Dorfman, A. (2003). Visualizing cereal world. Computational Statistics
and Data Analysis, 43(4):633–649.
Wickham, H. (2016). ggplot2 Elegant Graphics for Data Analysis. Springer Nature, London,
UK.
Wickham, H., Cook, D., Hofmann, H., and Buja, A. (2010). Graphical inference for Infovis.
IEEE Transactions on Visualization and Computer Graphics, 16:973–979.
Wishart, D. (1969). Mode analysis: a generalization of nearest neighbor which reduces
chaining effects. In Cole, A., editor, Numerical Taxonomy, pages 282–311, New York, NY.
Academic Press.
Worboys, M. and Duckham, M. (2004). GIS, A Computing Perspective. CRC Press, Boca
Raton, FL.