Spatial Statistics For Understanding Tissue Organization
Department of Information Technology and SciLifeLab, Centre for Image Analysis, Uppsala University, Uppsala, Sweden
FIGURE 1 | Schematic representations of objects, such as cells or mRNAs, in microscopy images, where each dot represents an object, and the color reflects the
object type (where gray is an unspecified type). (A) Simple representation, where each dot has a specific location in 2D tissue space. (B) The same data represented
as a graph, where each dot is a node, and nodes are connected based on a maximum distance criterion. (C) Dots can also be represented by a probability density
map, where warmer colors represent higher dot density, or (D) as counts in fixed spatial bins. Here, bins are squares and warmer colors represent higher object counts
per bin. Spatial statistics are used to test four different hypotheses (with the top row representing the random case): (H1) Visualization of hypothesis H1: Objects of
type A (green) are non-randomly distributed. (H2) Visualization of hypothesis H2: Objects of type A (green) are non-randomly distributed as compared to the
distribution of other objects (gray) in the same tissue sample. (H3) Visualization of hypothesis H3: Objects of type A (green) and B (blue) are non-randomly distributed
in relation to one another within the distribution of other objects (gray) in the same tissue sample. (H4) Visualization of hypothesis H4: There are groups of object types
(multiple colors in “niches”) that are non-randomly distributed within the tissue sample.
of connections or distance, reflecting a hypothesis on a maximum distance for interaction. The density map representation (Figure 1C) translates the object distribution into a probability map, where high values represent high object concentrations, but the exact spatial location of objects is lost. Finally, different types of binning can be applied (Figure 1D), providing a lower-resolution map with counts of objects per bin.

We review methods that explore the null hypotheses of randomness for either a single type of objects, pairs of objects, or multiple types of objects. We have created a set of synthetic images describing different scenarios of object distributions within a tissue section, illustrating that the question of randomness is often relative. We first explore a single type of objects, as shown in Figure 1H1, and propose the hypothesis H1: Objects of type A are non-randomly distributed. In a biological context this could be, e.g., quantifying the distribution of immune cells in the presence or absence of an infection. If we take the tissue context (all objects of other types) into account, as shown in Figure 1H2, the hypothesis becomes H2: Objects of type A are non-randomly distributed as compared to the distribution of other objects in the same tissue sample. In a biological context this could be, e.g., distribution of a certain cell type in tumor and stroma areas of a tissue. Next, we consider two types of objects, and their potential interaction or repulsion. This is illustrated in Figure 1H3, and the hypothesis is H3: Objects of type A and B are non-randomly distributed in relation to one another within the distribution of other objects in the same tissue sample. In a biological context the question could, e.g., be whether cancer cells interact with endothelial cells or not. Finally, if there are multiple types of objects, we may want to see if certain groups of objects tend to coincide and form so-called ‘niches’ of unique combinations of objects in the tissue, as shown in Figure 1H4.
In this case, we pose hypothesis H4: There are groups of object types (‘niches’) that are non-randomly distributed within the distribution of other objects in the tissue sample. This could be used for finding mRNAs that are co-expressed, where niches would then correspond to different cell types (Partel and Wählby, 2020).

In the following review, we group different spatial statistics methods according to what types of tissue patterns they investigate, and also summarize and discuss their theoretical ability to answer the four hypotheses we pose above.

2. SPATIAL STATISTICS ON A SINGLE TYPE OF OBJECT

In this section, we describe methods capable of testing hypotheses H1 (non-random distribution) and H2 (non-random distribution, compared to other objects). The input data can be described as points in space determining the presence of an object. The main idea is to identify and characterize spatially variable objects.

2.1. Ripley’s Function
Ripley’s function (Ripley, 1976) measures whether objects with discrete positions in space (see Figure 1A) follow random, dispersed, or clustered patterns. For each object, the function counts how many other objects of the same type appear within a given distance. Subsequently, the object counts are averaged over the whole dataset and the number is compared with the number of objects one would expect to find based on a completely spatially random pattern (null hypothesis). If the average number of objects found within the given distance is greater than for a random distribution, the dataset is clustered (see green dots in Figure 1H1-down). If the number is smaller, the dataset is dispersed. Ripley’s K function is generally calculated at multiple distances, allowing detection of patterns at multiple scales. For example, at short distances, the objects may be clustered, while at long distances, objects may be dispersed. This method can be used to test hypothesis H1 (non-random distribution).

2.2. Newman’s Assortativity
Newman’s assortativity (Newman, 2002) evaluates spatial organization using a graph (see Figure 1B) as input. The principle is to count existing connections between objects of the same category and compare these counts to the number of connections expected under a random object distribution (null hypothesis). This method can be used to test hypothesis H1. Figure 1H1-up shows no significant difference in the number of connections compared to a random distribution. However, Figure 1H1-down indicates a significant difference in the number of connections compared to the null hypothesis. The difference between Ripley’s function and Newman’s assortativity is that Ripley’s function provides a cluster analysis evaluated over a range of distances, while Newman’s assortativity tests the dataset as a whole for clustered patterns. However, the graph structure in Newman’s assortativity provides more flexibility since graph connections can be created by different techniques, such as k-nearest neighbors or Delaunay triangulation.

2.3. Centrality Scores
Centrality scores (Everett and Borgatti, 1999) are based on computational analysis to show object patterns in a graph representation (see Figure 1B). This provides awareness of complicated relations in large graphs. Figure 1H2 can be used as an example where green dots represent one object type (group members, e.g., immune cells) and gray dots represent members of all the other object types (non-group members, e.g., all types of tumor cells). This method can be used to test hypothesis H2. There are four different centrality scores. Group degree centrality is interpreted as the ratio of non-group members (gray) that are connected to group members (green). Higher values indicate a random distribution. Lower values indicate more grouped objects. This measure helps to identify crucial clusters in a graph. Group closeness centrality computes how close the group (green) is to the non-group members (gray). It is defined as the number of non-group members (gray) divided by the sum of all distances from the group (green) to all non-group members (gray). Higher values indicate a random distribution. Lower values indicate more grouped objects. Group betweenness centrality calculates the number of shortest paths connecting two non-group members (gray) while passing through the group (green). This can be thought of as a measure of cell infiltration. The average clustering coefficient measures how likely the group members are to cluster together.

3. SPATIAL STATISTICS ON TWO TYPES OF OBJECTS

In this section, we describe methods capable of testing hypothesis H3: objects of types A and B are non-randomly distributed in relation to one another within the distribution of other objects in the same tissue sample. The main idea is to identify whether different types of objects are closer than would be expected by chance. It is worth noting that physical closeness is no guarantee of interaction, but a non-random pattern may indicate involvement in similar processes.

3.1. Cluster Co-occurrence Ratio
The cluster co-occurrence ratio (Tosti et al., 2021) describes co-occurrence of two types of objects in the tissue. It measures the probability that an object of type A appears within a given distance from an object of type B by taking the ratio between occurrences of object type A within a distance from object type B and occurrences of object type A within a distance from object type B at random (null hypothesis). It is computed at multiple distances across the tissue area. In other words, it estimates the probability of finding an object of type A at a given distance, conditioned on the presence of object type B. Figure 1H3-up shows an example of a low cluster co-occurrence ratio and Figure 1H3-down shows an example of a high cluster co-occurrence ratio within a short distance.
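
As an illustration of the cluster co-occurrence ratio described in Section 3.1, the following Python sketch divides the observed number of A-B pairs within each distance by the mean number obtained after randomly reshuffling object labels while keeping locations fixed. This is a minimal sketch under stated assumptions, not the implementation of Tosti et al. (2021) or of any toolbox: the function name `cooccurrence_ratio`, the label-shuffling null model, the number of permutations, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np

def cooccurrence_ratio(coords, labels, type_a, type_b, radii, n_perm=200, seed=0):
    """Observed A-B pair counts per radius, divided by the mean count
    after shuffling object labels (locations stay fixed).
    Assumes type_a != type_b; the null model is an illustrative assumption."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all objects, computed once.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    def pair_counts(lab):
        # Distances from every object of type A to every object of type B.
        sub = dist[np.ix_(lab == type_a, lab == type_b)]
        return np.array([(sub <= r).sum() for r in radii], dtype=float)

    observed = pair_counts(labels)
    rng = np.random.default_rng(seed)
    null = np.array([pair_counts(rng.permutation(labels)) for _ in range(n_perm)])
    expected = null.mean(axis=0)
    # Leave NaN where no pairs are expected at random.
    return np.divide(observed, expected,
                     out=np.full_like(observed, np.nan), where=expected > 0)

# Synthetic example: type A objects are placed next to type B objects,
# on a background of 200 objects of other types.
rng = np.random.default_rng(1)
b_points = rng.uniform(0, 100, size=(40, 2))
a_points = b_points + rng.normal(0, 1, size=(40, 2))
others = rng.uniform(0, 100, size=(200, 2))
coords = np.vstack([a_points, b_points, others])
labels = np.array(["A"] * 40 + ["B"] * 40 + ["other"] * 200)

radii = [5.0, 20.0, 50.0]
ratios = cooccurrence_ratio(coords, labels, "A", "B", radii)
print(dict(zip(radii, np.round(ratios, 2))))
```

In this synthetic setting, a ratio well above 1 at the shortest distance corresponds to the situation in Figure 1H3-down, and the ratio moves toward 1 as the distance grows.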

3.2. Neighborhood Enrichment Test
The neighborhood enrichment test (Schapiro et al., 2017) identifies whether two object types are non-randomly distributed in relation to one another. The first step is to create a graph (see Figure 1B). Then two object types are selected (A and B) and the count of connections between A and B object types ($n_{AB}$) is compared to random permutations of the objects (null hypothesis). The random configuration is set by keeping the object locations and reshuffling the object identities. Based on these permutations, expected means ($\mu_{AB}$) and standard deviations ($\sigma_{AB}$) are calculated for each pair in the randomized dataset. Subsequently, a Z-score is calculated as $Z_{AB} = \frac{n_{AB} - \mu_{AB}}{\sigma_{AB}}$. The Z-score indicates if an object type pair is over-represented (positive Z-score, see Figure 1H3-down) or depleted (negative Z-score, see Figure 1H3-up) in the connectivity graph. The difference between the cluster co-occurrence ratio and the neighborhood enrichment test is that the cluster co-occurrence ratio evaluates various distances when determining if two object types are related to one another, while the neighborhood enrichment test examines the dataset as a whole. However, the graph structure in the neighborhood enrichment test again provides flexibility since graph connections can be created by different techniques.

3.3. Object-Object Correlation Analysis
Object-Object Correlation Analysis (Stoltzfus et al., 2020) investigates the correlation of different object types within neighborhoods over the tissue. A neighborhood is a composition of objects inside a circular area. The neighborhoods’ locations are uniformly allocated in a grid pattern throughout the space. The next step is to calculate the Pearson correlation coefficient of two types of objects within the neighborhoods. This method reveals which types of objects are associated with each other or unrelated to each other. Figure 1D shows an example of this neighborhood representation. The idea is to create this representation of two object types and then estimate the correlation coefficient across all overlapping neighborhoods.

4. SPATIAL STATISTICS ON MULTIPLE TYPES OF OBJECTS

In this section, we describe methods capable of testing hypothesis H4 (existence of “niches”). The input data can be described as points in space determining the presence of the object types. The main idea is to identify if there are recurring spatial patterns, or ‘niches’ of objects, in the tissue.

4.1. Spatial Co-expression Patterns
Spatial co-expression patterns (Dries et al., 2021) identify robust patterns of object types that follow correlated spatial expression arrangements throughout the tissue. The first step is to smooth the object expression over the space by averaging in a grid or with a k-nearest neighbor technique. This results in one density map for every object type, as illustrated in Figure 1C. The next step is to calculate the Pearson correlation coefficient of the pair combinations of all object types (i.e., density maps). Subsequently, similarly co-expressed object types are clustered together into modules, and averaging them creates meta-object types to represent the similarly co-expressed object types.

4.2. Spage2vec
Spage2vec (Partel and Wählby, 2020) analyzes the spatial heterogeneity of complex patterns of objects. The input data is a graph (see Figure 1B), and it uses a graph representation learning technique based on a graph neural network (GNN). During training, the GNN learns the topological structure of each object’s local neighborhood. It does not require labeled training data, but learns to find recurring patterns by comparing to a randomization of the data. After training, the observed patterns are summarized in a lower-dimensional embedding space that encapsulates high-dimensional information about each object’s neighborhood. The last step is to cluster the multidimensional space using an unsupervised classification method (e.g., Leiden; Traag et al., 2019). Clusters represent combinations of object types that can be identified as specific domain types or ‘niches’. Figure 1H4-down shows an example, where different neighborhood compositions were identified as different niches. The discovered niches can be further characterized by correlating their object composition with, e.g., in the case of in situ sequencing data, an external dataset of scRNA-seq signatures. The approach has also been applied to detect niches in multiplex fluorescence microscopy data of tissue microarrays (Solorzano et al., 2021).

4.3. Spot-Based Spatial Cell-Type Analysis by Multidimensional mRNA Density Estimation (SSAM)
SSAM (Park et al., 2021) was designed to identify tissue niches in transcriptomics data. The first step is to create probability maps of the object types. Kernel Density Estimation (KDE) with a Gaussian kernel is applied to every object type, resulting in a density map for each object type (see Figure 1C). Then all the images are put into a stack, creating a multi-channel image where each pixel is a vector describing the local expression profile. Next, group-type signatures are computed by clustering using Louvain (Blondel et al., 2008) or DBSCAN (Ester et al., 1996), and outliers (vectors far from their cluster medoid) are removed. The cluster centroids represent the group-type signatures. The third step is to generate a group-type map. Each pixel in the vector field is classified according to the maximum correlation with the group-type signatures. The group-type signatures can be taken from the previous step or an external dataset, such as scRNA-seq. The fourth step is to identify the tissue niches with a specific group-type composition. The composition is computed in a circular sliding window over the tissue and clustered by agglomerative hierarchical clustering, merging highly correlating clusters. Finally, each cluster represents a unique tissue niche. An example can be seen in Figure 1H4-down, where two different niche types were found.

4.4. Vector Approach
Describing local neighborhoods as vectors of counts of object types has been suggested in several publications under multiple names (Stoltzfus et al., 2020; He et al., 2021; Salas et al., 2021).
Here we refer to it as the vector approach. Its goal is to identify similar neighborhoods across the tissue sample. The first step is to define the neighborhoods. A neighborhood is a composition of object types inside a fixed area. The neighborhoods’ locations can be uniformly allocated in a grid pattern throughout the space, constructed around each object from the dataset (Stoltzfus et al., 2020), based on density peak clustering (He et al., 2021), or be defined by previously segmented tissue structures (Salas et al., 2021). Next, each neighborhood is represented as a vector containing counts of object types normalized, for example, by dividing each object count by the sum of all counts in the neighborhood (local normalization) or by dividing each object count by the sum of all the counts in the sample (global normalization). The normalized vectors are projected to a multidimensional space followed by clustering to identify niches. Clustering options include common methods such as k-means and hierarchical clustering, more advanced methods such as Self-Organizing Maps (Kohonen, 1982), Gaussian mixture models, or DBSCAN (Ester et al., 1996), and graph-based community detection approaches such as Leiden (Traag et al., 2019) or Louvain (Blondel et al., 2008).

5. TOOLBOXES

Several toolboxes simplifying spatial statistics are available. Squidpy (Palla et al., 2021) includes four methods from this review: Ripley’s function, Centrality scores, Cluster co-occurrence ratio, and the Neighborhood enrichment test. The toolbox PySpacell (Rose et al., 2019) includes methods such as Ripley’s function and Newman’s assortativity. CytoMap (Stoltzfus et al., 2020) includes Ripley’s function, Object-Object correlation analysis and the Vector approach. Giotto (Dries et al., 2021) focuses mostly on data consisting of coordinates and quantitative information on multiple measurements per location, but also includes techniques such as the Neighborhood enrichment test and Spatial co-expression patterns. The recently published Matisse (Salas et al., 2021) includes the Neighborhood enrichment test and the Vector approach. The toolbox histoCAT (Schapiro et al., 2017) includes the Neighborhood enrichment test, and Clustermap (He et al., 2021) includes the Vector approach. Table 1 summarizes these toolboxes and lists the hypotheses that each of the methods is capable of testing.

6. DISCUSSION

There are many published methods for spatial statistics. However, they differ in the type of input data they can handle. In this review, we focused on methods where the input data can be described as points in 2D tissue space representing the presence of different object types. Another type of input data consists of coordinates and quantitative information on multiple measurements per location, as in, e.g., spatial transcriptomics (Larsson et al., 2021). Spatial statistics for exploring this type of data can focus on a single type of objects, with methods such as Binary Spatial extracts (BinSpect, Dries et al., 2021), Getis-Ord General G (Getis and Ord, 2010), Spatial pattern recognition via kernels (SPARK, Sun et al., 2020), spatialDE (Svensson et al., 2018), Trendsceek (Edsgärd et al., 2018), Geary’s c (Geary, 1954) or Moran’s I (Moran, 1950). In the case of more than a single type of object, there are other methods, such as Spatially informed ligand-receptor pairing (Dries et al., 2021), Object-Object Correlation Analysis (Stoltzfus et al., 2020) and Spatial domain detection (Dries et al., 2021) that can be applied for exploring co-locations, potential interactions and niche discovery.

The methods mentioned above are also applicable to the type of data we present in this paper (input data as points in space determining the presence of the object types), but the data would have to be pre-processed by converting dots into spatially binned counts for all object types, as exemplified for a single object type in Figure 1D. With such a representation, spatial resolution would be lost, but data could be analyzed by methods such as Trendsceek and SPARK.

Many of the methods for analyzing multiple object types include clustering as a final step of the analysis. Different clustering algorithms might lead to different results when applied to the same data, and should be carefully selected. It should also be noted that proving or disproving a hypothesis regarding spatial statistics will depend on the quality and amount of input data. One should also keep in mind that a 2D section may not always be a good representation of a true 3D structure such as an organ.
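
The pre-processing step mentioned in this discussion, converting point data into spatially binned counts for all object types (cf. Figure 1D), could look roughly like the following Python sketch. It assumes square bins and a known rectangular tissue extent; the function name `binned_counts`, the bin size, and the synthetic data are hypothetical and only serve to illustrate the idea.

```python
import numpy as np

def binned_counts(coords, labels, bin_size, extent):
    """Return the sorted object types and an array of shape
    (n_types, n_bins_y, n_bins_x) holding object counts per square bin.
    `extent` is ((xmin, xmax), (ymin, ymax)); an illustrative assumption."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    (xmin, xmax), (ymin, ymax) = extent
    x_edges = np.arange(xmin, xmax + bin_size, bin_size)
    y_edges = np.arange(ymin, ymax + bin_size, bin_size)
    types = np.unique(labels)
    maps = []
    for t in types:
        pts = coords[labels == t]
        # Rows index y and columns index x, as in an image.
        counts, _, _ = np.histogram2d(pts[:, 1], pts[:, 0], bins=[y_edges, x_edges])
        maps.append(counts)
    return types, np.stack(maps)

# Synthetic example: 300 objects of three types in a 100 x 100 area.
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(300, 2))
labels = rng.choice(["A", "B", "other"], size=300)

types, counts = binned_counts(coords, labels, bin_size=10.0,
                              extent=((0, 100), (0, 100)))
print(types, counts.shape)
```

Each channel of the resulting array is a count map for one object type. Choosing the bin size trades spatial resolution against the number of objects per bin.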