
scRNAseq clustering tools

Åsa Björklund
[email protected]
Cell identity

Wagner et al. Nat. Biotech 2016


What is a cell type?

• A cell that performs a specific function?
• A cell that performs a specific function at a specific
location/tissue?
• It is not clear where to draw the line between cell types
and subpopulations within a cell type.
• It is also important to distinguish between cell type and
cell state. A cell state may for example be:
– Infected/non-infected
– Metabolically active/inactive
– A cell cycle stage
– Apoptotic
Outline

• Basic clustering theory


• Graph theory introduction
• scRNAseq clustering with graphs
• Examples of different tools
How can we identify populations?
Considerations for clustering

• Hypotheses:
– What is a cell type? What cell types are in my tissue?
– What is the number of clusters k?
• Choices:
– Gene set selection
– Similarity measure / Space to calculate similarity
– Algorithm and its hyperparameters
• Different choices lead to different results. Validate,
interpret and repeat the steps.
What is clustering?

• “The process of organizing objects into groups whose


members are similar in some way”
• Typical methods are:
– Hierarchical clustering
– K-means clustering
– Density based clustering
– Graph based clustering
The main idea

• There is structure when:
1) Samples within a cluster
resemble each other (small
within-cluster variance, σW(i))
2) Clusters deviate from each
other (large between-cluster
variance, σB)

Group samples such that the between-cluster variance is large
relative to the within-cluster variances, i.e. maximize
σB / Σi σW(i).
Hierarchical clustering

• Builds on distances between data points


• Agglomerative – starts with all data points as
individual clusters and joins the most similar ones in
a bottom-up approach
• Divisive – starts with all data points in one large
cluster and splits it into two at each step, a top-down
approach
• The final product is a dendrogram representing the
decisions at each merge/split of clusters
Hierarchical clustering

Clusters are obtained by cutting the tree at a desired level
Linkage criteria
• Calculation of similarity between two clusters (or between a
cluster and a data point). Common criteria are single,
complete and average linkage.
• Ward (minimum variance method): the similarity of two clusters is
based on the increase in squared error when the two clusters are
merged.
(a minimal hierarchical clustering sketch follows below)

http://www.slideshare.net/uzairjavedsiddiqui/malhotra20
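A minimal R sketch of agglomerative clustering with base R's hclust and tree cutting with cutree; the toy matrix, the Ward linkage and the choice of k = 4 are illustrative assumptions, not values from the slides.

```r
# Toy data: 100 "cells" x 10 "genes" (illustrative values)
set.seed(1)
mat <- matrix(rnorm(1000), nrow = 100)

# Distance matrix (Euclidean), then agglomerative clustering with Ward linkage
d <- dist(mat, method = "euclidean")
hc <- hclust(d, method = "ward.D2")

# Cut the dendrogram at a desired level, here into 4 clusters
clusters <- cutree(hc, k = 4)
table(clusters)
plot(hc)  # inspect the dendrogram before choosing where to cut
```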
K-means clustering
1. Start with a random selection of cluster centers
(centroids)
2. Assign each data point to the nearest centroid
3. Recalculate the centroids for the new cluster
definitions
4. Repeat steps 2-3 until no more changes occur.
Can use the same distance measures as in hclust
(a minimal sketch follows below).

https://en.wikipedia.org/wiki/K-means_clustering
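A minimal sketch with base R's kmeans; the toy data and k = 4 are illustrative. Note that base R kmeans uses Euclidean distance only.

```r
set.seed(1)
mat <- matrix(rnorm(1000), nrow = 100)  # toy data: 100 cells x 10 genes

# K-means with k = 4 centroids; nstart restarts from several random
# centroid selections to avoid a poor local optimum
km <- kmeans(mat, centers = 4, nstart = 20)
table(km$cluster)
```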
Network/graph clustering
• Node/vertex
• Edge (can be weighted and/or directed)
• Community
• Hubs
• Connectivity – the number of edges of a node
(http://www.lyonwj.com/2016/06/26/
graph-of-thrones-neo4j-social-network-analysis/)
Types of graphs

• The k-Nearest Neighbor (kNN) graph is a graph in
which two vertices p and q are connected by an
edge if the distance between p and q is among the
k smallest distances from p to the other objects in
P.
• The Shared Nearest Neighbor (SNN) graph has edge
weights that define the proximity, or similarity, of
two vertices in terms of the number of neighbors (i.e.,
directly connected vertices) they have in common.
SNN graph

(Ertöz et al. 2002)


SNN graph

• A common measure of shared neighbors is the
Jaccard index:
– shared neighbors / total (union of) neighbors of both
• Other measures include rank, number of shared
neighbors, and the overlap coefficient
• It is common to prune the graph – remove all edges
between nodes with e.g. Jaccard similarity below a
cutoff (a small sketch follows below).
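A tiny sketch of the Jaccard weight between the neighbor sets of two cells; the neighbor indices are made up for illustration.

```r
# Indices of the k nearest neighbors of two cells (made-up values)
nn_a <- c(2, 5, 7, 9, 12)
nn_b <- c(5, 7, 9, 13, 20)

# Jaccard index: shared neighbors / union of both neighbor sets
jaccard <- length(intersect(nn_a, nn_b)) / length(union(nn_a, nn_b))
jaccard  # edges with weight below a chosen cutoff would be pruned
```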
Graphs, adjacency and weight matrices
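To make the relationship between a graph and its adjacency/weight matrix concrete, a small igraph sketch; the matrix values are illustrative.

```r
library(igraph)

# A small symmetric weight matrix (illustrative); zero means no edge
W <- matrix(c(0,   1, 0.5,
              1,   0, 0,
              0.5, 0, 0), nrow = 3, byrow = TRUE)

# Build a weighted, undirected graph from the adjacency/weight matrix
g <- graph_from_adjacency_matrix(W, mode = "undirected", weighted = TRUE)
E(g)$weight                               # edge weights recovered from the matrix
as_adjacency_matrix(g, attr = "weight")   # and back to a matrix again
```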
Community detection
Communities, or clusters, are usually groups of vertices
having higher probability of being connected to each other
than to members of other groups.
Community detection

• The main objective is to find a group (community) of
vertices with more edges inside the group than
edges linking vertices of the group with the rest of
the graph.
Graph cuts

• A graph cut partitions a graph
into subgraphs
• The cut cost is the sum of the weights
of the edges that are cut
• Clustering by graph cuts: find
the smallest cut that bi-partitions
the graph
• The smallest cut is not always
the best cut – it may give many
small disjoint clusters
Normalized cut
• Normalized cut computes the cut cost as a fraction of
the total edge connections to all the nodes in the
graph (see the formulation below).
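The formula itself did not survive the slide export; for reference, the standard Shi–Malik formulation of this idea is:

```latex
\mathrm{Ncut}(A,B) = \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(A,V)}
                   + \frac{\mathrm{cut}(A,B)}{\mathrm{assoc}(B,V)}
```

where cut(A,B) is the total weight of the edges crossing the partition, and assoc(A,V) is the total weight of the edges connecting A to all nodes V in the graph.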
Normalized cut

• Searching for the best normalized cut is NP-hard


• We need a heuristic method to solve the problem:
– Spectral clustering
– Markov clustering
– Louvain
– Leiden
– ...
For single cell data

• Can start with distances based on correlation,
Euclidean distance in PCA space, etc. – the same as for
hclust/k-means.
• Build a KNN graph with cells as vertices (a minimal
sketch follows below).
– Find the k nearest neighbors of each cell.
– The size of k will strongly influence the network structure.
• Can create a weighted network based on shared
neighbors (SNN).
• Find clusters with a community detection method.
• Graphs can also be used for trajectory analysis etc.
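A minimal sketch of this recipe, assuming a cells-by-PCs matrix of PCA scores named pca (not defined here) and using the FNN package for the neighbor search; k = 15 is an arbitrary choice.

```r
library(FNN)     # fast k-nearest-neighbor search
library(igraph)

# Assumes `pca` is a cells-x-PCs matrix of PCA scores
k <- 15
knn <- get.knn(pca, k = k)   # Euclidean distances in PCA space

# Edge list: connect each cell to its k nearest neighbors
edges <- cbind(rep(seq_len(nrow(pca)), each = k),
               as.vector(t(knn$nn.index)))
g <- graph_from_edgelist(edges, directed = FALSE)
g <- simplify(g)             # drop duplicate edges

# Community detection on the KNN graph
cl <- cluster_louvain(g)
table(membership(cl))
```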
How to work with networks

• The igraph package – implemented for R, Python and Ruby
• Has the most commonly used layout optimization and
community detection methods implemented.

• Simple R example at:
https://jef.works/blog/2017/09/13/graph-based-community-
detection-for-clustering-analysis/
• Tutorial for igraph at:
http://kateto.net/networks-r-igraph
• Example of how to build your own graph with scRNAseq data:
https://github.com/NBISweden/workshop-scRNAseq/blob/master/oldlabs/igraph.md
Distance between cells

• All clustering methods need to define distances


between cells. Things to consider are:
– What gene set should be included?
• Commonly used: Highly variable genes
– What space to calculate distance in?
• Commonly done in PCA space
• Can also be full space, tSNE, UMAP etc.
– How many dimensions to include?
– What distance measure?
Different distance measures

• Most commonly used in scRNA-seq (see the sketch below):
– Euclidean distance
– Inverted pairwise correlations (1 - correlation)
• Other common methods are:
– Manhattan distance
– Mahalanobis distance
– Maximum distance
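A short sketch of several of these measures with base R; the toy PCA-score matrix is illustrative.

```r
set.seed(1)
pca <- matrix(rnorm(200), nrow = 20)  # toy PCA scores: 20 cells x 10 PCs

# Distances between cells (rows) under different metrics
d_euc <- dist(pca, method = "euclidean")
d_man <- dist(pca, method = "manhattan")
d_max <- dist(pca, method = "maximum")

# Inverted pairwise correlations between cells: 1 - cor
d_cor <- as.dist(1 - cor(t(pca)))
```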
Selection of principal components

• To overcome the extensive


technical noise in scRNA-seq
data, it is common to cluster
cells based on their PCA
scores
• Each PC represents a
‘metagene’ that (linearly)
combines information across
a correlated gene set
• Depending on the
heterogeneity of your data,
more or fewer PCs should be
selected.
Bootstrapping

• How confident can you be that


the clusters you see are real?
• You can always take a random
set of cells from the same cell
type and manage to split them
into clusters.
• Most scRNAseq packages do
not include any bootstrapping;
scran has the function
bootstrapCluster (a sketch follows below).

(Rosvall et al. PLoS ONE 2010)
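A hedged sketch of scran's bootstrapCluster, assuming a SingleCellExperiment sce with log-counts; the exact argument names may differ between scran versions (newer releases move this functionality towards the bluster package).

```r
library(scran)

# Resample cells with replacement and re-cluster each time to gauge
# how stable the cluster separations are. `sce` is an assumption here.
coassign <- bootstrapCluster(
    logcounts(sce),
    FUN = function(x) {
        g <- buildSNNGraph(x, k = 10)           # SNN graph on the resampled cells
        igraph::cluster_walktrap(g)$membership  # cluster labels
    }
)
# High off-diagonal coassignment values flag cluster pairs that are
# not robustly separated across bootstrap replicates
```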


Seurat clustering

• FindNeighbors:
– First constructs a KNN (k-nearest neighbor) graph – by default based on
the Euclidean distance in PCA space
– Then builds an SNN graph, where the edge weight between any two cells
is based on the shared overlap in their local neighborhoods (Jaccard
index), with pruning of distant edges.
• Important parameters:
– reduction: default is “pca”
– dims: number of PCs
– k.param: number of neighbors in the KNN graph
– prune.SNN: cutoff for pruning

(http://satijalab.org/seurat/)
Seurat clustering

• FindClusters: clusters the cells by modularity optimization with
one of:
– Louvain
– Louvain with multilevel refinement
– Leiden
– SLM
• Important parameters:
– resolution: default 0.8; larger values give more communities,
smaller values give fewer.
– algorithm: which of the methods above to use (a combined sketch
follows below)

(http://satijalab.org/seurat/)
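A sketch of the two calls combined, assuming a normalized Seurat object obj with variable features and PCA already computed; the parameter values shown are the documented defaults or illustrative choices.

```r
library(Seurat)

# KNN graph on the first 30 PCs, then SNN graph with Jaccard weights;
# prune.SNN = 1/15 and k.param = 20 are the defaults
obj <- FindNeighbors(obj, reduction = "pca", dims = 1:30,
                     k.param = 20, prune.SNN = 1/15)

# Modularity optimization; algorithm = 1 is the original Louvain method
obj <- FindClusters(obj, resolution = 0.8, algorithm = 1)
table(Idents(obj))  # cluster sizes
```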
Scran clustering

• buildKNNGraph: constructs the KNN graph
• buildSNNGraph: constructs the KNN and then the SNN
graph. NB: adds weighted edges between cells that share
neighbors.
– Allows for different similarity measures; the default is “rank”,
but “jaccard” and “number” are also available
• Important parameters:
– use.dimred: dimensionality reduction to use
– k: number of neighbors
– type: weighting method
Scran clustering

• Community detection is done with the igraph


package.
– cluster_louvain
– cluster_leiden
– cluster_infomap
– Many more…
• There is no resolution parameter; the number of clusters is
mainly tweaked with the k parameter (a sketch follows below).
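A minimal sketch, assuming a SingleCellExperiment sce with a "PCA" entry in its reducedDims:

```r
library(scran)
library(igraph)

# SNN graph in PCA space with rank-based edge weights (the default)
g <- buildSNNGraph(sce, k = 10, use.dimred = "PCA", type = "rank")

# Community detection via igraph; tune cluster granularity mainly via k
cl <- cluster_louvain(g)
table(membership(cl))
```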
Scanpy clustering

• sc.pp.neighbors – creates the KNN graph
– Has many different options for distance calculation; the default
is euclidean.
– No SNN graph construction
• Clustering:
– sc.tl.leiden
– sc.tl.louvain
– Can specify the resolution, as in Seurat.
Pagoda – Pathway And Geneset OverDispersion Analysis
Implemented in the SCDE package

(Fan et al. Nature Methods 2016)


Pagoda – Pathway And Geneset OverDispersion Analysis

• Helps with biological interpretation of the data
• Important to have good and relevant gene sets
• High memory consumption when running Pagoda
• Also has methods for removing the effects of batch, number of
detected genes, cell cycle etc.

(Fan et al. Nature Methods 2016)


Pagoda2

• Similar error modelling
• Now includes KNN graph clustering
• largeVis for dimensionality reduction
• Can visualize gene sets.
• https://github.com/hms-dbmi/pagoda2
HDBSCAN

• Hierarchical DBSCAN – density-based clustering, e.g. on
tSNE/UMAP coordinates (a sketch follows below)
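A sketch using the dbscan package, assuming a cells-by-2 matrix of UMAP coordinates named umap_coords; minPts = 15 is an arbitrary choice.

```r
library(dbscan)

# Density-based clustering on a 2D embedding; `umap_coords` is assumed
hdb <- hdbscan(umap_coords, minPts = 15)
table(hdb$cluster)   # cluster 0 contains points classified as noise
```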
Loupe – Cell Browser, from 10X Genomics
Which clustering method is best?

• Depends on the input data


• Consistency between several methods gives
confidence that the clustering is robust
• The clustering method that is most consistent (has the
best bootstrap values) is not always the best
• In a simple case where you have clearly distinct
celltypes, simple hierarchical clustering based on
Euclidean or correlation distances will work fine.
Comparison of clustering methods
How many clusters do you really have?

• It is hard to know when to stop clustering – you can
always split the cells more times.
• Things that can help:
– Do you get any/many significant DE genes from the next
split?
– Some tools have automated predictions for the number of
clusters – these may not always be biologically relevant
• Always check back against the QC data – is what you are
splitting mainly related to batches or QC measures
(especially the number of detected genes)?
Clustree – R package

https://cran.r-project.org/web/packages/clustree/vignettes/clustree.html
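A minimal sketch, assuming a Seurat object obj that has been clustered at several resolutions, e.g. FindClusters(obj, resolution = c(0.2, 0.5, 0.8, 1.2)), so that the metadata contains columns such as RNA_snn_res.0.8:

```r
library(clustree)

# Visualize how cells move between clusters as resolution increases;
# clustree picks up metadata columns named <prefix><resolution>
clustree(obj, prefix = "RNA_snn_res.")
```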
Subclustering

• Most of the variation in a heterogeneous data set
will be between broad celltypes.
• By selecting one celltype and rerunning HVG
selection and PCA, most of the variation will be
differences between subtypes (a sketch follows below).
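A sketch of this subclustering recipe in Seurat; the object obj and the identity "Tcell" are assumptions for illustration.

```r
library(Seurat)

# Subset one celltype, then redo HVG selection and PCA so the top PCs
# capture variation between its subtypes rather than between celltypes
sub <- subset(obj, idents = "Tcell")
sub <- FindVariableFeatures(sub)
sub <- ScaleData(sub)
sub <- RunPCA(sub)
sub <- FindNeighbors(sub, dims = 1:20)
sub <- FindClusters(sub, resolution = 0.8)
```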
Check QC data
Check batches / conditions
Considerations for clustering

• Hypotheses:
– What is a cell type? What cell types are in my tissue?
– What is the number of clusters k?
• Choices:
– Gene set selection
– Similarity measure / Space to calculate similarity
– Algorithm and its hyperparameters
• Different choices lead to different results. Validate,
interpret and repeat the steps.
Conclusions

• Clearly distinct celltypes will give similar results
regardless of method
• Subclustering within celltypes may require careful
selection of variable genes, dimensionality reduction etc.
• Consistent results from different methods, and agreement
with the tSNE/UMAP layout, is always best!
• Use your biological knowledge to evaluate the results
– but try to be unbiased!
