Lecture 1. Dimension Reduction
Lecture 1. Introduction
Sangsoo Lim
Assistant Professor
School of AI Software Convergence
Dongguk University-Seoul
Course Overview
Lecture 1. Lecture 3.
Lecture 2. Lecture 4.
[Figure: axes labeled vertical axis, height, and depth]
• Brief list:
• Noise reduction
• Data visualization
Challenges in High-Dimensional Data
• As dimensions increase, the volume of the space increases exponentially,
causing the data to become sparse
• Implications:
• Overfitting
• Computational inefficiency
• Hard to visualize
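The sparsity effect described above can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the lecture: points are drawn uniformly from the unit hypercube, and the helper name `nearest_to_farthest_ratio` is made up for this example. As the number of dimensions grows, the nearest and farthest neighbors of a point end up at almost the same distance, which is one symptom of sparse data.

```python
import numpy as np

def nearest_to_farthest_ratio(n_points, n_dims, rng):
    """Ratio of the nearest to the farthest distance from the first point."""
    points = rng.random((n_points, n_dims))          # uniform in the unit hypercube
    dists = np.linalg.norm(points[1:] - points[0], axis=1)
    return dists.min() / dists.max()

rng = np.random.default_rng(0)
ratios = {d: nearest_to_farthest_ratio(500, d, rng) for d in (2, 10, 100, 1000)}

for d, ratio in ratios.items():
    # A ratio close to 1 means "near" and "far" neighbors are hard to tell apart.
    print(f"{d:>5} dims: nearest/farthest = {ratio:.3f}")
```

With the sample size held fixed, the ratio moves toward 1 as dimensions are added, so distance-based methods lose contrast.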
What Matters if (# Dimensions) > (# Samples)
• Curse of Dimensionality
: the explosive nature of increasing data dimensions and its resulting exponential
increase in computational efforts required for its processing and/or analysis.
Shashmi Karanam, ‘Curse of Dimensionality — A “Curse” to Machine Learning’ Towards Data Science (2021)
Curse of Dimensionality
• What is the optimal number of dimensions vs samples?
Complexity of Biological Data
• Multi-faceted nature of bio-data: genomics, proteomics, metabolomics, …
Kreitmaier, Peter, Georgia Katsoula, and Eleftheria Zeggini. "Insights from multi-omics integration in complex disease primary tissues." Trends in Genetics (2023).
Where are we heading?
• Multi-faceted nature of bio-data: genomics, proteomics, metabolomics, …
Gene     Description                                   Cell 1  Cell 2  Cell 3  Cell 4  Cell 5
Inpp5d   inositol polyphosphate-5-phosphatase D        7.00    5.45    5.89    6.03    5.75
Aim2     absent in melanoma 2                          3.01    4.37    4.59    4.38    4.18
Gldn     gliomedin                                     3.48    3.63    4.61    4.70    4.74
Frem2    Fras1 related extracellular matrix protein 2  4.75    4.66    3.46    3.74    3.45
Rps3a1   ribosomal protein S3A1                        6.10    7.23    7.44    7.36    7.34
Slc38a3  solute carrier family 38, member 3            1.90    3.16    3.52    3.61    3.19
Mt1      metallothionein 1                             5.07    6.49    6.46    6.04    6.05
C1s1     complement component 1, s subcomponent 1      2.74    3.02    3.86    4.10    4.10
Cds1     CDP-diacylglycerol synthase 1                 4.55    4.22    3.80    3.16    3.12
Ifi44    interferon-induced protein 44                 4.82    4.52    3.87    3.42    3.59
Lefty2   left-right determination factor 2             6.95    6.28    5.88    5.60    5.61
Fmr1nb   fragile X mental retardation 1 neighbor       4.28    2.78    3.10    3.25    2.57
Tagln    transgelin                                    7.93    7.91    7.20    7.02    6.68
https://www.biologyexams4u.com/2023/02/10-types-of-biological-databases.html
Data Types in Bioinformatics
• Which ‘omics data types are popular?
Ebrahim, Ali, et al. "Multi-omic data integration enables discovery of hidden biological regularities." Nature communications 7.1 (2016): 13091.
Characteristics of Single-Omics Data Sets
• Dimension: How many Biological Features?
• Size: How many Samples?
[Figure: data matrix with axes labeled Gene and Sample]
Feldner-Busztin, Dylan, et al. "Dealing with dimensionality: the application of machine learning to multi-omics data." Bioinformatics 39.2 (2023): btad021.
Popularity of Omics Data Types
• Which ‘omics data types are popular?
Feldner-Busztin, Dylan, et al. "Dealing with dimensionality: the application of machine learning to multi-omics data." Bioinformatics 39.2 (2023): btad021.
Dimensionality of Bioinformatics Data Sets
• The Cancer Genome Atlas (TCGA)
www.cancer.gov
Weinstein, John N., et al. "The cancer genome atlas pan-cancer analysis project." Nature genetics 45.10 (2013): 1113-1120.
Dimension Reduction techniques for the integrative analysis of multi-omics data (BiB, 2016)
Biton, Anne, et al. "Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes." Cell reports 9.4 (2014): 1235-1245.
Dimension Reduction techniques for the integrative analysis of multi-omics data (BiB, 2016)
• Exploratory data analysis (EDA) is an important early step in omics data analysis.
• Goal: summarize the data and reveal potential issues such as batch effects and outliers
• Dimension reduction: considers the global variance of the data set, highlighting general gradients or patterns
• Dimension reduction approaches decompose the data into a few new variables
(called components) that explain most of the differences in observations.
Reducing Matrix Dimension
• Often, our data can be represented by an m-by-n matrix.
• This matrix can be closely approximated by the product of three
matrices that share a small common dimension r.
$$A_{m \times n} \approx U_{m \times r} \, \Sigma_{r \times r} \, (V^T)_{r \times n}$$
• The first dimension is the direction in which the points show the greatest variance.
• The second dimension is the direction, orthogonal to the first, in which the points show the
second greatest variance.
• And so on, until the variance left in the remaining directions is negligible.
• Σ: Singular values
• r x r diagonal matrix (strength of each ‘concept’)
• $A \approx U \Sigma V^T = \sum_i \sigma_i \, u_i \circ v_i^T$
• $\sigma_i$: scalar
• $u_i$: vector
• $v_i^T$: vector
• U: user-to-concept matrix
• Σ: Singular values
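The rank-r approximation above can be sketched with NumPy's `np.linalg.svd`. The matrix `A` here is synthetic (a rank-2 matrix plus small noise, with sizes chosen only for illustration), so the numbers are not from the lecture. The sketch keeps the r largest singular values and checks that the matrix product and the sum of rank-1 outer products give the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2

# Hypothetical data: an exactly rank-r matrix plus a little noise.
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n)) + 0.01 * rng.normal(size=(m, n))

# Thin SVD; the singular values in s are sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the r strongest "concepts".
U_r, S_r, Vt_r = U[:, :r], np.diag(s[:r]), Vt[:r, :]
A_r = U_r @ S_r @ Vt_r                      # (m x r)(r x r)(r x n)

# The same approximation written as a sum of rank-1 terms sigma_i * u_i v_i^T.
A_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))

rel_error = np.linalg.norm(A - A_r) / np.linalg.norm(A)
print(f"relative approximation error with r={r}: {rel_error:.4f}")
```

Because the synthetic matrix is nearly rank 2, two components recover it almost exactly; real omics matrices usually need more components and lose more information.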
• Primary goal:
• Simplify data by reducing its dimensions
• Retain as much information as possible by maximizing the variance that is kept
• SVD vs PCA:
• SVD: Produces three matrices U, Σ, and V^T. U and V are orthogonal matrices,
and Σ is a diagonal matrix with singular values in decreasing order.
• PCA: Produces principal components, which are linear combinations of the
original features. The coefficients of these linear combinations are the
eigenvectors of the data's covariance matrix.
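The link between the two views can be checked numerically. The following is a minimal sketch on synthetic data (the mixing matrix and sample size are arbitrary choices for this example): the eigenvectors of the covariance matrix of the centered data match the right singular vectors from the SVD up to sign, and the eigenvalues equal the squared singular values divided by n - 1.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 50 samples, 3 correlated features.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(50, 3)) @ mixing
Xc = X - X.mean(axis=0)                     # center each feature
n = Xc.shape[0]

# Route 1 (PCA view): eigendecomposition of the covariance matrix.
cov = Xc.T @ Xc / (n - 1)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder to decreasing
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2 (SVD view): singular value decomposition of the centered data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vars = s**2 / (n - 1)

print("eigenvalues:       ", eigvals.round(4))
print("from singular vals:", svd_vars.round(4))
```

The sign of each component is arbitrary in both routes, which is why any comparison of the vectors should use absolute values.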
Mathematics behind PCA
• Eigenvalues: 12.4, 9.5, 1.3
Mathematics behind PCA
• When we reduce the data dimension from 3 to 2,
$$\frac{12.4 + 9.5}{12.4 + 9.5 + 1.3} = 0.944$$
• 94.4% of total variation can be explained by using two PCs (PC1 & PC2).
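The arithmetic above can be reproduced in a few lines. The eigenvalues are the ones given on the slide; the variable names are chosen for this sketch.

```python
import numpy as np

eigenvalues = np.array([12.4, 9.5, 1.3])            # from the slide, in decreasing order

explained_ratio = eigenvalues / eigenvalues.sum()    # share of variance per PC
cumulative = np.cumsum(explained_ratio)              # share kept by the first k PCs

for k, share in enumerate(cumulative, start=1):
    print(f"first {k} PC(s): {share:.1%} of total variance")
```

The second cumulative value is the 94.4% quoted above, and the last one is always 100% because all PCs together explain everything.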
Mathematics behind PCA
• How to compute PCs?
1. Normalize data: $X = \frac{\text{Data} - \mu}{\sigma}$
• Since we only plot 2 dimensions, we would like to know that they explain
the data well
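The normalization step above can be sketched as follows. The values are the first three genes (Inpp5d, Aim2, Gldn) across the five cells from the expression table earlier in the lecture, arranged with one row per cell. Treating each gene as a feature and using the sample standard deviation (`ddof=1`) are assumptions of this sketch, not something the slide specifies.

```python
import numpy as np

# Rows are cells (samples), columns are genes (features): Inpp5d, Aim2, Gldn.
data = np.array([[7.00, 3.01, 3.48],
                 [5.45, 4.37, 3.63],
                 [5.89, 4.59, 4.61],
                 [6.03, 4.38, 4.70],
                 [5.75, 4.18, 4.74]])

mu = data.mean(axis=0)              # per-gene mean
sigma = data.std(axis=0, ddof=1)    # per-gene sample standard deviation (assumed)

X = (data - mu) / sigma             # X = (Data - mu) / sigma

print(X.round(2))
```

After this step every gene has mean 0 and unit variance, so highly expressed genes do not dominate the principal components simply because of their scale.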
PCA
Eigenfaces (PCs)
• Dimensionality reduction
PCA in Genomics
• Simple example using 2 genes and 10 cells
PCA in Genomics
• Find line of best fit, passing through the origin
PCA in Genomics
• Assigning Loadings to Genes
Single vector, or ‘eigenvector’
Loadings:
• Gene1 = 0.82
• Gene2 = 0.57
• Each PC explains some proportion of the total variance in the data.
Together, the PCs explain all of it.
• PC1 always explains the most
• PC2 explains the next most, and so on
• Since we only plot 2 dimensions, we would like to know that they explain
the data well
• Project onto PC
• Calculate distance to the origin
PCi = (GeneA*10)+(GeneB*3)+(GeneC*-4)+(GeneD*-20)…
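The projection step can be written as a weighted sum of gene values, which is what the formula above expresses. This sketch uses the two loadings from the slide (Gene1 = 0.82, Gene2 = 0.57); the expression values in `cells` are hypothetical, mean-centered numbers made up for illustration.

```python
import numpy as np

loadings = np.array([0.82, 0.57])     # Gene1 and Gene2 loadings from the slide

# Hypothetical mean-centered expression values, one row per cell.
cells = np.array([[ 2.0,  1.5],
                  [ 1.0,  0.5],
                  [-1.0, -0.8],
                  [-2.0, -1.2]])

# PC score of each cell = Gene1 * 0.82 + Gene2 * 0.57
scores = cells @ loadings

# The same score for the first cell, written out term by term.
first_by_hand = 2.0 * 0.82 + 1.5 * 0.57

print(scores)
```

Because the loading vector has length close to 1, each score is approximately the signed distance from the origin to the cell's projection onto the PC.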
Di Nardo, Lucia, et al. "Molecular alterations in basal cell carcinoma subtypes." Scientific Reports 11.1 (2021): 13206.
PCA in Transcriptomics
• Single-cell RNA-seq analysis identified multiple cell types in mammary tumors.
Yuan, Wenlin, et al. "S100a4 upregulation in Pik3ca H1047R; Trp53 R270H; MMTV-Cre-driven mammary tumors promotes metastasis." Breast Cancer Research 21 (2019): 1-11.
PCA: Limitations
• The directions with the largest variance are assumed to be of the most interest
• Only considers linear transformations of the original variables
• If the variables are correlated, PCA can achieve dimension reduction. If not, PCA
just orders them according to their variances
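The last limitation can be illustrated with a small simulation. This is a sketch on synthetic data (three variables, 2000 samples, noise level 0.1, all chosen arbitrarily), and `explained_ratio` is a helper defined here, not a library function. When the variables are strongly correlated, the first PC captures nearly all the variance; when they are uncorrelated with similar variances, each PC explains about one third and nothing is gained.

```python
import numpy as np

def explained_ratio(X):
    """Share of total variance explained by each PC, in decreasing order."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    return eigvals / eigvals.sum()

rng = np.random.default_rng(0)
n = 2000

# Correlated case: three noisy copies of one hidden variable z.
z = rng.normal(size=n)
correlated = np.column_stack([z + 0.1 * rng.normal(size=n) for _ in range(3)])

# Uncorrelated case: three independent variables with equal variance.
uncorrelated = rng.normal(size=(n, 3))

ratio_corr = explained_ratio(correlated)
ratio_uncorr = explained_ratio(uncorrelated)

print("correlated:  ", ratio_corr.round(3))
print("uncorrelated:", ratio_uncorr.round(3))
```

In the uncorrelated case, dropping any PC discards roughly a third of the variance, so PCA only reorders the variables rather than compressing them.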
PCA: Limitations
• Kind of…
• Curse of Dimensionality
• Lecture 2:
• t-SNE, UMAP and others…