
Dimensionality Reduction

Lecture 1. Introduction

Sangsoo Lim
Assistant Professor
School of AI Software Convergence
Dongguk University-Seoul
Course Overview

Lecture 1. Introduction to Dimensionality Reduction
- Introduction to dimensionality reduction
- Challenges in Bioinformatics
- Basics of Principal Component Analysis (PCA)
- PCA in Omics data

Lecture 2. Advanced Linear & Non-linear Methods
- Linear Discriminant Analysis (LDA)
- t-SNE and UMAP
- Regression
- Canonical Correlation Analysis (CCA)
- Dimension reduction in Multi-omics Analysis

Lecture 3. Specialized Techniques & Real-World Applications
- Pathway based Dimension Reduction in Bioinformatics
- Benchmarking joint dimension reduction
- Deep learning based methods

Lecture 4. Practices in Bioinformatics
Lecture 1.

Introduction to Dimensionality Reduction


What is Dimension?
: the measure of a specific aspect of a physical object or a mathematical or
conceptual construct

[Figure: a 3D object with height along the vertical axis, width along the horizontal axis, and depth along the third axis.]

Dimension vs Dimensionality
• Dimension: the measure of a specific aspect of a physical object or a mathematical
or conceptual construct
• Synonyms: variable, feature, attribute, column, field, …

• Dimensionality: the number of variables or attributes in a data set

• ex) In protein structure prediction, each amino acid (AA) in the protein sequence could be considered a dimension, and the "dimensionality" would then be the total number of AAs in the protein.
Importance of Dimensionality Reduction
• Definition of Dimensionality Reduction

• Brief list:
• Noise reduction

• Data visualization

• Efficient storage and computation


What is Dimensionality Reduction?
• An analogy: the shadow of a 3D object on the ground is its 2D representation

• Emphasis on retaining maximum information
Challenges in High-Dimensional Data
• As dimensions increase, the volume of the space increases exponentially
→ data becomes sparse

• Implications:
• Overfitting

• Computational inefficiency

• Hard to visualize
What Matters if (# Dimensions) > (# Samples)
• Curse of Dimensionality
: the explosive nature of increasing data dimensions and its resulting exponential
increase in computational efforts required for its processing and/or analysis.

Shashmi Karanam, ‘Curse of Dimensionality — A “Curse” to Machine Learning’ Towards Data Science (2021)
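The sparsity behind the curse can be seen directly. A small numpy sketch (with made-up uniform data, purely illustrative): as the dimension grows, the nearest and farthest neighbours of a point become almost equally far away, so distance-based analysis loses its discriminating power.

```python
import numpy as np

rng = np.random.default_rng(0)

# As the dimension d grows, pairwise distances between random points
# concentrate: nearest and farthest neighbours become almost equally
# far away, so the data is effectively sparse.
ratios = {}
for d in (2, 10, 100, 1000):
    X = rng.random((100, d))                    # 100 points in the unit hypercube
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists = dists[np.triu_indices(100, k=1)]    # unique point pairs only
    ratios[d] = dists.min() / dists.max()
    print(f"d={d:4d}  min/max pairwise distance ratio = {ratios[d]:.3f}")
```

The ratio approaches 1 as d increases, which is one concrete face of the curse of dimensionality.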
Curse of Dimensionality
• What is the optimal number of dimensions vs samples?
Complexity of Biological Data
• Multi-faceted nature of bio-data: genomics, proteomics, metabolomics, …

Kreitmaier, Peter, Georgia Katsoula, and Eleftheria Zeggini. "Insights from multi-omics integration in complex disease primary tissues." Trends in Genetics (2023).
Where are we heading?
• Multi-faceted nature of bio-data: genomics, proteomics, metabolomics, …
Gene Description Cell 1 Cell 2 Cell 3 Cell 4 Cell 5
Inpp5d inositol polyphosphate-5-phosphatase D 7.00 5.45 5.89 6.03 5.75
Aim2 absent in melanoma 2 3.01 4.37 4.59 4.38 4.18
Gldn gliomedin 3.48 3.63 4.61 4.70 4.74
Frem2 Fras1 related extracellular matrix protein 2 4.75 4.66 3.46 3.74 3.45
Rps3a1 ribosomal protein S3A1 6.10 7.23 7.44 7.36 7.34
Slc38a3 solute carrier family 38, member 3 1.90 3.16 3.52 3.61 3.19
Mt1 metallothionein 1 5.07 6.49 6.46 6.04 6.05
C1s1 complement component 1, s subcomponent 1 2.74 3.02 3.86 4.10 4.10
Cds1 CDP-diacylglycerol synthase 1 4.55 4.22 3.80 3.16 3.12
Ifi44 interferon-induced protein 44 4.82 4.52 3.87 3.42 3.59
Lefty2 left-right determination factor 2 6.95 6.28 5.88 5.60 5.61
Fmr1nb fragile X mental retardation 1 neighbor 4.28 2.78 3.10 3.25 2.57
Tagln transgelin 7.93 7.91 7.20 7.02 6.68

Each dot is a cell

Groups of dots are similar cells

Separation of groups could be interesting biology


Source from Simon Andrews of Babraham Bioinformatics
Too Much Data
• 5000 cells and 2500 measured genes
• Realistically only 2 dimensions we can plot (x,y)

Source from Simon Andrews of Babraham Bioinformatics
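The reduction from thousands of measured genes down to two plottable coordinates can be sketched in numpy. The matrix below is a hypothetical, scaled-down stand-in for the cells-by-genes matrix (the sizes, populations, and expression shift are invented for illustration, not real data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for a cells x genes expression matrix
# (the slide's real case is 5000 x 2500): two simulated cell populations,
# the first of which over-expresses 50 genes.
n_cells, n_genes = 500, 250
X = rng.normal(size=(n_cells, n_genes))
X[:250, :50] += 3.0

# Project every cell onto the top-2 principal directions via SVD of the
# centred matrix, giving one plottable (x, y) point per cell.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T                 # shape (500, 2)
print(coords.shape)
```

Cells from the two simulated populations separate along the first coordinate, which is exactly the "separation of groups could be interesting biology" effect from the previous slide.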


Complexity of Biological Data

https://www.biologyexams4u.com/2023/02/10-types-of-biological-databases.html#google_vignette
Data Types in Bioinformatics
• Which ‘omics data types are popular?

Ebrahim, Ali, et al. "Multi-omic data integration enables discovery of hidden biological regularities." Nature communications 7.1 (2016): 13091.
Characteristics of Single-Omics Data Sets
• Dimension: How many Biological Features?
• Size: How many Samples?

Gene

Sample

Feldner-Busztin, Dylan, et al. "Dealing with dimensionality: the application of machine learning to multi-omics data." Bioinformatics 39.2 (2023): btad021.
Popularity of Omics Data Types
• Which ‘omics data types are popular?

Feldner-Busztin, Dylan, et al. "Dealing with dimensionality: the application of machine learning to multi-omics data." Bioinformatics 39.2 (2023): btad021.
Dimensionality of Bioinformatics Data Sets
• The Cancer Genome Atlas (TCGA)

• “The field is heavily influenced by the use of TCGA dataset.”


Feldner-Busztin, Dylan, et al. "Dealing with dimensionality: the application of machine learning to multi-omics data." Bioinformatics 39.2 (2023).

www.cancer.gov

Weinstein, John N., et al. "The cancer genome atlas pan-cancer analysis project." Nature genetics 45.10 (2013): 1113-1120.
Dimension Reduction techniques for the integrative
analysis of multi-omics data (BiB, 2016)

• A recent dimension reduction analysis of bladder cancers identified:
• components associated with batch effects and GC content in the RNA sequencing data
• seven components specific to tumor cells
• three components associated with tumor stroma
• novel and known cancer-specific pathways

Biton, Anne, et al. "Independent component analysis uncovers the landscape of the bladder tumor transcriptome and reveals insights into luminal and basal subtypes." Cell reports 9.4 (2014): 1235-1245.
Dimension Reduction techniques for the integrative
analysis of multi-omics data (BiB, 2016)

• Exploratory data analysis (EDA) is an important early step in omics data analysis.
• Goal: summarize the data and detect batch effects and outliers

• Methods: cluster analysis and dimension reduction


• Cluster analysis: investigates pairwise distances between objects looking for fine relationships

• Dimension reduction: considers the global variance of the data set, highlighting general gradients or patterns

• Dimension reduction approaches decompose the data into a few new variables
(called components) that explain most of the differences in observations.
Reducing Matrix Dimension
• Often, our data can be represented by an m-by-n matrix.

• And this matrix can be closely approximated by the product of three
matrices that share a small common dimension r:

A (m × n) ≈ U (m × r) Σ (r × r) VT (r × n)

Source: Stanford CS246


Dimensionality Reduction
• There are hidden, or latent, factors (latent dimensions) that, to a close
approximation, explain why the values are as they appear in the data matrix.

Source: Stanford CS246


Dimensionality Reduction
The axes of these dimensions can be chosen by:
• The first dimension is the direction in which the points exhibit the greatest variance.

• The second dimension is the direction, orthogonal to the first, in which points show the 2nd
greatest variance.

• And so on…, until you have enough dimensions that variance is really low.

Source: Stanford CS246


Rank is “Dimensionality”
• Q: What is rank of the matrix A?

• A: Number of linearly independent rows of A

• Cloud of points in 3D space:

• Think of point coordinates as a matrix, 1 row per point:

  a: [ 1  2  1]
  b: [-2 -3  1]
  c: [ 3  5  0]

• We can rewrite the coordinates more efficiently.

• Old basis vectors: [1 0 0], [0 1 0], [0 0 1]

• New basis vectors: [1 2 1], [-2 -3 1]

• Then a has new coordinates [1, 0], b: [0, 1], c: [1, -1]


• Notice: We reduced the number of dimensions/coordinates!
Source: Stanford CS246
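The basis change above can be verified numerically; this short numpy sketch reproduces the slide's numbers:

```python
import numpy as np

# The three points from the slide, one row per point.
A = np.array([[ 1,  2, 1],    # a
              [-2, -3, 1],    # b
              [ 3,  5, 0]])   # c

# Only two rows are linearly independent, since c = a - b.
print(np.linalg.matrix_rank(A))                 # 2

# Coordinates in the new basis {a, b}: a -> [1, 0], b -> [0, 1], c -> [1, -1]
basis = np.array([[1, 2, 1], [-2, -3, 1]])      # new basis vectors as rows
coords = np.array([[1, 0], [0, 1], [1, -1]])
assert np.allclose(coords @ basis, A)           # reconstructs every point
```

Each point now needs only 2 coordinates instead of 3, because the rank (the "dimensionality") of the point cloud is 2.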
Singular Value Decomposition

• A: Input data matrix


• m x n matrix (e.g., m documents, n terms)

• U: Left singular vectors


• m x r matrix (m documents, r concepts)

• Σ: Singular values
• r x r diagonal matrix (strength of each ‘concept’)

(r : rank of the matrix A)

• V: Right singular vectors


• n x r matrix (n terms, r concepts)
Source: Stanford CS246
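A quick numpy sketch of these factor shapes (the 6x4 "documents by terms" matrix here is random and purely illustrative; for a full-rank random matrix, r equals min(m, n)):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                 # e.g. m=6 documents, n=4 terms

# Reduced ("economy") SVD: shapes match the m x r, r x r, r x n picture.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)      # (6, 4) (4,) (4, 4)

assert np.allclose(U @ np.diag(s) @ Vt, A)      # exact reconstruction
assert np.all(s[:-1] >= s[1:])                  # singular values sorted descending
```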
Singular Value Decomposition

• A ≈ UΣVT = Σᵢ σᵢ uᵢ ∘ vᵢᵀ

• σᵢ : scalar (the i-th singular value)
• uᵢ : vector (the i-th left singular vector)
• vᵢᵀ : vector (the i-th right singular vector)

Source: Stanford CS246


Singular Value Decomposition

Source: Stanford CS246


Singular Value Decomposition
A = UΣVT → example: Users to Movies

U: 'user-to-concept' factor matrix
V: 'movie-to-concept' factor matrix
Singular Value Decomposition

Movies, Users and Concepts:

• U: user-to-concept matrix

• Σ: Singular values

• V: Right singular vectors


SVD – Dimension Reduction

Instead of using two coordinates (x, y) to describe point positions, let's use only one coordinate:

• A point's position is its location along vector v1


SVD – Dimension Reduction

Q: How exactly is dimension reduction done?

A: Set the smallest singular values to zero.

This gives a rank-2 approximation to A. We could also do a rank-1 approximation. The larger the rank, the more accurate the approximation.

Reconstruction error is quantified by the Frobenius norm:

‖M‖F = √( Σᵢⱼ Mᵢⱼ² )        ‖A − B‖F = √( Σᵢⱼ (Aᵢⱼ − Bᵢⱼ)² )
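Truncation and its Frobenius error can be sketched in a few lines of numpy (random illustrative matrix). By the Eckart-Young theorem, the error equals the square root of the sum of the squared discarded singular values:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((8, 5))                          # illustrative random matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep the k largest singular values, zero the rest.
def truncate(k):
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

for k in (1, 2, 3):
    err = np.linalg.norm(A - truncate(k), ord="fro")
    # Eckart-Young: the Frobenius error is the square root of the sum
    # of the squared discarded singular values.
    assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
    print(f"rank {k}: Frobenius reconstruction error = {err:.3f}")
```

This makes the slide's point quantitative: the error you incur by zeroing the smallest singular values is exactly the "energy" those values carried.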
Basics of Principal Component Analysis (PCA)

• Definition: PCA can be thought of as a specific application of SVD

• Primary goal:
• Simplify data by reducing its dimensions
• Retain as much information as possible by maximizing variance

• SVD vs PCA:
• SVD: Produces three matrices U, Σ, and V*. U and V are orthogonal matrices,
and Σ is a diagonal matrix with singular values in decreasing order.
• PCA: Produces principal components, which are linear combinations of the
original features. The coefficients of these linear combinations are the
eigenvectors of the data's covariance matrix.
Mathematics behind PCA
• Eigenvalues: 12.4, 9.5, 1.3
Mathematics behind PCA
• When we reduce the data dimension from 3 to 2,

• We choose the two principal components that maximally describe data variation.

(12.4 + 9.5) / (12.4 + 9.5 + 1.3) = 0.944

• 94.4% of total variation can be explained by using two PCs (PC1 & PC2).
Mathematics behind PCA
• How to compute PCs?

1. Normalize data: X = (Data − μ) / σ

2. Compute covariance matrix K_XX of dataset X

3. Find eigenvalues & eigenvectors of K_XX

• Eigenvectors: direction of PCs

• Eigenvalues: variance along PCs
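The three steps can be sketched in numpy (the correlated 3-D data set here is simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated correlated 3-D data (values are illustrative only).
data = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.1],
                                             [0.0, 1.0, 0.3],
                                             [0.0, 0.0, 0.2]])

# 1. Normalise the data: X = (data - mu) / sigma
X = (data - data.mean(axis=0)) / data.std(axis=0)

# 2. Covariance matrix K_XX of X
K = np.cov(X, rowvar=False)

# 3. Eigendecomposition: eigenvectors give the PC directions,
#    eigenvalues give the variance along each PC.
eigvals, eigvecs = np.linalg.eigh(K)
order = np.argsort(eigvals)[::-1]               # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals[:2].sum() / eigvals.sum()
print(f"variance explained by the first two PCs: {explained:.1%}")
```

The last line is exactly the slide's 12.4 + 9.5 over the total calculation, just on simulated data: the ratio of the two largest eigenvalues to the eigenvalue sum.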


PCA: Explaining Variance
• Each PC always explains some proportion of the total variance in the data;
between them they explain everything

• PC1 always explains the most

• PC2 is the next highest, and so on

• Since we only plot 2 dimensions, we'd like to know that these two are a good
explanation

• How do we calculate this?


PCA in Face Recognition
• 20 images of 45x40 pixel size (1,800 dimensional)

[Figure: PCA decomposes the face images into eigenfaces (PCs), ranging from the common face, through distinct facial features, to noise.]


PCA in Face Recognition
• Each face image can be represented as weighted sum of eigenfaces (PCs)
• By discarding lower PCs, we can denoise faces

• Dimensionality reduction
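A sketch of that weighted-sum reconstruction, using random arrays in place of real face images (20 images of 45x40 = 1,800 pixels, as on the slide; the pixel values are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for 20 face images of 45x40 pixels, flattened
# to 1,800-dimensional vectors (random values, purely illustrative).
faces = rng.random((20, 45 * 40))

mean_face = faces.mean(axis=0)
U, s, Vt = np.linalg.svd(faces - mean_face, full_matrices=False)
# Rows of Vt are the 'eigenfaces' (PCs), strongest first.

# Each face = mean face + weighted sum of eigenfaces; keeping only the
# top-k eigenfaces discards the low-variance (noisy) components.
k = 5
weights = (faces - mean_face) @ Vt[:k].T        # per-face eigenface weights
denoised = mean_face + weights @ Vt[:k]
print(denoised.shape)                           # (20, 1800)
```

Each image is now described by k = 5 weights instead of 1,800 pixels, which is the dimensionality reduction the slide refers to.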
PCA in Genomics
• Simple example using 2 genes and 10 cells
PCA in Genomics
• Find line of best fit, passing through the origin
PCA in Genomics
• Assigning Loadings to Genes

Singular vector or 'eigenvector'

Loadings:
• Gene1 = 0.82
• Gene2 = 0.57

Higher loading equals more influence on the PC
PCA in Genomics
• More Dimensions

• The same idea extends to larger numbers of dimensions (n)

• The first PC rotates in (n−1) dimensions

• The next PC is perpendicular to PC1, but rotated similarly (n−2)

• The last PC is the remaining perpendicular direction (no choice)

• Same number of PCs as genes


PCA in Genomics

• Project onto PC
• Calculate distance to the origin

• Calculate sum of squared differences (SSD)


• This is a measure of variance called the
‘eigenvalue’

• Divide by (points-1) to get actual variance
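Those steps can be checked numerically. In this numpy sketch (simulated 2-gene data, invented for illustration), the projected sum of squares divided by (points − 1) matches the covariance eigenvalue exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated 2-gene expression values for 50 cells (illustrative data).
X = rng.normal(size=(50, 2)) @ np.array([[2.0, 0.5],
                                         [0.0, 0.5]])
X -= X.mean(axis=0)                 # centre, so the PCs pass through the origin

# PC1 direction from the covariance eigendecomposition.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = eigvecs[:, np.argmax(eigvals)]

# Project each point onto PC1 (its distance to the origin along the PC),
# sum the squares (the slide's SSD), then divide by (points - 1);
# in numpy's covariance convention this is the eigenvalue itself.
proj = X @ pc1
variance = (proj ** 2).sum() / (len(X) - 1)

assert np.isclose(variance, eigvals.max())
print(f"variance along PC1: {variance:.3f}")
```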


Explaining Variance – Scree Plots
PCA in Genomics
• Example plot of the first two principal components from a genomics dataset

PCi = (GeneA*10)+(GeneB*3)+(GeneC*-4)+(GeneD*-20)…

Di Nardo, Lucia, et al. "Molecular alterations in basal cell carcinoma subtypes." Scientific Reports 11.1 (2021): 13206.
PCA in Transcriptomics
• Single-cell RNA-seq analysis identified multiple cell types in mammary tumors.

Yuan, Wenlin, et al. "S100a4 upregulation in Pik3ca H1047R; Trp53 R270H; MMTV-Cre-driven mammary tumors promotes metastasis." Breast Cancer Research 21 (2019): 1-11.
PCA: Limitations
• The directions with largest variance are assumed to be of the most interest
• Only considers linear transformations of the original variables
• If the variables are correlated, PCA can achieve dimension reduction. If not, PCA
just orders them according to their variances
PCA: Limitations
• Kind of…

Non-linear separation of values


PCA: Limitations
• Kind of…

Not optimised for 2-dimensions


Summary
• Dimension: the measure of a specific aspect of a physical object or a mathematical
or conceptual construct

• Curse of Dimensionality

• Singular Value Decomposition & Principal Component Analysis

• Lecture 2:
• t-SNE, UMAP and others…
