INF2008 Lecture09
Week 09
Learning Objectives
• By the end of this lecture, students should be able to:
• Understand the Need for PCA
• Explain why dimensionality reduction is useful.
• Identify real-world applications where PCA is used.
• That’s a lot of information to think about for every meal! Hence it is important to simplify it to just a few key numbers.
Motivating Example: PCA and Understanding Food Choices
• Step 1: Finding the Important Patterns
• You realize that some of these numbers tend to be related:
• Foods high in carbohydrates are often high in sugar too (like cakes and soft drinks).
• Foods high in fat are often high in calories (like fried food).
• Foods with high sodium usually don’t have much sugar (like potato chips or processed meats).
• Instead of worrying about all these numbers separately, you could group them into a few key factors that summarize
the most important health aspects!
• Now, instead of tracking six different numbers, you’ve reduced everything to just three key factors.
• This is exactly what PCA does with complex data—it finds patterns and reduces the number of things you need to
focus on while keeping the most important information intact.
Motivating Example: PCA and Understanding Food Choices
• Step 3: Applying This Knowledge
• If you need more energy, you can focus on the Energy intake factor instead of separately checking calories, carbs, and
sugar.
• If you're watching your heart health, you focus on the Heart health factor instead of stressing over fat and sodium
separately.
• If you need more muscle recovery, you check the Protein intake factor.
• What is the advantage of doing this? You make smarter and faster decisions without getting lost in too much detail.
• Just like we grouped similar food components together, PCA takes large amounts of raw data (like all the food label
numbers) and finds the most meaningful patterns. It then reduces the complexity so we can focus on the key factors that
matter most.
Components in PCA
• In PCA, components are the original individual variables (features) in the dataset.
• In our food example, these are the detailed nutritional values for each food:
• Calories
• Carbohydrates
• Sugar
• Fat
• Protein
• Sodium (Salt)
• Each of these components contributes differently to a food’s overall nutritional profile, just like how different variables
contribute to a dataset.
Principal Components in PCA
• PCA finds patterns in the data and creates Principal Components (PCs), which are new, combined variables that capture the
most important information in a simplified way.
• PCA analyzes the relationships among components and finds which ones tend to vary together.
• For example:
• Calories, Carbohydrates, and Sugar are often high together (think of soft drinks and cakes).
• Fat and Sodium tend to be high together (think of fried food and processed meat).
• Protein behaves independently (e.g., chicken breast is high in protein but low in sugar and carbs).
Step 1: Standardize the Data
Food Item Calories Carbohydrates Sugar Fat Sodium Protein
Chicken 165.0 0.0 0.0 3.6 74.0 31.0
Apple 95.0 25.0 19.0 0.3 2.0 0.5
Cake 350.0 50.0 35.0 15.0 300.0 5.0
Scaling the Calories column, using X_scaled = (X − X̄) / σ:
Food Item Original Value Mean Std Dev Scaled Value
Chicken 165.0 203.3333 107.57426375 -0.35634
Apple 95.0 203.3333 107.57426375 -1.00706
Chocolate Cake 350.0 203.3333 107.57426375 1.3634
Test Yourself 1: Calculate the scaled values for sugar
Food Item Calories Carbohydrates Sugar Fat Sodium Protein
Chicken 165.0 0.0 0.0 3.6 74.0 31.0
Apple 95.0 25.0 19.0 0.3 2.0 0.5
Cake 350.0 50.0 35.0 15.0 300.0 5.0
Food Item Original Value Mean Std Dev Std Scaled Value
Chicken
Apple
Chocolate Cake
Test Yourself 1: Calculate the scaled values for sugar
Food Item Calories Carbohydrates Sugar Fat Sodium Protein
Chicken 165.0 0.0 0.0 3.6 74.0 31.0
Apple 95.0 25.0 19.0 0.3 2.0 0.5
Cake 350.0 50.0 35.0 15.0 300.0 5.0
Scaling the Sugar column, using the same formula X_scaled = (X − X̄) / σ:
Food Item Original Value Mean Std Dev Scaled Value
Chicken 0.0 18 14.30617582 -1.2582
Apple 19.0 18 14.30617582 0.0699
Chocolate Cake 35.0 18 14.30617582 1.188298
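As a quick check of these numbers, here is a minimal NumPy sketch of the standardization step, using the three-food example above (the array layout is an illustrative assumption):

```python
import numpy as np

# Three-food example from the slides (rows: Chicken, Apple, Cake;
# columns: Calories, Carbohydrates, Sugar, Fat, Sodium, Protein)
X = np.array([
    [165.0,  0.0,  0.0,  3.6,  74.0, 31.0],
    [ 95.0, 25.0, 19.0,  0.3,   2.0,  0.5],
    [350.0, 50.0, 35.0, 15.0, 300.0,  5.0],
])

# Standardize every column: (X - mean) / std (population std, as in the worked example)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled[:, 0])  # Calories column: matches -0.35634, -1.00706, 1.3634 above
print(X_scaled[:, 2])  # Sugar column:    matches -1.2582,   0.0699,  1.1883 above
```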
• The covariance matrix tells us how different features (Calories, Carbohydrates, Sugar, etc.) relate to each other:
• Positive values indicate a strong positive relationship (i.e., when one value increases, the other also increases).
• Negative values indicate an inverse relationship (i.e., when one value increases, the other decreases).
• Values close to zero suggest little or no relationship between the features.
• The size of the covariance matrix in PCA is determined by the number of features (attributes), not the number of samples
(rows).
• Since your dataset has 6 attributes (Calories, Carbohydrates, Sugar, Fat, Sodium, Protein), the covariance matrix will be 6 × 6 because it captures the variance of and relationships between the attributes.
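A minimal NumPy sketch of this step, continuing the standardized three-food example from above:

```python
import numpy as np

X = np.array([
    [165.0,  0.0,  0.0,  3.6,  74.0, 31.0],   # Chicken
    [ 95.0, 25.0, 19.0,  0.3,   2.0,  0.5],   # Apple
    [350.0, 50.0, 35.0, 15.0, 300.0,  5.0],   # Cake
])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance between features: rowvar=False means each column is a feature,
# so the result has one row/column per feature -> a 6 x 6 matrix here.
C = np.cov(X_scaled, rowvar=False)
print(C.shape)          # (6, 6)
print(np.round(C, 3))   # positive entries = features that rise together, negative = inverse
```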
Step 3: Compute Eigenvalues and Eigenvectors
• The next step calculates eigenvectors and eigenvalues of the covariance matrix C through a process known as eigendecomposition:
C W = λ W
where λ is an eigenvalue and W the corresponding eigenvector.
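A minimal NumPy sketch of the eigendecomposition, continuing the same tiny example (with only three foods the numbers are purely illustrative and need not match the loadings quoted on the next slide; eigenvector signs can also flip between libraries):

```python
import numpy as np

X = np.array([
    [165.0,  0.0,  0.0,  3.6,  74.0, 31.0],   # Chicken
    [ 95.0, 25.0, 19.0,  0.3,   2.0,  0.5],   # Apple
    [350.0, 50.0, 35.0, 15.0, 300.0,  5.0],   # Cake
])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(X_scaled, rowvar=False)

# Solve C W = lambda W for the symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)

# eigh returns eigenvalues in ascending order; reverse so PC1 comes first
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(np.round(eigvals, 4))          # variance captured by each component
print(np.round(eigvecs[:, 0], 4))    # loadings of PC1 (one weight per original feature)
```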
Principal Component 1 (PC1)
• Strongly influenced by:
• Calories (0.4227)
• Carbohydrates (0.44288)
• Sugar (0.43458)
• Fat (0.43303)
• Sodium (0.42967)
• PC1 represents a general measure of energy-dense foods. Calories, Carbohydrates, Sugar, Fat, and Sodium all contribute positively, meaning they increase along PC1. Protein has a smaller negative contribution, suggesting that foods higher in protein tend to have lower energy density.
Test Yourself 3:
3.1 What does a high score in PC1 indicate?
a) The food item is high in Protein but low in Sugar and Carbohydrates.
b) The food item is high in Calories, Carbohydrates, Sugar, Fat, and Sodium.
c) The food item is mostly water and contains minimal nutrients.
d) The food item is primarily rich in Sodium but low in Calories.
Answer: B (PC1 represents overall energy density, so a high score means a food is high in Calories, Carbs, Sugar, Fat, and
Sodium.)
3.2 A food item that is high in Protein but low in Sugar and Carbohydrates is likely to have…
a) A high score on PC1
b) A low score on PC1
c) A high score on PC2
d) A low score on PC2
Answer: C (PC2 contrasts Protein-rich foods against Carbohydrate/Sugar-rich foods, so a high-PC2 score means a food is rich in
Protein.)
Step 4: Compute Principal Components
• In this step, we project the standardized data onto the eigenvectors (principal component directions).
• Each row in the result represents a transformed data point in the new PCA space.
• Each row of 6 features is converted into three PC scores (PC1, PC2 and PC3); a short code sketch follows below.
• Instead of representing food items by Calories, Carbs, Sugar, etc., they are now represented in terms of PC1, PC2, and PC3.
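A minimal sketch of this projection, continuing the earlier NumPy example (again, with only three foods the resulting scores are illustrative only):

```python
import numpy as np

X = np.array([
    [165.0,  0.0,  0.0,  3.6,  74.0, 31.0],   # Chicken
    [ 95.0, 25.0, 19.0,  0.3,   2.0,  0.5],   # Apple
    [350.0, 50.0, 35.0, 15.0, 300.0,  5.0],   # Cake
])
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
C = np.cov(X_scaled, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order][:, :3]      # eigenvector matrix: 6 features x 3 principal components

# Project the standardized data onto the principal component directions
scores = X_scaled @ W             # 3 foods x 3 PC scores (PC1, PC2, PC3)
print(np.round(scores, 4))
```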
Step 4: Compute Principal Components
Calories_scaled Carbohydrates_scaled Sugar_scaled Fat_scaled Sodium_scaled Protein_scaled
Chicken -0.35634 -1.22474 -1.2582 -0.42873 -0.40433 1.400946
Apple -1.00706 0 0.0699 -0.95274 -0.97145 -0.86784
Cake 1.363399 1.224745 1.188298 1.381475 1.375788 -0.5331
Multiply the scaled data matrix above by the eigenvector matrix to get each food's PC scores. Only the Calories row of the eigenvector matrix is shown (PC1 = 0.4227, PC2 = 0.3601, PC3 = -0.466).
Apple's PC2 score = ?
Test Yourself 4:
Calories_scaled Carbohydrates_scaled Sugar_scaled Fat_scaled Sodium_scaled Protein_scaled
Chicken -0.35634 -1.22474 -1.2582 -0.42873 -0.40433 1.400946
Apple -1.00706 0 0.0699 -0.95274 -0.97145 -0.86784
Cake 1.363399 1.224745 1.188298 1.381475 1.375788 -0.5331
Multiply the scaled data matrix above by the eigenvector matrix. Only the Calories row of the eigenvector matrix is shown:
Feature PC1 PC2 PC3
Calories 0.4227 0.3601 -0.466
Calculate Apple's PC2 score.
Visualizing the explained variance helps you see how much of the total variance each principal component captures (see the sketch below).
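A minimal sketch of such a plot with scikit-learn and matplotlib (the random matrix here is only a stand-in for whatever dataset you analyse):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))            # placeholder data: 100 samples, 6 features

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)

ratios = pca.explained_variance_ratio_   # fraction of variance captured by each PC
plt.bar(range(1, len(ratios) + 1), ratios, label="per component")
plt.plot(range(1, len(ratios) + 1), np.cumsum(ratios), "ro-", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.legend()
plt.show()
```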
Code Analysis (I)
https://colab.research.google.com/drive/1VlOo4BL3V4aV14i15sXUiePy4EKi6oHW?usp=sharing
INF2008: Machine Learning
Unsupervised Learning (II)
t-SNE
Week 09
Learning Objectives
• By the end of this section, learners should be able to:
• Understand the Motivation Behind t-SNE
• Explain why t-SNE is used for dimensionality reduction.
• Critically Analyze the Behavior of t-SNE.
t-SNE Simple Intuition.
• Step 1: Measuring Probabilities in the Marble Bag (High-Dimensional Space)
• Some marbles feel really similar (same size, texture), while others feel very different.
• A soft probability rule:
• If two marbles are similar, the probability of them being neighbors is high.
• If two marbles are far apart, the probability is much lower.
• Mathematically, this is done using a Gaussian distribution (a bell curve), where distances between marbles are
transformed into probabilities.
• So from the 5x9 matrix we started with, we will instead create a 5x5 matrix that compares every image with every other image.
• We will use the distance between Img1 and Img2 as an example.
Table showing the distance calculation between Img1 and Img2.
P1 P2 P3 P4 P5 P6 P7 P8 P9
Img1 102 179 92 14 106 71 188 20 102
Img2 121 210 214 74 202 87 116 99 103
Img1-Img2 -19 -31 -122 -60 -96 -16 72 -79 -1
(Img1-Img2)² 361 961 14884 3600 9216 256 5184 6241 1
Sum of Squares 40704
Root Sum of Squares 201.7523
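The same calculation in a short NumPy sketch, using the Img1 and Img2 pixel values from the table:

```python
import numpy as np

img1 = np.array([102, 179, 92, 14, 106, 71, 188, 20, 102], dtype=float)
img2 = np.array([121, 210, 214, 74, 202, 87, 116, 99, 103], dtype=float)

diff = img1 - img2              # element-wise differences (P1..P9)
sum_sq = np.sum(diff ** 2)      # sum of squared differences
dist = np.sqrt(sum_sq)          # Euclidean distance, same as np.linalg.norm(img1 - img2)

print(sum_sq, dist)             # 40704.0, ~201.75
```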
• P(j|i) is the conditional probability that point j is a neighbor of point i in the original high-dimensional space.
P(j|i) = exp(−d_ij² / (2σ_i²)) / Σ_{k≠i} exp(−d_ik² / (2σ_i²))
• The formula looks nightmarish but is actually quite easy to understand.
• d_ij is the distance between image i and image j that we calculated in Step 1 (the formula uses its square, d_ij²).
• We need d12 = 201.7523 and row 1 of the distance matrix.
• Let’s assume a fixed σ1 = 100.
exp(−d12² / (2σ1²)) = exp(−(201.7523)² / (2 × 100²))
= exp(−40703.9905 / 20000)
= exp(−2.0352)
≈ 0.13065
Exponential Unnormalized:
Img1 Img2 Img3 Img4 Img5 Sum
Img1 0.00000 0.13065 0.10036 0.03029 0.00016 0.261474
Img2 0.13065 0.00000 0.02437 0.07786 0.00026 0.23314
Img3 0.10036 0.02437 0.00000 0.00083 0.00195 0.127513
Img4 0.03029 0.07786 0.00083 0.00000 0.00031 0.109297
Img5 0.00016 0.00026 0.00195 0.00031 0.00000 0.002675
P(3|1) = ? (Img1 and Img3 have an x% chance of being neighbors.)
Normalized Probabilities.
Img1 Img2 Img3 Img4 Img5
Img1 0 0.49968 0.38384 0.11586 0.00062
Img2 0.56041 0 0.10451 0.33398 0.0011
Img3 0.78709 0.19109 0 0.00653 0.01529
Img4 0.27717 0.7124 0.00762 0 0.00281
Img5 0.06047 0.09574 0.72897 0.11482 0
Test Yourself 2: Calculate 𝑃3|1
Table showing the pairwise distance matrix:
Img1 Img2 Img3 Img4 Img5
Img1 0 201.7523 214.4271 264.4542 417.8397
Img2 201.7523 0 272.5638 225.9557 406.6915
Img3 214.4271 272.5638 0 376.5833 353.269
Img4 264.4542 225.9557 376.5833 0 402.199
Img5 417.8397 406.6915 353.269 402.199 0
P(j|i) = exp(−d_ij² / (2σ_i²)) / Σ_{k≠i} exp(−d_ik² / (2σ_i²))
exp(−d13² / (2σ1²)) = exp(−(214.4271)² / (2 × 100²))
= exp(−45978.98 / 20000)
= exp(−2.298949)
≈ 0.10036
Exponential Unnormalized:
Img1 Img2 Img3 Img4 Img5 Sum
Img1 0.00000 0.13065 0.10036 0.03029 0.00016 0.261474
Img2 0.13065 0.00000 0.02437 0.07786 0.00026 0.23314
Img3 0.10036 0.02437 0.00000 0.00083 0.00195 0.127513
Img4 0.03029 0.07786 0.00083 0.00000 0.00031 0.109297
Img5 0.00016 0.00026 0.00195 0.00031 0.00000 0.002675
P(3|1) = 0.10036 / 0.261474 = 0.38384
Normalized Probabilities.
Img1 Img2 Img3 Img4 Img5
Img1 0 0.49968 0.38384 0.11586 0.00062
Img2 0.56041 0 0.10451 0.33398 0.0011
Img3 0.78709 0.19109 0 0.00653 0.01529
Img4 0.27717 0.7124 0.00762 0 0.00281
Img5 0.06047 0.09574 0.72897 0.11482 0
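A minimal NumPy sketch of Steps 1-2, starting from the pairwise distance matrix shown above and assuming the same fixed σ = 100 for every point (real t-SNE tunes each σ_i to match a target perplexity):

```python
import numpy as np

# Pairwise Euclidean distances between the 5 images (values from the slides)
D = np.array([
    [  0.0,    201.7523, 214.4271, 264.4542, 417.8397],
    [201.7523,   0.0,    272.5638, 225.9557, 406.6915],
    [214.4271, 272.5638,   0.0,    376.5833, 353.269 ],
    [264.4542, 225.9557, 376.5833,   0.0,    402.199 ],
    [417.8397, 406.6915, 353.269,  402.199,    0.0   ],
])

sigma = 100.0                               # fixed bandwidth assumed in the slides
affinity = np.exp(-D**2 / (2 * sigma**2))   # Gaussian kernel ("Exponential Unnormalized")
np.fill_diagonal(affinity, 0.0)             # a point is never its own neighbour

# Normalize each row so it sums to 1 -> conditional probabilities P(j|i)
P = affinity / affinity.sum(axis=1, keepdims=True)

print(np.round(P, 5))                       # P[0, 2], i.e. P(3|1), is about 0.38384
```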
Step 3: Generate random (x, y) coordinates for each point in 2D space.
• We generate random (x, y) coordinates for each of our 5 points in 2D space:
• Y_i ~ N(0, 0.01)
• This ensures a small random spread around zero, allowing gradual movement during optimization.
Random Points in 2D Space.
X (2D) Y (2D)
Img1 49.67142 -13.8264
Img2 64.76885 152.303
Img3 -23.4153 -23.4137
Img4 157.9213 76.74347
Img5 -46.9474 54.256
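A minimal sketch of this initialization (the seed, and reading N(0, 0.01) as a variance, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)     # illustrative seed

n_points = 5
# Small Gaussian spread around zero: variance 0.01 -> standard deviation 0.1
Y = rng.normal(loc=0.0, scale=0.1, size=(n_points, 2))
print(Y)                            # one (x, y) starting position per image
```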
Step 4: Compute 𝑄𝑖𝑗 (Low-Dimensional Similarities).
Random Points in 2D Space:
X (2D) Y (2D)
Img1 49.67142 -13.8264
Img2 64.76885 152.303
Img3 -23.4153 -23.4137
Img4 157.9213 76.74347
Img5 -46.9474 54.256
Squared Euclidean Distance:
Img1 Img2 Img3 Img4 Img5
Img1 0 27826.92 5433.589 19920.94 13970.42
Img2 27826.92 0 38652.8 14386.61 22093.74
Img3 5433.589 38652.8 0 42914.43 6586.342
Img4 19920.94 14386.61 42914.43 0 42476.88
Img5 13970.42 22093.74 6586.342 42476.88 0
Q_ij = (1 + ||Y_i − Y_j||²)^(-1) / Σ_{k≠l} (1 + ||Y_k − Y_l||²)^(-1)
Compute the squared Euclidean distance:
||Y_1 − Y_2||² = (x_1 − x_2)² + (y_1 − y_2)²
= (49.67142 − 64.76885)² + (−13.8264 − 152.303)²
= (−15.0974)² + (−166.1294)²
= 227.9326 + 27598.98
= 27826.92
So, the squared distance is 27826.92.
Numerator = (1 + ||Y_1 − Y_2||²)^(-1) = (1 + 27826.92)^(-1) = 3.59351E-05
Denominator = 0.001362 (working shown later)
Q_12 = 3.59351E-05 / 0.001362 = 0.026384096
Q_13 = ? / 0.001362 (see Test Yourself 3)
Q normalized:
Img1 Img2 Img3 Img4 Img5
Img1 0 0.026384 0.135101 0.036855 0.052551
Img2 0.026384 0 0.018995 0.051031 0.03323
Img3 0.135101 0.018995 0 0.017108 0.111459
Img4 0.036855 0.051031 0.017108 0 0.017285
Img5 0.052551 0.03323 0.111459 0.017285 0
Test Yourself 3: Calculate Q13
Random Points in 2D Space:
X (2D) Y (2D)
Img1 49.67142 -13.8264
Img2 64.76885 152.303
Img3 -23.4153 -23.4137
Img4 157.9213 76.74347
Img5 -46.9474 54.256
Squared Euclidean Distance:
Img1 Img2 Img3 Img4 Img5
Img1 0 27826.92 5433.589 19920.94 13970.42
Img2 27826.92 0 38652.8 14386.61 22093.74
Img3 5433.589 38652.8 0 42914.43 6586.342
Img4 19920.94 14386.61 42914.43 0 42476.88
Img5 13970.42 22093.74 6586.342 42476.88 0
Q_ij = (1 + ||Y_i − Y_j||²)^(-1) / Σ_{k≠l} (1 + ||Y_k − Y_l||²)^(-1)
Compute the squared Euclidean distance:
||Y_1 − Y_3||² = (x_1 − x_3)² + (y_1 − y_3)²
= (49.67142 − (−23.4153))² + (−13.8264 − (−23.4137))²
= (73.08672)² + (9.5873)²
= 5341.669 + 91.91632
= 5433.585
So, the squared distance is 5433.585.
Numerator = (1 + ||Y_1 − Y_3||²)^(-1) = (1 + 5433.585)^(-1) = 0.000184
Denominator = 0.001362 (working shown later)
Q_13 = 0.000184 / 0.001362 = 0.1351
Q normalized:
Img1 Img2 Img3 Img4 Img5
Img1 0 0.026384 0.135101 0.036855 0.052551
Img2 0.026384 0 0.018995 0.051031 0.03323
Img3 0.135101 0.018995 0 0.017108 0.111459
Img4 0.036855 0.051031 0.017108 0 0.017285
Img5 0.052551 0.03323 0.111459 0.017285 0
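A minimal NumPy sketch of Step 4, using the 2D coordinates from the table above and the Student-t kernel:

```python
import numpy as np

# Low-dimensional (2D) coordinates of the 5 images, from the slides
Y = np.array([
    [ 49.67142, -13.8264 ],
    [ 64.76885, 152.303  ],
    [-23.4153,  -23.4137 ],
    [157.9213,   76.74347],
    [-46.9474,   54.256  ],
])

# Squared Euclidean distance between every pair of 2D points
diff = Y[:, None, :] - Y[None, :, :]
sq_dist = np.sum(diff**2, axis=-1)

# Student-t kernel (1 + ||Y_i - Y_j||^2)^(-1), excluding i = j
num = 1.0 / (1.0 + sq_dist)
np.fill_diagonal(num, 0.0)

# Normalize over all pairs k != l to get Q_ij
Q = num / num.sum()                 # num.sum() is the 0.001362 denominator from the slides
print(np.round(Q, 6))               # Q[0, 1] ~ 0.026384, Q[0, 2] ~ 0.135101
```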
Step 5: Updating 2D Positions Using Gradient Descent.
P_ij, Normalized Probabilities:
Img1 Img2 Img3 Img4 Img5
Img1 0 0.49968 0.38384 0.11586 0.00062
Img2 0.56041 0 0.10451 0.33398 0.0011
Img3 0.78709 0.19109 0 0.00653 0.01529
Img4 0.27717 0.7124 0.00762 0 0.00281
Img5 0.06047 0.09574 0.72897 0.11482 0
Q normalized:
Img1 Img2 Img3 Img4 Img5
Img1 0 0.026384 0.135101 0.036855 0.052551
Img2 0.026384 0 0.018995 0.051031 0.03323
Img3 0.135101 0.018995 0 0.017108 0.111459
Img4 0.036855 0.051031 0.017108 0 0.017285
Img5 0.052551 0.03323 0.111459 0.017285 0
x_j x1 - x_j P_1j (Reference) Q_1j P_1j - Q_1j (P_1j - Q_1j) * Q_1j (P_1j - Q_1j) * Q_1j * (x1 - x_j)
Img1 49.67142 0 0 0 0 0 0
Img2 64.76885 -15.0974 0.49968 0.026384 0.473296 0.012488 -0.18853
Img3 -23.4153 73.08672 0.38384 0.135101 0.248739 0.033605 2.456071
Img4 157.9213 -108.25 0.11586 0.036855 0.079005 0.002912 -0.31519
Img5 -46.9474 96.61882 0.00062 0.052551 -0.05193 -0.00273 -0.26368
Sum 1.688668
Sum * 4 6.75
• The factor of 4 comes from the derivative of the KL divergence with respect to the low-dimensional points, which involves the Student’s t-distribution kernel.
• If Q_1j is large, the points are already close, so we move them less to prevent overcorrection.
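A minimal sketch of the tabulated update for Img1's x-coordinate, using the P and Q matrices from the earlier steps (the learning rate is an illustrative choice; a full implementation repeats this for every point and both coordinates, usually with momentum):

```python
import numpy as np

P = np.array([   # normalized high-dimensional similarities (Step 2)
    [0,       0.49968, 0.38384, 0.11586, 0.00062],
    [0.56041, 0,       0.10451, 0.33398, 0.0011 ],
    [0.78709, 0.19109, 0,       0.00653, 0.01529],
    [0.27717, 0.7124,  0.00762, 0,       0.00281],
    [0.06047, 0.09574, 0.72897, 0.11482, 0      ],
])
Q = np.array([   # normalized low-dimensional similarities (Step 4)
    [0,        0.026384, 0.135101, 0.036855, 0.052551],
    [0.026384, 0,        0.018995, 0.051031, 0.03323 ],
    [0.135101, 0.018995, 0,        0.017108, 0.111459],
    [0.036855, 0.051031, 0.017108, 0,        0.017285],
    [0.052551, 0.03323,  0.111459, 0.017285, 0       ],
])
x = np.array([49.67142, 64.76885, -23.4153, 157.9213, -46.9474])  # current x-coordinates

i = 0  # Img1
# Gradient for x_1, exactly as tabulated above: 4 * sum_j (P_1j - Q_1j) * Q_1j * (x_1 - x_j)
grad_x = 4 * np.sum((P[i] - Q[i]) * Q[i] * (x[i] - x))
print(round(grad_x, 2))            # ~ 6.75

learning_rate = 10.0               # illustrative value
x_new = x[i] - learning_rate * grad_x
print(round(x_new, 2))             # Img1's x moves opposite to the gradient
```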