Here's the structured breakdown for Pruning Regression Trees (Cost-Complexity Pruning),
outlined algorithmically:
1. Purpose of Pruning
1.1. Objective:
Reduce tree complexity to improve generalization on testing data.
1.2. Key Goal:
Prevent overfitting the training data while ensuring the tree performs well on testing
data.
1.3. Trade-Off:
Balance reducing residuals on the training data against minimizing tree complexity.
3. Cost-Complexity Pruning
3.1. Definition:
Combine the sum of squared residuals (SSR) with a penalty for tree complexity:
Tree Score = SSR + α · (number of leaves)
where α is a tuning parameter controlling the complexity penalty.
3.2. Process:
For each subtree, calculate the Tree Score using different α values.
Example:
o α = 10,000:
Tree Score (Full Tree) = 543.8 + 10,000 · 4 = 40,543.8
o Repeat for subtrees with fewer leaves.
3.3. Select Optimal Subtree:
Choose the subtree with the lowest Tree Score for a given α (see the sketch below).
3.4. Varying α:
As α increases, larger penalties are imposed for complexity, leading to smaller trees.
Example:
o α = 10,000: Subtree with 2 leaves is optimal.
o α = 22,000: Subtree with 1 leaf is optimal.
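To make the tree-score bookkeeping concrete, here is a minimal Python sketch over a handful of candidate subtrees. Only the full tree's SSR (543.8) and its 4 leaves come from the example above; the other subtrees' SSR values are hypothetical, chosen so the α = 10,000 and α = 22,000 outcomes match the example.

```python
# Cost-complexity scoring sketch. Only the full tree's SSR (543.8, 4 leaves)
# comes from the notes above; the other SSR values are hypothetical.
candidate_subtrees = [
    {"name": "full tree (4 leaves)", "ssr": 543.8,    "leaves": 4},
    {"name": "3 leaves",             "ssr": 5_000.0,  "leaves": 3},
    {"name": "2 leaves",             "ssr": 14_000.0, "leaves": 2},
    {"name": "1 leaf",               "ssr": 30_000.0, "leaves": 1},
]

def tree_score(ssr, leaves, alpha):
    """Tree Score = SSR + alpha * number of leaves."""
    return ssr + alpha * leaves

for alpha in (0, 10_000, 22_000):
    scores = {t["name"]: tree_score(t["ssr"], t["leaves"], alpha)
              for t in candidate_subtrees}
    best = min(scores, key=scores.get)
    print(f"alpha = {alpha:>6}: best subtree -> {best}")
```

With these illustrative numbers, α = 0 keeps the full tree, α = 10,000 selects the 2-leaf subtree, and α = 22,000 selects the 1-leaf subtree, matching the example above.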
5. Key Observations
5.1. Effect of α:
Smaller α: More leaves (complex tree).
Larger α: Fewer leaves (simpler tree).
5.2. Balancing Fit and Complexity:
Increasing α reduces the tree's fit to training data but improves generalization.
5.3. Optimal α:
Achieves the lowest SSR on testing data while maintaining sufficient simplicity.
Output
A pruned regression tree that balances training fit and testing generalization.
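In practice the α sweep and subtree selection can be automated. The sketch below assumes scikit-learn (a tooling choice, not something these notes specify): DecisionTreeRegressor.cost_complexity_pruning_path returns the candidate α values at which subtrees become optimal, and we then pick the α that does best on held-out data. The full procedure would use cross-validation on the training data; a single train/test split is used here only to keep the sketch short.

```python
# Sketch: cost-complexity pruning with scikit-learn (assumed tooling).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Candidate alphas at which subtrees of the full tree become optimal.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = None, -np.inf
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_test, y_test)  # held-out R^2 as a proxy for generalization
    if score > best_score:
        best_alpha, best_score = alpha, score

pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X_train, y_train)
print(f"best alpha = {best_alpha:.3f}, leaves = {pruned.get_n_leaves()}, "
      f"test R^2 = {best_score:.3f}")
```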
Here’s the structured, algorithmic breakdown for Handling Missing Data in Random
Forests:
Algorithm: Handling Missing Data in Random Forests
1. Overview
1.1. Types of Missing Data:
Missing data in the original dataset used for Random Forest training.
Missing data in new samples that require classification.
1.2. Goal:
Impute the missing values (rather than discarding samples) so that trees can be built
from the training data and new samples can still be classified.
5. Key Observations
5.1. Iterative Nature:
The iterative process ensures convergence for both categorical and numeric data.
5.2. Advantages of Proximity Matrix:
Provides insights into relationships between samples.
Facilitates advanced visualization techniques.
5.3. Flexibility:
Handles both training and new data seamlessly.
Conclusion
Random Forests provide robust techniques for imputing missing data using proximity-
based refinements and weighted calculations.
Visualization tools like heatmaps and MDS plots offer additional value in analyzing
sample relationships.
Iterative approaches ensure accurate and consistent estimations, improving overall
model reliability.
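As a rough illustration of the proximity-based, iteratively refined imputation described above, here is a sketch. It assumes scikit-learn's RandomForestClassifier, uses "landed in the same leaf" counts from forest.apply() as the proximity matrix, and performs a single refinement pass on one toy numeric value; the data and the missing entry are illustrative, and the full procedure repeats the refinement until the estimates stabilize.

```python
# Sketch of proximity-based imputation of a missing numeric value.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # toy numeric features
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy labels

X_missing = X.copy()
X_missing[5, 2] = np.nan                      # pretend one value is missing

# Step 1: initial (rough) guess -- here simply the column median.
X_filled = X_missing.copy()
X_filled[5, 2] = np.nanmedian(X_missing[:, 2])

# Step 2: fit a forest and build the proximity matrix: proximity[i, j] is the
# fraction of trees in which samples i and j end up in the same leaf.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_filled, y)
leaves = forest.apply(X_filled)               # shape (n_samples, n_trees)
proximity = np.zeros((len(X), len(X)))
for t in range(leaves.shape[1]):
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    proximity += same_leaf
proximity /= leaves.shape[1]

# Step 3: refine the guess as a proximity-weighted average of the other
# samples' values (repeat steps 2-3 until the estimate stops changing).
weights = proximity[5].copy()
weights[5] = 0.0
X_filled[5, 2] = np.average(X_filled[:, 2], weights=weights)
print("refined estimate:", round(float(X_filled[5, 2]), 3),
      "true value:", round(float(X[5, 2]), 3))
```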
Here's the explanation of Principal Component Analysis (PCA), inspired by the StatQuest
video:
PC1 is the line that explains the largest amount of variation in the data.
The equation of the line gives a "recipe" for combining the variables, called a linear
combination (e.g., 4 parts Gene 1 + 1 part Gene 2).
Step 6: Find the Second Principal Component (PC2)
PC2 is the line perpendicular to PC1 that explains the second-largest amount of
variation.
Like PC1, PC2 has its own "recipe" for combining variables.
Important Terms
1. Eigenvalues
o Measure the amount of variation a principal component explains.
o For example, if PC1 has an eigenvalue of 15 and PC2 has an eigenvalue of 3,
PC1 explains 83% of the variation (15/18).
2. Eigenvectors
o Represent the direction of the principal components.
o For example, an eigenvector might indicate that PC1 consists of 0.97 parts
Gene 1 and 0.242 parts Gene 2.
3. Loading Scores
o Indicate how much each variable contributes to a principal component.
o Higher scores mean a variable is more important in explaining the variation.
4. Scree Plot
o A graph showing the proportion of variation explained by each principal
component.
o This helps decide how many PCs to keep for analysis.
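A quick worked check of the eigenvalue example above (the values 15 and 3 are the ones used in the example):

```python
# Eigenvalues from the example above: PC1 = 15, PC2 = 3.
pc1, pc2 = 15, 3
print(f"PC1 explains {pc1 / (pc1 + pc2):.0%} of the variation")  # -> 83%
```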
Final Visualization
1. Rotate the data so that PC1 is the horizontal axis and PC2 is the vertical axis.
2. Plot the samples based on their projections onto PC1 and PC2.
3. If PC1 and PC2 explain most of the variation (e.g., 90%), the resulting 2D plot is a
good representation of the data.
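The whole pipeline (principal components, eigenvalues, loading scores, scree-plot values, and the 2D projection) can be sketched as follows. This assumes scikit-learn and NumPy, and the two-gene toy data are illustrative rather than taken from the example above.

```python
# PCA sketch on toy two-gene data (assumes scikit-learn; data are illustrative).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
gene1 = rng.normal(10, 4, size=20)
gene2 = 0.25 * gene1 + rng.normal(0, 1, size=20)   # correlated with gene1
data = np.column_stack([gene1, gene2])             # rows = samples, columns = genes

pca = PCA(n_components=2)
projections = pca.fit_transform(data)              # samples projected onto PC1 / PC2

print("eigenvalues:         ", pca.explained_variance_)        # variation per PC
print("variation explained: ", pca.explained_variance_ratio_)  # scree-plot values
print("loading scores (PC1):", pca.components_[0])             # the PC1 "recipe"
print("first sample on PC1/PC2:", projections[0])
```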
Summary: K-Means Clustering
K-Means clustering is a powerful and widely used algorithm for grouping data into clusters
based on their similarities. It involves:
Selecting the number of clusters (K).
Assigning data points to clusters.
Iteratively refining cluster centers until a stable solution is reached.
The elbow plot helps determine K if it's not predefined (see the sketch below).
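A minimal sketch of K-Means plus the elbow plot, assuming scikit-learn and matplotlib (tools not named in these notes); the blob data are illustrative.

```python
# K-Means with an elbow plot: total within-cluster variation vs. K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # total within-cluster variation for this K

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Total within-cluster variation (inertia)")
plt.title("Elbow plot: the bend suggests a good K")
plt.show()
```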
Introduction to K-NN
K-Nearest Neighbors (K-NN) is a straightforward algorithm used for classification and
regression tasks. It works by comparing a new, unknown data point to a set of known data
points and assigning it to the category of the nearest neighbors.
For example, if we already have data defining different cell types in tumors, we can use K-NN
to classify an unknown cell based on its similarity to the known data.
How K-NN Works
Step 1: Start with Known Data (Training Data)
Begin with a dataset where the categories (labels) of the data points are already
known.
Example: Cell types from an intestinal tumor, categorized by features such as size,
shape, or gene expression levels.
Step 2: Add a New Data Point
Introduce a data point (e.g., a new cell) whose category is unknown.
The goal is to classify this new point based on its similarity to the existing data.
Step 3: Find the Nearest Neighbors
Calculate the distance between the new point and all the points in the training
dataset.
Euclidean Distance is commonly used (a short sketch of this calculation appears after
the scatterplot example below):
Distance = √((x₂ − x₁)² + (y₂ − y₁)²)
Select the K closest points (neighbors) to the new data point.
Step 4: Assign a Category by Majority Vote
If K = 1: Assign the category of the single closest neighbor.
If K > 1: Count the categories of the K closest neighbors. Assign the category with
the most votes.
o Example: If 7 neighbors are red, 3 are orange, and 1 is green, assign the new
point to red.
Example: K-NN on a Scatterplot
1. Add a New Point:
o Add a new point to the plot. Its position relative to the known points
determines its classification.
2. Choose K:
o If K = 1: Look at the single nearest neighbor. The new point takes on that
category.
o If K = 11: Look at the 11 nearest neighbors. Use a majority vote to decide
the category.
3. Edge Cases:
o If the new point is equidistant between multiple categories:
 Use an odd K to avoid ties.
 If ties persist, decide randomly or leave the point unclassified.
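Here is a minimal sketch of Steps 3 and 4 (distance calculation and majority vote); the points, labels, and value of K are illustrative.

```python
# K-NN by hand: Euclidean distances plus a majority vote (toy data).
import math
from collections import Counter

training_data = [  # (x, y, label)
    (1.0, 2.0, "red"), (1.5, 1.8, "red"), (5.0, 8.0, "green"),
    (6.0, 8.5, "green"), (1.2, 0.5, "orange"), (9.0, 1.0, "orange"),
]
new_point = (1.3, 1.5)
K = 3  # an odd K helps avoid ties

def euclidean(p, q):
    """Distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)."""
    return math.sqrt((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2)

# Step 3: distance from the new point to every training point, nearest first.
neighbors = sorted(training_data, key=lambda row: euclidean(new_point, row[:2]))

# Step 4: majority vote among the K nearest neighbors.
votes = Counter(label for _, _, label in neighbors[:K])
print("predicted category:", votes.most_common(1)[0][0])
```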
K-NN on Heatmaps
Known Data: Heatmaps often represent gene expression or other cell properties.
Cells are grouped into clusters (e.g., light blue, light green) using hierarchical
clustering.
New Point Classification:
o Place the new point on the heatmap.
o If K = 1: The new point adopts the category of the single nearest point.
o If K = 5: Look at the five nearest points. Assign the most frequent
category.
o If K = 11: Use majority voting among the 11 nearest points.
Summary
K-Nearest Neighbors is an intuitive and powerful algorithm for classifying unknown data
points based on their similarity to known points. Steps include:
1. Starting with labeled training data.
2. Introducing a new point to classify.
3. Finding the K nearest neighbors and assigning the category by majority vote.
While simple, K-NN's performance heavily depends on choosing the right K and handling
outliers appropriately. It's versatile and works on scatterplots, heatmaps, or any dataset
where similarity can be calculated.
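For completeness, the same classification can be done with a library. This assumes scikit-learn's KNeighborsClassifier (not named in these notes), and the toy blobs stand in for the labeled cell data.

```python
# K-NN classification with scikit-learn (assumed tooling).
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=150, centers=3, random_state=0)  # toy labeled data
knn = KNeighborsClassifier(n_neighbors=11)  # K = 11, majority vote among neighbors
knn.fit(X, y)
print("predicted category:", knn.predict([[0.0, 2.0]])[0])
```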
o Start a new cluster for any unassigned core points and repeat the process.
5. Identify Outliers:
o Any remaining points that don’t belong to any cluster are classified as
outliers.
Dealing with Ties and Edge Cases
1. Overlapping Clusters:
o A non-core point close to two clusters will be assigned to the first cluster it
encounters.
o Once assigned, a point cannot belong to multiple clusters.
2. Parameter Sensitivity:
o Choosing the right ε and minPts is crucial.
o Small ε: More clusters, but dense clusters might be split.
o Large ε: Fewer clusters, but separate clusters might be merged.
o Small minPts: Sensitive to noise.
o Large minPts: May miss smaller clusters.
DBSCAN Advantages
1. No Need for Predefined Clusters:
o Unlike K-Means, DBSCAN doesn’t require you to specify the number of
clusters.
2. Detects Arbitrary Shapes:
o Handles complex cluster structures, such as nested or elongated clusters.
3. Robust to Outliers:
o Clearly identifies and excludes noise points.
4. Works in High Dimensions:
o Can cluster data in many dimensions, even when visualization is impossible.
DBSCAN Limitations
1. Parameter Selection:
o Poor choices for ε or minPts can lead to incorrect
clustering.
o Requires domain knowledge or experimentation.
2. Scalability:
o Computing distances for all points can be computationally expensive for large
datasets.
Summary
DBSCAN is a versatile and powerful algorithm for clustering data based on density. Steps
include:
1. Counting neighbors within ε for each point.
2. Identifying core, non-core, and outlier points.
3. Growing clusters by connecting core points and including nearby non-core points.
Its ability to handle outliers and nested clusters makes it an excellent choice for many real-
world datasets.
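A minimal sketch of DBSCAN in code, assuming scikit-learn (its eps and min_samples parameters correspond to the ε and minPts discussed above); the two-moons data are illustrative of the arbitrarily shaped clusters DBSCAN handles well.

```python
# DBSCAN sketch: density-based clustering with outlier detection.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)  # elongated, nested shapes

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps ~ ε, min_samples ~ minPts

labels = db.labels_                          # cluster id per point; -1 marks outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_outliers = int(np.sum(labels == -1))
print(f"clusters found: {n_clusters}, outliers: {n_outliers}")
print("core points:", len(db.core_sample_indices_))
```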