
Here's a detailed and structured explanation of the pruning process for decision trees, outlined algorithmically:

Algorithm: Pruning a Regression Tree

1. Purpose of Pruning
1.1. Objective:
 Reduce tree complexity to improve generalization on testing data.
1.2. Key Goal:
 Prevent overfitting the training data while ensuring the tree performs well on testing
data.
1.3. Trade-Off:
 Balance reducing residuals on the training data against keeping the tree simple.

2. Calculate the Sum of Squared Residuals (SSR)


2.1. Definition:
 For each leaf l, calculate the residuals for the observations in that leaf.
 Compute the sum of squared residuals:
\text{SSR} = \sum_{l=1}^{L} \sum_{i \in l} \left( y_i - \hat{y}_l \right)^2
where \hat{y}_l is the mean target value for leaf l.
2.2. Process:
 Start with the full-sized tree.
 For each subtree (with fewer leaves), calculate its SSR.
2.3. Example Values:
 Full tree SSR: 543.8.
 Subtree with 3 leaves: 5494.8.
 Subtree with 2 leaves: 19243.7.
 Subtree with 1 leaf: 28897.2.
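As a small illustration of 2.1–2.2, here is a minimal Python sketch that computes the SSR of one tree from its leaf assignments; the target values and leaf indices are made-up numbers, not the figures from the example above.

```python
import numpy as np

# Hypothetical observations: target values and the index of the leaf each one falls into.
y = np.array([4.0, 5.0, 27.0, 30.0, 100.0, 105.0])
leaf_of_obs = np.array([0, 0, 1, 1, 2, 2])

def ssr(y, leaf_of_obs):
    """Sum of squared residuals: within each leaf, residual = y_i minus the leaf mean."""
    total = 0.0
    for leaf in np.unique(leaf_of_obs):
        y_leaf = y[leaf_of_obs == leaf]
        total += np.sum((y_leaf - y_leaf.mean()) ** 2)
    return total

print(ssr(y, leaf_of_obs))  # SSR of this hypothetical tree
```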

3. Cost-Complexity Pruning
3.1. Definition:
 Combine SSR with a penalty for tree complexity:
\text{Tree Score} = \text{SSR} + \alpha \cdot \text{Leaves}
where α is a tuning parameter controlling the complexity penalty.
3.2. Process:
 For each subtree, calculate the Tree Score using different α values.
 Example:
o α = 10,000:
Tree Score (Full Tree) = 543.8 + 10,000 · 4 = 40,543.8
o Repeat for subtrees with fewer leaves.
3.3. Select Optimal Subtree:
 Choose the subtree with the lowest Tree Score for a given α.
3.4. Varying α:
 As α increases, larger penalties are imposed for complexity, leading to smaller trees.
 Example:
o α = 10,000: The subtree with 3 leaves is optimal (5,494.8 + 10,000 · 3 = 35,494.8 is the lowest Tree Score given the SSR values above).
o α = 22,000: The subtree with 1 leaf is optimal.
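To make the arithmetic in 3.2–3.4 concrete, here is a short sketch that scores each candidate subtree for a few α values using the SSR figures from Section 2.3 and picks the lowest Tree Score per α (the leaf counts are taken from that example; nothing else is assumed):

```python
# Leaves -> SSR, using the example values from Section 2.3 (the full tree has 4 leaves).
subtrees = {4: 543.8, 3: 5494.8, 2: 19243.7, 1: 28897.2}

for alpha in (0, 10_000, 22_000):
    scores = {leaves: ssr + alpha * leaves for leaves, ssr in subtrees.items()}
    best = min(scores, key=scores.get)
    print(f"alpha={alpha:>6}: best subtree has {best} leaves "
          f"(Tree Score = {scores[best]:,.1f})")
```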

4. Pruning Process: Step-by-Step


Step 1: Build Full-Sized Tree
 Use all data to construct the initial tree.
 The full tree has the lowest SSR when α = 0, as no penalty is applied.
Step 2: Generate Subtrees
 Gradually prune the tree by removing leaves and calculating the new SSR for each
subtree.
 Repeat for different α values to build a sequence of subtrees.
Step 3: Cross-Validation
3.1. Split Data:
 Divide the dataset into training and testing sets.
3.2. Evaluate Subtrees:
 Use training data to build the full tree and subtrees for each α.
 Calculate SSR on the testing data for each subtree.
3.3. Repeat with New Splits:
 Perform k-fold cross-validation: repeat the build-and-evaluate step on each fold.
 For each candidate α, average the testing SSR across all folds; the α whose subtree has the lowest average SSR is the best candidate.
Step 4: Select Optimal α
 Identify the α value that minimizes the average SSR on testing data.
 Example: If α = 10,000 results in the lowest SSR, select this value.
Step 5: Select Final Subtree
 Return to the original data.
 Build the subtree corresponding to the optimal α.
 This subtree becomes the final pruned tree.
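If you use scikit-learn, cost-complexity pruning is exposed through DecisionTreeRegressor's ccp_alpha parameter and cost_complexity_pruning_path method; a hedged sketch of Steps 1–5 on synthetic data might look like this (the dataset, tree settings, and 5-fold CV are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)

# Steps 1-2: grow the full tree and get the sequence of effective alphas (one per subtree).
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Steps 3-4: cross-validate each alpha and keep the one with the lowest average test error.
cv_mse = []
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0)
    scores = cross_val_score(tree, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse.append(-scores.mean())
best_alpha = path.ccp_alphas[int(np.argmin(cv_mse))]

# Step 5: refit on all the data with the chosen alpha -> the final pruned tree.
final_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(best_alpha, final_tree.get_n_leaves())
```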

5. Key Observations
5.1. Effect of α:
 Smaller α: More leaves (complex tree).
 Larger α: Fewer leaves (simpler tree).
5.2. Balancing Fit and Complexity:
 Increasing α reduces the tree's fit to training data but improves generalization.
5.3. Optimal α:
 Achieves the lowest SSR on testing data while maintaining sufficient simplicity.

Output
 A pruned regression tree that balances training fit and testing generalization.

Let me know if you'd like further clarification or enhancements! 😊

Here’s the structured, algorithmic breakdown for Handling Missing Data in Random
Forests:
Algorithm: Handling Missing Data in Random Forests

1. Overview
1.1. Types of Missing Data:
 Missing data in the original dataset used for Random Forest training.
 Missing data in new samples that require classification.
1.2. Goal:

 Accurately estimate missing values to maximize model performance.

2. Handling Missing Data in the Original Dataset


Step 1: Initial Guess
2.1. Categorical Variables:
 Guess: Most common value among other samples.
Example: For heart disease data, the most frequent value (e.g., "No") is chosen.
2.2. Numeric Variables:
 Guess: Median value among other samples.

Step 2: Refining the Guess


2.3. Build a Random Forest:
 A Random Forest is constructed using the dataset with initial guesses for missing
values.
2.4. Run Data Through Trees:
 Data is run down all trees in the forest to determine similarity between samples.
2.5. Proximity Matrix:
 Similarity is measured based on leaf nodes where samples converge.
 A proximity matrix is built to record these similarities.
2.6. Update Proximities:
 Proximities are updated as samples traverse the trees.
 Final proximity values are averaged over all trees.
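scikit-learn's Random Forest does not report proximities directly, but one common sketch derives them from shared leaf membership via the forest's apply() method, which is essentially what 2.4–2.6 describe (the toy dataset and forest size below are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# apply() returns, for every sample, the index of the leaf it reaches in each tree.
leaves = forest.apply(X)                      # shape: (n_samples, n_trees)

# Proximity(i, j) = fraction of trees in which samples i and j land in the same leaf.
n = X.shape[0]
proximity = np.zeros((n, n))
for t in range(leaves.shape[1]):
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    proximity += same_leaf
proximity /= leaves.shape[1]
```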

Step 3: Using Proximity Values


2.7. Weighted Frequency for Categorical Variables:
 Calculate weighted frequencies for the possible values of the missing variable:
\text{Weighted Frequency} = \text{Frequency} \times \text{Proximity Weight}
 Example:
o For "Yes": Weighted Frequency = 0.5 × 0.6 = 0.3.
o For "No": Weighted Frequency = 0.8 × 0.75 = 0.6.
2.8. Weighted Average for Numeric Variables:
 Compute a weighted average using proximities:
\text{Weighted Average} = \sum_i \left( \text{Value}_i \times \text{Weight}_i \right)
2.9. Iterative Process:
 Repeat the process of building a forest, refining proximities, and updating guesses
until convergence is achieved.
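Given a proximity matrix like the one sketched above, one refinement pass over a single column (2.7–2.9) could look like the following; the column values, the missing entry, and the 4×4 proximity matrix are all hypothetical:

```python
import numpy as np

def impute_once(values, is_missing, proximity, categorical):
    """One refinement pass: fill missing entries of one column using proximity weights."""
    filled = values.copy()
    for i in np.where(is_missing)[0]:
        donors = ~is_missing                                    # samples whose value is known
        w = proximity[i, donors] / proximity[i, donors].sum()   # normalized proximity weights
        if categorical:
            # Weighted frequency: the candidate value with the largest total weight wins.
            scores = {v: w[values[donors] == v].sum() for v in np.unique(values[donors])}
            filled[i] = max(scores, key=scores.get)
        else:
            # Weighted average of the known values.
            filled[i] = float(np.sum(w * values[donors].astype(float)))
    return filled

# Hypothetical categorical column with one missing entry, plus a 4x4 proximity matrix.
col = np.array(["No", "Yes", "No", "No"], dtype=object)
missing = np.array([False, False, False, True])
prox = np.array([[1.0, 0.1, 0.8, 0.7],
                 [0.1, 1.0, 0.2, 0.1],
                 [0.8, 0.2, 1.0, 0.6],
                 [0.7, 0.1, 0.6, 1.0]])
print(impute_once(col, missing, prox, categorical=True))   # the missing entry becomes "No"
```

In practice this pass is repeated: rebuild the forest with the new guesses, recompute proximities, and impute again until the guesses stop changing.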

3. Alternative Method for Missing Data in New Samples


Step 1: Duplicating Data
3.1. Create two datasets:
 One with the missing value left as is.
 One with an initial guess for the missing value.
3.2. Run both datasets through the Random Forest.

Step 2: Evaluating Options


3.3. Determine the best guess:
 For each missing value, check which option (e.g., "Yes" or "No") is correctly classified
more often.
 Example:
o "Yes" classified correctly in 3 trees.
o "No" classified correctly in 1 tree.
Result: Choose "Yes" as the final guess.
3.4. Fill in the missing value using the chosen option.

4. Proximity Matrix for Visualization


4.1. Distance Matrix:
 Convert the proximity matrix to a distance matrix:
\text{Distance} = 1 - \text{Proximity}
4.2. Visualization:
 Use the distance matrix to create visualizations, such as:
o Heatmaps.
o Multidimensional Scaling (MDS) plots.
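A small self-contained sketch of 4.1–4.2 using scikit-learn's MDS, which accepts a precomputed dissimilarity matrix; the tiny proximity matrix below is hypothetical and stands in for one built as in the earlier sketch:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import MDS

# Hypothetical proximity matrix for four samples (1 on the diagonal, symmetric).
proximity = np.array([[1.0, 0.9, 0.2, 0.1],
                      [0.9, 1.0, 0.3, 0.2],
                      [0.2, 0.3, 1.0, 0.8],
                      [0.1, 0.2, 0.8, 1.0]])

# 4.1: proximities near 1 mean "very similar", so 1 - proximity behaves like a distance.
distance = 1.0 - proximity

# 4.2: embed the distance matrix in 2D and plot it (an MDS plot).
coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(distance)
plt.scatter(coords[:, 0], coords[:, 1])
plt.xlabel("MDS 1"); plt.ylabel("MDS 2")
plt.show()
```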

5. Key Observations
5.1. Iterative Nature:
 The iterative process ensures convergence for both categorical and numeric data.
5.2. Advantages of Proximity Matrix:
 Provides insights into relationships between samples.
 Facilitates advanced visualization techniques.
5.3. Flexibility:
 Handles both training and new data seamlessly.

Conclusion
 Random Forests provide robust techniques for imputing missing data using proximity-
based refinements and weighted calculations.
 Visualization tools like heatmaps and MDS plots offer additional value in analyzing
sample relationships.
 Iterative approaches ensure accurate and consistent estimations, improving overall
model reliability.

Let me know if you'd like further refinements or additions! 😊

Sure thing! Here's the explanation of Principal Component Analysis (PCA) in English,
inspired by the StatQuest video:

Principal Component Analysis (PCA) Explained


1. What is PCA?
PCA is a statistical technique used to reduce the dimensionality of data while preserving as
much of its variability as possible. It transforms data from a high-dimensional space into a
smaller-dimensional space (e.g., 2D or 3D), making it easier to visualize and analyze.
2. Why use PCA?
 To simplify data visualization.
 To identify patterns or clusters in data.
 To determine which variables contribute most to the variation in the data.
 To reduce noise or redundancy in large datasets.

Steps to Perform PCA


Step 1: Start with the Data
Imagine we measure two variables (e.g., Gene 1 and Gene 2) across six samples (e.g.,
mice). The data can be visualized in a 2D graph, where each sample is a point.
Step 2: Calculate the Mean
 For each variable, calculate its average.
 This gives the center of the data.
Step 3: Center the Data
 Subtract the mean from each data point so that the data is centered at the origin
(0,0).
 This step does not change the relative positions of the data points, but it simplifies the
math for later steps.
Step 4: Fit a Line (Principal Component 1)
 PCA tries to fit a line through the origin that best represents the data.
 It optimizes the line by maximizing the sum of squared distances between the
projected points and the origin.
Step 5: Find the First Principal Component (PC1)

 PC1 is the line that explains the largest amount of variation in the data.
 The equation of the line gives a "recipe" for combining the variables, called a linear
combination (e.g., 4 parts Gene 1 + 1 part Gene 2).
Step 6: Find the Second Principal Component (PC2)
 PC2 is the line perpendicular to PC1 that explains the second-largest amount of
variation.
 Like PC1, PC2 has its own "recipe" for combining variables.
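A minimal numpy sketch of Steps 2–6; the Gene 1/Gene 2 measurements are made up, and the singular value decomposition is used as a standard shortcut for finding the best-fitting directions:

```python
import numpy as np

# Step 1: rows = six samples, columns = Gene 1 and Gene 2 (made-up measurements).
data = np.array([[10.0, 6.0], [11.0, 4.0], [8.0, 5.0],
                 [3.0, 3.0], [2.0, 2.8], [1.0, 1.0]])

# Steps 2-3: center the data so it sits around the origin.
centered = data - data.mean(axis=0)

# Steps 4-6: the SVD yields the principal component directions.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
pc1, pc2 = Vt[0], Vt[1]                    # unit vectors = the "recipes" (loading scores)

# Eigenvalues of the covariance matrix = variation explained by each component.
eigenvalues = S**2 / (len(data) - 1)
explained = eigenvalues / eigenvalues.sum()

# Project the samples onto PC1 and PC2 for the final 2D plot.
scores = centered @ Vt.T
print(pc1, explained)
```

scikit-learn's PCA class wraps the same steps and exposes explained_variance_ratio_, which is what a scree plot displays.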

Important Terms
1. Eigenvalues
o Measure the amount of variation a principal component explains.
o For example, if PC1 has an eigenvalue of 15 and PC2 has an eigenvalue of 3,
PC1 explains 83% of the variation (15/18).
2. Eigenvectors
o Represent the direction of the principal components.
o For example, an eigenvector might indicate that PC1 consists of 0.97 parts
Gene 1 and 0.242 parts Gene 2.
3. Loading Scores
o Indicate how much each variable contributes to a principal component.
o Higher scores mean a variable is more important in explaining the variation.
4. Scree Plot
o A graph showing the proportion of variation explained by each principal
component.
o This helps decide how many PCs to keep for analysis.

Final Visualization
1. Rotate the data so that PC1 is the horizontal axis and PC2 is the vertical axis.
2. Plot the samples based on their projections onto PC1 and PC2.
3. If PC1 and PC2 explain most of the variation (e.g., 90%), the resulting 2D plot is a
good representation of the data.

Summary

PCA simplifies complex, high-dimensional data into a lower-dimensional representation while preserving the most important patterns. It identifies:
 The directions (principal components) where the data varies the most.
 How much variation each principal component explains.

Double bam! PCA helps you make sense of your data! 🧬

K-Means Clustering Explained

Introduction: What is K-Means Clustering?


K-Means clustering is an algorithm that groups data into a specific number of clusters (K).
For instance, imagine we have data points from three different tumor types, and we need to
categorize them into three groups. While visually identifying clusters might seem
straightforward in some cases, K-Means uses a computational approach to achieve this
without relying on human observation.

Steps of K-Means Clustering


Step 1: Select the Number of Clusters (K)
 Determine how many clusters you want to identify in your data.
 This is the “K” in K-Means. For example, if you want three clusters, set K = 3.
 There are advanced methods to determine K (explained later), but here we start with
a known value.
Step 2: Randomly Select Initial Cluster Centers

 Choose K distinct data points from your dataset randomly.


 These points act as the initial centers of the clusters.
Step 3: Assign Each Data Point to the Nearest Cluster
 Calculate the distance from each data point to the cluster centers.
 Use Euclidean distance, which is essentially the straight-line distance between two
points.
o Formula for 2D: \text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
o For higher dimensions, just add more squared differences: \sqrt{\sum_{i=1}^n (x_i - c_i)^2}.
 Assign each data point to the cluster with the smallest distance.
Step 4: Calculate New Cluster Centers
 For each cluster, calculate the mean of all points in that cluster.
 The mean becomes the new center of the cluster.
Step 5: Repeat Until the Clusters Stop Changing
 Recalculate distances and reassign points to the nearest cluster.
 Update the cluster centers again.
 Continue this process until no data points change their cluster assignments, or the
changes are minimal.
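Steps 1–5 can be written out directly; the bare-bones sketch below uses made-up 2D points and K = 3 (a production run would normally use sklearn.cluster.KMeans, which also handles multiple random restarts and empty clusters):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 2D data: three loose groups of 20 points each.
points = rng.normal(loc=[[0, 0]] * 20 + [[5, 5]] * 20 + [[0, 5]] * 20, scale=0.5)

K = 3
centers = points[rng.choice(len(points), size=K, replace=False)]   # Step 2: random initial centers

for _ in range(100):                                               # Step 5: iterate until stable
    # Step 3: assign each point to the nearest center (Euclidean distance).
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 4: the new center of each cluster is the mean of its points.
    new_centers = np.array([points[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)
```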

Example: K-Means on a Line


1. Imagine data points plotted along a single line.
2. Set K = 3 (e.g., three tumor types).
3. Randomly select three starting points as cluster centers.
4. Measure the distance of each data point to these centers and assign them to the
nearest cluster.
5. Calculate the mean of each cluster to find the updated centers.
6. Repeat the process until cluster assignments stabilize.

Assessing Clustering Quality


 The quality of clustering is measured using total variation within clusters.
 Total variation = sum of squared distances of all points to their cluster centers.
 Lower variation means better-defined clusters.
 If the clustering is suboptimal, K-Means reinitializes with different random starting
points and repeats the process to find the best solution.

How to Choose K? (Using an Elbow Plot)


 The optimal number of clusters is not always obvious.
 To find K:
1. Try clustering with different values of K (e.g., K = 1, 2, 3, …).
2. Calculate the total variation for each K.
3. Plot the total variation against K.
4. Look for the "elbow" in the graph: the point where adding more clusters no longer significantly reduces the variation.
o This is the ideal value for K (see the sketch below).
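A sketch of the elbow approach with scikit-learn's KMeans, whose inertia_ attribute is the total within-cluster variation described above (the blob data is synthetic, and the range of K values is arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ks = range(1, 9)
variation = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    variation.append(km.inertia_)           # total within-cluster sum of squared distances

plt.plot(ks, variation, marker="o")
plt.xlabel("K"); plt.ylabel("Total variation (inertia)")
plt.show()                                   # the elbow should appear near K = 3 for this data
```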

K-Means in Higher Dimensions


 The process is the same for 2D, 3D, or even higher-dimensional data.
 In higher dimensions, Euclidean distance incorporates all axes (e.g., x, y, z, …).
 Formula: \text{Distance} = \sqrt{x^2 + y^2 + z^2 + \ldots}, where x, y, z, … are the differences between the point and the cluster center along each axis.
 While it’s harder to visualize, the algorithm still works.

K-Means vs. Hierarchical Clustering


 K-Means:
o You specify the number of clusters (K).
o Iteratively refines cluster assignments and centers.
 Hierarchical Clustering:
o Doesn’t require a preset number of clusters.
o Builds a tree (dendrogram) showing how similar points are to each other.

K-Means for Heatmap Data


 Even if your data is represented as a heatmap, the clustering process remains the
same.
 The algorithm calculates distances between samples, regardless of how they are
visualized.

Summary

K-Means clustering is a powerful and widely used algorithm for grouping data into clusters
based on their similarities. It involves:
 Selecting the number of clusters (K).
 Assigning data points to clusters.
 Iteratively refining cluster centers until a stable solution is reached.
 The elbow plot helps determine K if it's not predefined.

Bam! K-Means makes sense of your data! 🎉

K-Nearest Neighbors (K-NN) Algorithm: A Simple Way to Classify Data

Introduction to K-NN
K-Nearest Neighbors (K-NN) is a straightforward algorithm used for classification and
regression tasks. It works by comparing a new, unknown data point to a set of known data
points and assigning it to the category of the nearest neighbors.
For example, if we already have data defining different cell types in tumors, we can use K-NN
to classify an unknown cell based on its similarity to the known data.
How K-NN Works
Step 1: Start with Known Data (Training Data)

 Begin with a dataset where the categories (labels) of the data points are already
known.
 Example: Cell types from an intestinal tumor, categorized by features such as size,
shape, or gene expression levels.
Step 2: Add a New Data Point
 Introduce a data point (e.g., a new cell) whose category is unknown.
 The goal is to classify this new point based on its similarity to the existing data.
Step 3: Find the Nearest Neighbors

 Calculate the distance between the new point and all the points in the training
dataset.
 Euclidean Distance is commonly used: \text{Distance} = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}
 Select the K closest points (neighbors) to the new data point.
Step 4: Assign a Category by Majority Vote
 If K = 1: Assign the category of the single closest neighbor.
 If K > 1: Count the categories of the K closest neighbors. Assign the category with the most votes.
o Example: If 7 neighbors are red, 3 are orange, and 1 is green, assign the new point to red.
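A hedged scikit-learn sketch of Steps 1–4, with synthetic blobs standing in for the labeled cell types and K = 11 chosen arbitrarily:

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Step 1: known (training) data with three categories.
X_known, labels = make_blobs(n_samples=150, centers=3, random_state=0)

# Steps 3-4: fit() stores the training points; predict() finds the K nearest and takes a majority vote.
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_known, labels)

# Step 2: a new, unlabeled point to classify.
new_point = [[0.5, 1.5]]
print(knn.predict(new_point))               # category chosen by the 11 nearest neighbors
```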

Example: K-NN on Scatterplot Data


1. Known Data:
o You have a scatterplot with points categorized into three types: green, red,
and orange.
o This data might have been clustered earlier using PCA (Principal Component
Analysis).
2. New Point Classification:

o Add a new point to the plot. Its position relative to the known points
determines its classification.
o If K = 1: Look at the single nearest neighbor. The new point takes on that category.
o If K = 11: Look at the 11 nearest neighbors. Use a majority vote to decide the category.
3. Edge Cases:
o If the new point is equidistant between multiple categories:
 Use an odd K to avoid ties.
 If ties persist, decide randomly or leave the point unclassified.

K-NN on Heatmaps
 Known Data: Heatmaps often represent gene expression or other cell properties.
Cells are grouped into clusters (e.g., light blue, light green) using hierarchical
clustering.
 New Point Classification:
o Place the new point on the heatmap.
o If K = 1: The new point adopts the category of the single nearest point.
o If K = 5: Look at the five nearest points. Assign the most frequent category.
o If K = 11: Use majority voting among the 11 nearest points.

Choosing a Value for K


 There’s no strict rule for selecting K. The best value depends on the dataset and task.
Tips for Picking K:
1. Test on Training Data:
o Temporarily hide some known data and treat it as "unknown."
o Use K-NN to classify these hidden points and compare predictions to the true
labels.
2. Low K:
o K = 1 or K = 2 can lead to overfitting. These values are sensitive to noise and outliers.
3. High K:
o Larger K values smooth the decision boundaries but risk losing sensitivity to smaller categories.
o Avoid setting K so high that smaller categories are outvoted by larger ones.
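Tip 1 can be automated by holding out part of the labeled data and scoring several K values; a sketch under the same synthetic-data assumption as the earlier K-NN example:

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Temporarily hide some labeled data and treat it as "unknown".
X_train, X_hidden, y_train, y_hidden = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 3, 5, 11, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_hidden, y_hidden)     # fraction of hidden points classified correctly
    print(f"K={k:>3}: accuracy on hidden data = {acc:.2f}")
```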

Key Concepts and Terminology


1. Training Data:
o The dataset where categories are already known. It is used as the reference
for classification.
2. Testing Data:
o New data points classified by the K-NN algorithm based on the training data.
3. Majority Vote:
o The process of deciding a category by tallying the votes from the K nearest
neighbors.
4. Distance Metric:
o Euclidean distance is common, but other metrics (e.g., Manhattan distance,
cosine similarity) can be used depending on the dataset.

Summary

K-Nearest Neighbors is an intuitive and powerful algorithm for classifying unknown data
points based on their similarity to known points. Steps include:
1. Starting with labeled training data.
2. Introducing a new point to classify.
3. Finding the K nearest neighbors and assigning the category by majority vote.
While simple, K-NN's performance heavily depends on choosing the right K and handling
outliers appropriately. It's versatile and works on scatterplots, heatmaps, or any dataset
where similarity can be calculated.

Bam! K-NN makes classifications easier! 🎉

DBSCAN Forming Clusters


1. Start with a Core Point:
o Randomly select a core point. This becomes the seed for the first cluster.
2. Expand the Cluster:
o Add all neighboring core points within ε.
o Continue expanding the cluster by recursively adding core points that are
close to the cluster.
3. Add Non-Core Points:
o Once no more core points can be added, include non-core points that are within ε of any core point in the cluster.
4. Repeat for Remaining Core Points:

o Start a new cluster for any unassigned core points and repeat the process.
5. Identify Outliers:
o Any remaining points that don’t belong to any cluster are classified as
outliers.
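A short sketch with scikit-learn's DBSCAN on synthetic moon-shaped data; ε and minPts correspond to the eps and min_samples parameters, and the values below are arbitrary starting points:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.07, random_state=0)

# eps plays the role of epsilon, min_samples the role of minPts.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("clusters found:", len(set(labels) - {-1}))
print("outliers:", int((labels == -1).sum()))     # label -1 marks the noise points (step 5)
```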
Dealing with Ties and Edge Cases
1. Overlapping Clusters:
o A non-core point close to two clusters will be assigned to the first cluster it
encounters.
o Once assigned, a point cannot belong to multiple clusters.
2. Parameter Sensitivity:
o Choosing the right ε and minPts is crucial.
o Small ε: More clusters, but might split dense clusters.
o Large ε: Fewer clusters, but might merge separate ones.
o Small minPts: Sensitive to noise.
o Large minPts: May miss smaller clusters.

DBSCAN Advantages
1. No Need for Predefined Clusters:
o Unlike K-Means, DBSCAN doesn’t require you to specify the number of
clusters.
2. Detects Arbitrary Shapes:
o Handles complex cluster structures, such as nested or elongated clusters.
3. Robust to Outliers:
o Clearly identifies and excludes noise points.
4. Works in High Dimensions:
o Can cluster data in many dimensions, even when visualization is impossible.

DBSCAN Limitations
1. Parameter Selection:
o Poor choices for ε or minPts can lead to incorrect clustering.
o Requires domain knowledge or experimentation.
2. Scalability:

o Computing distances for all points can be computationally expensive for large
datasets.

Summary
DBSCAN is a versatile and powerful algorithm for clustering data based on density. Steps
include:
1. Counting neighbors within ε for each point.
2. Identifying core, non-core, and outlier points.
3. Growing clusters by connecting core points and including nearby non-core points.
Its ability to handle outliers and nested clusters makes it an excellent choice for many real-
world datasets.

5. Summary: Neural Networks - Backpropagation


1. Backpropagation Steps:
o Perform a forward pass to compute activations.
o Compute the cost function.
o Use the backward pass to calculate gradients of weights and biases.
o Update parameters using gradient descent.
2. Stochastic Gradient Descent:
o Use mini-batches to approximate the gradient, improving efficiency and
speeding up convergence.
3. Key Intuition:
o Backpropagation adjusts weights and biases in proportion to their contribution
to the error. Larger changes are made to parameters with a greater impact on
the output.
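A compact numpy sketch of points 1–2 for a one-hidden-layer network on a toy regression task; the network size, learning rate, and mini-batch size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(256, 2))
y = X[:, :1] * X[:, 1:]                            # toy target: product of the two inputs

# One hidden layer with tanh activation and a linear output.
W1, b1 = rng.normal(0, 0.5, (2, 16)), np.zeros(16)
W2, b2 = rng.normal(0, 0.5, (16, 1)), np.zeros(1)
lr, batch = 0.1, 32

for step in range(2000):
    idx = rng.choice(len(X), size=batch, replace=False)   # stochastic gradient descent: mini-batch
    xb, yb = X[idx], y[idx]

    # Forward pass: compute the activations layer by layer.
    h = np.tanh(xb @ W1 + b1)
    out = h @ W2 + b2
    cost = np.mean((out - yb) ** 2)                       # mean squared error cost

    # Backward pass: gradients of the cost w.r.t. every weight and bias (chain rule).
    d_out = 2 * (out - yb) / batch
    dW2, db2 = h.T @ d_out, d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1 - h**2)                     # derivative of tanh is 1 - tanh^2
    dW1, db1 = xb.T @ d_h, d_h.sum(axis=0)

    # Gradient descent update: each parameter moves opposite to its gradient.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final mini-batch cost: {cost:.4f}")
```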
