ML U3 Notes
Key Concepts of AdaBoost
1. Weak Learners
o These are simple models (like decision stumps) that perform slightly better than random
guessing.
o They are trained in sequence, focusing more on data points that previous models found
hard to classify.
2. Strong Classifier
o This is the final model created by combining the predictions of all weak learners.
o It is powerful and accurate because it uses the collective learning of all the weak
learners.
3. Weighted Voting
4. Error Rate
5. Iterations
o The number of iterations is a key setting; too many can lead to overfitting.
Advantages of AdaBoost
1. Better Accuracy
o Even with simple models, it can significantly improve accuracy by focusing on tough-to-
classify data.
2. Versatile
o Works with many types of base models and can be applied to different problems.
3. Feature Selection
o Automatically picks the most important features, reducing the need for manual feature
selection.
4. Less Overfitting
o Because the weak learners are simple, AdaBoost is often less prone to overfitting than a single complex model.
Disadvantages of AdaBoost
1. Sensitive to Noisy Data
o Noisy data and outliers can mislead AdaBoost because it gives extra weight to misclassified data points.
2. Computationally Expensive
o Training multiple models takes time, especially for large datasets or many iterations.
3. Overfitting Risk
o Running too many iterations, especially on noisy data, can cause the ensemble to overfit.
4. Complex Tuning
o Choosing the right weak learner and settings (like the number of iterations) can be
tricky.
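A minimal sketch of these ideas in Python with scikit-learn's AdaBoostClassifier, using decision stumps as the weak learners (this assumes scikit-learn 1.2 or newer, where the base learner argument is named estimator; the dataset and settings are illustrative, not from the notes):
# A minimal AdaBoost sketch: 50 decision stumps combined by weighted voting.
# The synthetic dataset and hyperparameter values are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=50,                                # number of boosting iterations
    learning_rate=1.0,
    random_state=42,
)
model.fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))
print("First few learner weights (weighted voting):", model.estimator_weights_[:5])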
Bagging
• Bagging, an abbreviation for Bootstrap Aggregating, is a machine learning ensemble strategy for
enhancing the reliability and precision of predictive models.
• It entails generating numerous subsets of the training data by employing random sampling with
replacement.
• These subsets train multiple base learners, such as decision trees, neural networks, or other
models.
Steps in Bagging
1. Dataset Preparation: Prepare your dataset, ensuring it's properly cleaned and preprocessed.
Split it into a training set and a test set.
2. Bootstrap Sampling: Randomly sample from the training dataset with replacement to create
multiple bootstrap samples. Each bootstrap sample should typically have the same size as the
original dataset, but some data points may be repeated while others may be omitted.
3. Model Training: Train a base model (e.g., decision tree, neural network, etc.) on each bootstrap
sample. Each model should be trained independently of the others.
4. Prediction Generation: Use each trained model to make predictions on the test dataset.
5. Combining Predictions: Combine the predictions from all the models. You can use majority
voting to determine the final predicted class for classification tasks. For regression tasks, you can
average the predictions.
6. Evaluation: Evaluate the bagging ensemble's performance on the test dataset using appropriate
metrics (e.g., accuracy, F1 score, mean squared error, etc.).
7. Hyperparameter Tuning: If necessary, tune the hyperparameters of the base model(s) or the
bagging ensemble itself using techniques like cross-validation.
8. Deployment: Once you're satisfied with the performance of the bagging ensemble, deploy it to
make predictions on new, unseen data.
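A minimal sketch of steps 1 to 6 in Python using scikit-learn's BaggingClassifier, which performs the bootstrap sampling and the majority voting internally (again assuming scikit-learn 1.2+ for the estimator argument; the dataset and settings are illustrative):
# A minimal bagging sketch: 50 decision trees, each trained on a bootstrap sample.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1: prepare the data and split it into training and test sets.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Steps 2-3: draw bootstrap samples (sampling with replacement) and train an
# independent decision tree on each one.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=50,
    bootstrap=True,        # sample with replacement
    random_state=0,
)
bagging.fit(X_train, y_train)

# Steps 4-6: predict on the test set (majority vote across trees) and evaluate.
y_pred = bagging.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))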
Advantages
Applications
Write your own
Bagging and sub-bagging are similar. The only difference is that sub-bagging uses random sampling without replacement, whereas bagging uses random sampling with replacement (see the sketch after the table below).
Differences Between Bagging and Subbagging
• Sample Size: In bagging, each subset can have the same size as the original dataset; in subbagging, subsets are usually smaller than the original dataset.
• Data Redundancy: In bagging, data points can appear multiple times in a subset; in subbagging, each data point appears at most once in a subset.
• Best Use Case: Bagging works well with larger datasets and high computational resources; subbagging is suitable for smaller datasets or when resources are limited.
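A tiny NumPy sketch of the sampling difference described above (the array of indices and the sample sizes are illustrative):
# Bagging samples with replacement; sub-bagging samples without replacement.
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)                     # stand-in for the training set indices

bagging_sample = rng.choice(data, size=10, replace=True)     # duplicates allowed
subbagging_sample = rng.choice(data, size=5, replace=False)  # smaller, no duplicates

print(bagging_sample)      # some indices appear twice, others are missing
print(subbagging_sample)   # five distinct indices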
Summary
Stumping
• Stumping is a technique where a decision stump (a very simple model) is used as a base learner
in an ensemble learning method like AdaBoost.
• A decision stump is a decision tree with just one split (or decision point).
• It means the model makes decisions based on a single feature.
Purpose of Stumping:
• A stump is fast to train and only slightly better than random guessing, which is exactly what a boosting algorithm needs from its weak learners.
Use in AdaBoost:
• In AdaBoost, many stumps are created sequentially.
• Each stump focuses on the data points that were misclassified by the previous stumps.
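A minimal sketch of a single decision stump in scikit-learn: a decision tree limited to one split, so it decides using a single feature and threshold (the dataset is illustrative):
# A decision stump is just a decision tree with max_depth=1 (one split).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=1)

stump = DecisionTreeClassifier(max_depth=1).fit(X, y)   # exactly one split
print("Feature used:", stump.tree_.feature[0])          # index of the split feature
print("Threshold:", stump.tree_.threshold[0])           # the single decision point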
Bagging vs Boosting
Differences Between Bagging and Boosting
• Type of Ensemble: Bagging is a parallel ensemble method, where base learners are trained independently; boosting is a sequential ensemble method, where base learners are trained sequentially.
KD Trees
Are KD Trees and KNN the Same?
No, KD Trees and KNN (k-Nearest Neighbors) are not the same, but they are related.
• KNN is an algorithm used for classification or regression, where we find the k-nearest neighbors
of a given data point.
• KD Trees are a data structure used to make finding those neighbors (in KNN) faster, especially in
high-dimensional data.
A KD Tree (K-Dimensional Tree) is a binary tree that organizes points in a space with multiple
dimensions (like 2D or 3D) for fast searching of neighbors.
Key Idea:
• Split the data points into smaller regions, where each region focuses on a specific part of the
dataset.
• At each level, split the data based on one dimension (like x, y, or z) and alternate dimensions at
each level.
o Sort the points by the chosen dimension and find the median.
o Points smaller than the median (on the chosen dimension) go to the left subtree; points larger go to the right subtree.
• Repeat recursively:
o Continue splitting the remaining points in the same way until all points are in leaf nodes.
Steps to Build a KD Tree:
1. Start:
o Begin with the full set of points at depth 0.
2. Choose Splitting Dimension:
o Split dimension = depth mod k, where k is the total number of dimensions.
3. Find Median:
o Sort points along the splitting dimension and choose the median.
4. Create Node:
o Store the median point in the current node.
5. Recursive Calls:
o Build left and right subtrees using points before and after the median.
6. Base Case:
o Stop when a subset has no points left; single points become leaf nodes.
Points:
(3, 6), (2, 7), (17, 15), (6, 12), (13, 15), (9, 1), (10, 19)
Step-by-Step Construction: (Example in Notes)
1. Split on x at the Root:
o Sort the points by x: (2, 7), (3, 6), (6, 12), (9, 1), (10, 19), (13, 15), (17, 15); the median point is (9, 1), which becomes the root.
2. Partition Around the Root:
o Left subtree (points with x < 9): (2, 7), (3, 6), (6, 12)
o Right subtree (points with x > 9): (10, 19), (13, 15), (17, 15)
3. Continue Recursively:
o Repeat the process for each subset, alternating between x and y splits.
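A minimal sketch of this construction in Python (the recursive function below is an illustrative implementation of the alternating median split, applied to the example points above):
def build_kdtree(points, depth=0):
    if not points:
        return None                                 # base case: no points left
    axis = depth % 2                                # split dimension = depth mod k (k = 2)
    points = sorted(points, key=lambda p: p[axis])  # sort along the splitting dimension
    mid = len(points) // 2                          # index of the median point
    return {
        "point": points[mid],                       # the median point is stored in this node
        "left":  build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

points = [(3, 6), (2, 7), (17, 15), (6, 12), (13, 15), (9, 1), (10, 19)]
tree = build_kdtree(points)
print(tree["point"])           # (9, 1)  -> root, split on x = 9
print(tree["left"]["point"])   # (2, 7)  -> root of the left subtree, split on y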
Searching for the Nearest Neighbor
Goal:
• Find the point in the tree that is closest to a given query point without computing the distance to every point.
Steps:
1. Start at the root and compare the query point to the splitting dimension.
2. Move to the left or right subtree based on the query point’s position relative to the current
node.
3. Once you reach a leaf node, calculate the distance to the query point.
4. Backtrack and check the other subtree if necessary (to ensure the closest point isn’t missed).
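A minimal sketch of these search steps (it rebuilds the tree with the same construction as the earlier sketch so the example is self-contained; the function names are illustrative):
import math

def build_kdtree(points, depth=0):
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    if node is None:
        return best
    # Step 3: distance from the query to the point stored at this node.
    if best is None or math.dist(query, node["point"]) < math.dist(query, best):
        best = node["point"]
    axis = node["axis"]
    # Steps 1-2: descend into the side of the split that contains the query.
    near, far = (("left", "right") if query[axis] < node["point"][axis]
                 else ("right", "left"))
    best = nearest(node[near], query, best)
    # Step 4: backtrack into the other subtree only if the splitting plane is
    # closer to the query than the best point found so far.
    if abs(query[axis] - node["point"][axis]) < math.dist(query, best):
        best = nearest(node[far], query, best)
    return best

points = [(3, 6), (2, 7), (17, 15), (6, 12), (13, 15), (9, 1), (10, 19)]
tree = build_kdtree(points)
print(nearest(tree, (12, 14)))   # (13, 15) is the closest point to the query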
Advantages of KD Tree
1. Fast Search: Reduces the number of distance calculations compared to brute-force KNN.
2. Efficient for Low Dimensions: Works well for datasets with a moderate number of dimensions.
Limitations of KD Tree
1. Curse of Dimensionality: Performance decreases as dimensions increase.
2. Uneven Splits: If the data isn’t evenly distributed, the tree may become unbalanced.
Imagine you have GPS data of cities and want to find the city closest to a given location. Instead of
calculating distances for all cities, a KD Tree organizes the cities for fast nearest neighbor searches.
For example:
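A small sketch of that idea using scipy.spatial.KDTree; the city names and coordinates below are made up for illustration, not taken from the notes:
# Build the KD tree once, then answer nearest-city queries quickly.
from scipy.spatial import KDTree

cities = {
    "City A": (12.97, 77.59),
    "City B": (28.61, 77.21),
    "City C": (19.08, 72.88),
}
names = list(cities)
tree = KDTree(list(cities.values()))      # organize the cities in a KD tree

query = (13.08, 80.27)                    # location we want the closest city to
distance, index = tree.query(query)       # fast nearest-neighbor lookup
print(names[index], distance)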