
Bayesian Final

The document compares various boosting algorithms including AdaBoost, Gradient Boosting, XGBoost, LightGBM, and CatBoost, highlighting their origins, core ideas, strengths, and weaknesses. It also discusses regularization techniques (L1, L2, and Elastic Net) and several machine learning models such as Decision Trees, Logistic Regression, SVM, k-NN, Random Forest, and Linear Regression. Additionally, it explains the concepts of bagging and boosting, emphasizing their differences in training methodologies.


Feature               | AdaBoost                                                     | Gradient Boosting
Origin                | Introduced by Yoav Freund & Robert Schapire in 1995          | Introduced by Jerome Friedman in 2001
Core Idea             | Focuses more on misclassified samples by adjusting weights   | Focuses on minimizing a loss function via gradient descent
Error Correction      | Increases the weight of misclassified data points            | Fits new learners to the residual errors
Learner Weighting     | Each learner has a weight based on its accuracy              | Each learner updates the model to minimize loss
Data Weights          | Adjusts sample weights after each round                      | Sample weights usually stay fixed
Default Loss Function | Exponential loss                                             | Flexible (MSE, log-loss, MAE, etc.)
Training Strategy     | Tries to fix what the last learner got wrong (hard examples) | Tries to reduce residuals of the last prediction
Strengths             | Simple, good for clean data                                  | Powerful, customizable, often better accuracy
Weaknesses            | Sensitive to outliers/noise                                  | Can overfit if not tuned well

AdaBoost — Key Concepts

- Base Model: Weak learners (usually decision stumps)
- Focus: Misclassified samples
- Mechanism:
  - After each round, increase the weight of misclassified points.
  - The next model pays more attention to those points.
- Final Prediction: Weighted vote of all learners based on accuracy.
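
A minimal sketch of these ideas with scikit-learn's AdaBoostClassifier, whose default base learner is a decision stump; the synthetic dataset and parameter values are illustrative, not part of the original notes:

```python
# Minimal AdaBoost sketch: decision stumps are the default weak learner in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each round re-weights misclassified samples; each learner's vote is weighted by its accuracy.
model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```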

Gradient Boosting — Key Concepts

- Base Model: Usually decision trees (not necessarily stumps)
- Focus: Minimizes a loss function (like MSE or log-loss)
- Mechanism:
  - After each step, compute residual errors (actual - predicted).
  - Train the next model to predict these residuals.
  - Use gradient descent to iteratively update the model.
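
As a sketch of this mechanism for squared-error loss, where the negative gradient is simply the residual, the loop below hand-rolls gradient boosting on toy data; the learning rate, tree depth, and dataset are illustrative assumptions:

```python
# Hand-rolled gradient boosting for squared-error loss: each new tree fits the current residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)   # toy regression target

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant prediction
trees = []

for _ in range(100):
    residuals = y - prediction                        # negative gradient of MSE = residuals
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                            # the next learner predicts the residuals
    prediction += learning_rate * tree.predict(X)     # small step in the downhill direction
    trees.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))
```
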
🔹 AdaBoost (Adaptive Boosting):

- Focuses on wrong predictions.
- Increases the importance (weight) of misclassified points.
- Each new model tries to correct the mistakes of the previous one.
- Works well with clean data.

🔹 Gradient Boosting:

- Focuses on errors (residuals) instead of weights.
- Each new model tries to reduce the total error by predicting the residuals.
- Uses gradient descent to improve the model step by step.
- More flexible and powerful, but needs careful tuning.

Imagine you're standing on a hill (a graph of the error) and want to reach the lowest point (minimum error).

- You take small steps downhill.
- Each step moves you in the direction where the slope is steepest.
- Eventually, you reach the bottom (minimum loss).

That’s gradient descent!
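
The same idea in a few lines of Python, minimizing a made-up one-dimensional loss f(w) = (w - 3)^2; the starting point and step size are only for illustration:

```python
# Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
def grad(w):
    return 2 * (w - 3)      # slope of the loss at w

w = 0.0                     # start somewhere on the "hill"
learning_rate = 0.1
for step in range(50):
    w -= learning_rate * grad(w)   # small step in the steepest downhill direction

print(w)   # close to 3, the bottom of the hill
```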

XGBoost (Extreme Gradient Boosting)

🧬 Origin:

- Developed by Tianqi Chen as part of his PhD project.
- Released in 2014, became hugely popular after dominating Kaggle competitions.
- Based on gradient boosting but optimized for speed and performance.
- Maintained by DMLC (Distributed Machine Learning Community).

💡 Key Concepts:

1. Gradient Boosting:
   - Builds additive models in a forward stage-wise fashion.
   - Fits new models to correct the residuals of previous models using the gradient of the loss function.
2. Regularized Objective:
   - Adds L1 (Lasso) and L2 (Ridge) regularization to the loss to avoid overfitting:

     \text{Obj} = \sum_i l(y_i, \hat{y}_i^{(t)}) + \sum_k \Omega(f_k)

3. Second-Order Approximation:
   - Uses both first and second derivatives (the Hessian) of the loss function to optimize trees, unlike traditional GBMs that use only gradients.
4. Tree Pruning:
   - Employs a max-depth pruning strategy after building the tree to avoid unnecessary splits (greedy algorithm).
5. Handling Sparse Data:
   - Efficiently manages missing values and sparse data using a learned default direction at each split.
6. Parallelization:
   - Parallel tree construction on a feature-wise basis (column blocks) to boost training speed.
7. Out-of-core Computation:
   - Capable of handling very large datasets that do not fit in memory by using disk.
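
A minimal sketch of these settings with the xgboost Python package (assuming it is installed); the parameter values are illustrative, with reg_alpha and reg_lambda corresponding to the L1 and L2 terms of the regularized objective:

```python
# Minimal XGBoost sketch: regularized gradient boosting with L1/L2 penalties on the trees.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=4,          # depth limit; gamma (minimum split gain) also controls pruning
    learning_rate=0.1,
    reg_alpha=0.1,        # L1 term in the regularized objective (illustrative value)
    reg_lambda=1.0,       # L2 term in the regularized objective (illustrative value)
    n_jobs=-1,            # feature-parallel tree construction
)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```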

2. LightGBM (Light Gradient Boosting Machine)

🧬 Origin:

- Developed by Microsoft Research in 2016.
- Aimed at being faster and more scalable than XGBoost for large datasets.

💡 Key Concepts:

1. Histogram-based Decision Tree Learning:
   - Converts continuous features into discrete bins, which reduces memory usage and speeds up training.
2. Leaf-wise Tree Growth (Best-first):
   - Unlike level-wise growth (as in XGBoost), LightGBM grows trees leaf-wise, choosing the leaf with the highest loss reduction.
   - Can result in deeper, more accurate trees, but may overfit on small datasets.
3. Gradient-based One-Side Sampling (GOSS):
   - Samples data points with large gradients more frequently, since they contribute more to the loss.
   - Reduces the data used per iteration without hurting accuracy.
4. Exclusive Feature Bundling (EFB):
   - Combines mutually exclusive (non-overlapping) features into a single feature to reduce dimensionality.
5. GPU Support:
   - Supports GPU-based training for even faster performance on large datasets.
6. Built-in Support for Categorical Features:
   - Efficient native handling without one-hot encoding.
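
A minimal sketch with the lightgbm package (assuming it is installed); num_leaves and max_bin are the knobs most directly tied to leaf-wise growth and histogram binning, and the values here are illustrative:

```python
# Minimal LightGBM sketch: histogram-based, leaf-wise gradient boosting.
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = lgb.LGBMClassifier(
    n_estimators=300,
    num_leaves=31,        # leaf-wise growth: cap the number of leaves rather than the depth
    learning_rate=0.05,
    max_bin=255,          # number of histogram bins each feature is discretized into
)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```
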
3. CatBoost (Categorical Boosting)

🧬 Origin:

- Developed by Yandex (Russia's Google equivalent) in 2017.
- Designed to be particularly effective on datasets with categorical features.

💡 Key Concepts:

1. Ordered Boosting:
   - Prevents target leakage by using permutations of the dataset when computing residuals.
   - Avoids overfitting by ensuring that predictions for a row are not based on its own target.
2. Efficient Categorical Feature Handling:
   - Converts categorical values into numbers using target statistics, computed in an ordered and smoothed way to prevent overfitting.
   - Avoids one-hot encoding and handles high-cardinality categorical features efficiently.
3. Symmetric Trees (Oblivious Trees):
   - Uses symmetric decision trees, where all nodes at the same depth split on the same feature.
   - Faster inference and highly optimized for CPU/GPU.
4. Minimal Data Preprocessing:
   - Can be used without extensive data preprocessing, handling NaNs and categories natively.
5. Robust to Overfitting:
   - Due to ordered boosting and regularization methods, it is more stable on small datasets than LightGBM.
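
A minimal sketch with the catboost package (assuming it is installed); the tiny DataFrame is made up purely to show that categorical columns are passed directly via cat_features:

```python
# Minimal CatBoost sketch: categorical columns are passed as-is, no one-hot encoding needed.
import pandas as pd
from catboost import CatBoostClassifier

# Tiny made-up dataset with one categorical and one numeric feature.
df = pd.DataFrame({
    "city":  ["london", "paris", "paris", "berlin", "london", "berlin"] * 50,
    "age":   [23, 35, 41, 29, 52, 38] * 50,
    "label": [0, 1, 1, 0, 1, 0] * 50,
})

model = CatBoostClassifier(iterations=200, depth=6, learning_rate=0.1, verbose=0)
# cat_features tells CatBoost which columns to encode with ordered target statistics.
model.fit(df[["city", "age"]], df["label"], cat_features=["city"])
print(model.predict(df[["city", "age"]].head()))
```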

L1 Regularization (Lasso Regression)

L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator),
is a technique used to prevent overfitting and improve model interpretability by shrinking some
model coefficients exactly to zero. This means it effectively performs feature selection by
automatically eliminating less important variables. The penalty term added to the loss function is
the sum of the absolute values of the coefficients. When the regularization strength is
increased, more coefficients are pushed to zero, resulting in a sparse model that uses only a
subset of features. This is particularly useful in high-dimensional datasets where many features
may be irrelevant.
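
A small sketch with scikit-learn's Lasso showing how a larger alpha (regularization strength) yields a sparser model; the synthetic dataset and alpha values are illustrative:

```python
# Lasso sketch: increasing alpha pushes more coefficients exactly to zero (feature selection).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

for alpha in (0.1, 1.0, 10.0):
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    print(f"alpha={alpha}: {np.sum(model.coef_ == 0)} of 20 coefficients are exactly zero")
```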

L2 Regularization (Ridge Regression)

L2 regularization, or Ridge regression, is another method to combat overfitting by adding a penalty to the loss function — this time the sum of the squares of the coefficients. Unlike L1,
Ridge does not shrink coefficients to zero, but rather reduces their magnitude, keeping all
features in the model. This helps to stabilize the model, especially when there is
multicollinearity (i.e., when independent variables are highly correlated). Ridge regression is
useful when you want to retain all features but still want to prevent the model from fitting noise
or overly complex patterns in the data.
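
A matching sketch with scikit-learn's Ridge: coefficients shrink as alpha grows, but none are driven exactly to zero (same illustrative setup as the Lasso example above):

```python
# Ridge sketch: coefficients shrink toward zero as alpha grows, but are not eliminated.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

for alpha in (0.1, 10.0, 1000.0):
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: largest |coefficient| = {np.max(np.abs(model.coef_)):.2f}, "
          f"coefficients equal to zero = {np.sum(model.coef_ == 0)}")
```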

L1 + L2 Regularization (Elastic Net)

Elastic Net combines the strengths of both Lasso and Ridge by using a mix of L1 and L2
penalties. It adds both the absolute values and the squares of the coefficients to the loss function.
This allows Elastic Net to perform feature selection like Lasso and also stabilize the model like
Ridge. Elastic Net is particularly useful when you have many correlated features, where Lasso
might randomly pick one and ignore the rest. By blending the two approaches, Elastic Net offers
a more balanced regularization that can produce better generalization performance in many real-
world applications.
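
A short sketch with scikit-learn's ElasticNet, where l1_ratio controls the blend between the L1 and L2 penalties; the values are illustrative:

```python
# Elastic Net sketch: l1_ratio blends the L1 (sparsity) and L2 (stability) penalties.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0)

# l1_ratio=1.0 behaves like Lasso, l1_ratio close to 0 behaves like Ridge; 0.5 mixes both.
model = ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10_000).fit(X, y)
print("Non-zero coefficients:", int((model.coef_ != 0).sum()))
```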

Decision Tree

A Decision Tree is a supervised learning model used for both classification and regression tasks.
It works by splitting the data into branches based on feature values to create a tree-like structure.
It is moderately complex and highly interpretable, making it easy to visualize and understand.
Training and prediction are generally fast, but the model is prone to overfitting, especially with
deep trees. It handles non-linear relationships well and does not require feature scaling.
However, it is sensitive to outliers and missing data and performs poorly in those conditions
unless preprocessing is applied. It is suitable for tasks like medical diagnosis or loan approval,
and requires minimal hyperparameter tuning (e.g., tree depth, splitting criteria).
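
A minimal scikit-learn sketch (dataset and max_depth chosen only for illustration) showing how a shallow tree stays interpretable:

```python
# Decision tree sketch: limiting depth is the simplest guard against overfitting,
# and a shallow tree can be printed as readable if/else rules.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree))   # the fitted tree as plain-text splitting rules
```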

Logistic Regression

Logistic Regression is a simple and fast algorithm used primarily for binary classification,
though it can be extended to multi-class problems. Despite its name, it’s a classification model
that outputs probabilities through the sigmoid function, which are then converted to class
labels. It assumes a linear relationship between input features and the log-odds of the outcome.
It is highly interpretable and computationally efficient, making it ideal for large datasets and
real-time systems. However, it does not handle non-linear relationships unless features are
transformed (e.g., with polynomials). It is sensitive to outliers and requires careful handling of
multicollinearity. Common use cases include spam detection and customer churn prediction.
Support Vector Machine (SVM)

SVM is a powerful supervised learning algorithm used mainly for classification but also
adaptable to regression (SVR). It works by finding the optimal hyperplane that maximizes the
margin between different classes. It is high in complexity and less interpretable compared to
simpler models. SVM performs well in high-dimensional spaces, especially with the kernel
trick, which allows it to model non-linear decision boundaries. However, it is
computationally expensive and does not scale well to very large datasets. It requires feature
scaling and careful tuning of parameters like the kernel type, regularization parameter (C), and
gamma. SVM is commonly used in applications like image classification and face recognition.

k-Nearest Neighbors (k-NN)

k-NN is a simple, instance-based algorithm used for both classification and regression. It has
no training phase — instead, it makes predictions based on the majority vote or average of the
k-nearest data points in the training set. It is easy to understand but slow at prediction time,
especially on large datasets, because it must compute distances to all training instances. It
handles non-linear patterns well but is very sensitive to irrelevant features and outliers, and
requires feature scaling. It is best suited for small datasets and applications like
recommendation systems and handwriting recognition. Hyperparameter tuning is important
for choosing the optimal value of k and distance metric.
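
A minimal scikit-learn sketch that scales the features and tunes k by cross-validation; the candidate values of k are illustrative:

```python
# k-NN sketch: scale the features, then pick k by cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
search = GridSearchCV(pipe, {"knn__n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
search.fit(X, y)
print("Best k:", search.best_params_["knn__n_neighbors"])
print("Cross-validated accuracy:", search.best_score_)
```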

Random Forest

Random Forest is an ensemble method that combines many Decision Trees using Bagging
(Bootstrap Aggregation). It is used for both classification and regression, and is highly
accurate and robust due to the averaging (or voting) of multiple trees. This reduces overfitting,
a common problem in individual decision trees. While not as interpretable as a single tree,
Random Forests can still provide insights like feature importance. They handle non-linearity
well, are not sensitive to outliers, and require little to no feature scaling. Random Forest is
scalable, and although training may be slower due to multiple trees, predictions are fairly
efficient. It is commonly used in fields like credit scoring, fraud detection, and stock market
prediction.
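
A minimal scikit-learn sketch (illustrative dataset and settings) showing accuracy and the feature-importance diagnostic:

```python
# Random forest sketch: many bagged trees, with feature importances as a built-in diagnostic.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)

print("Cross-validated accuracy:", cross_val_score(model, data.data, data.target, cv=5).mean())

model.fit(data.data, data.target)
top = np.argsort(model.feature_importances_)[::-1][:5]
print("Most important features:", [data.feature_names[i] for i in top])
```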

Linear Regression

Linear Regression is a fundamental model used for predicting continuous values. It assumes a
linear relationship between the input features and the target variable. It is very fast to train
and predict, and offers excellent interpretability through its coefficients. However, it makes
strong assumptions: linearity, independence of errors, constant variance (homoscedasticity),
and normally distributed residuals. It is sensitive to outliers and multicollinearity, and
performs poorly on non-linear data unless transformed. It works best on small to medium-
sized datasets with a clear linear trend. Common use cases include house price prediction,
salary estimation, and sales forecasting. Feature scaling is sometimes necessary, and
overfitting may occur if irrelevant variables are included.
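
A minimal scikit-learn sketch on synthetic data showing the fitted coefficients and R^2; all values are illustrative:

```python
# Linear regression sketch: coefficients are directly interpretable effect sizes.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("Coefficients:", model.coef_)      # change in target per unit change in each feature
print("Intercept:", model.intercept_)
print("R^2 on test data:", model.score(X_test, y_test))
```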

Bagging involves training multiple models (like decision trees) on different bootstrapped
datasets. Since these models are independent of each other, they can be trained at the same
time — i.e., in parallel.

🔁 Steps Where Parallelism Happens:

1. Bootstrapping:
   - Create several datasets by randomly sampling (with replacement) from the original data.
   - These datasets can be generated simultaneously.
2. Model Training:
   - Each model (e.g., a decision tree) is trained on its own bootstrapped dataset.
   - Since these models don't depend on each other, they can be trained simultaneously on multiple CPU cores or machines.
3. Prediction:
   - Each trained model gives a prediction on the test data.
   - These predictions can also be generated in parallel, then aggregated (voted or averaged).
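
A minimal sketch of parallel bagging with scikit-learn's BaggingClassifier, where n_jobs=-1 trains the bootstrapped trees on all available cores; the dataset and settings are illustrative:

```python
# Bagging sketch: independent bootstrapped trees trained in parallel across CPU cores.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_jobs=-1 fits the 100 bootstrapped trees (and runs prediction) on all available cores.
model = BaggingClassifier(n_estimators=100, n_jobs=-1, random_state=0)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```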

⚙️ Why Is This Useful?

- 🚀 Faster training: If you use a multi-core CPU or GPU cluster, all trees (or models) can be trained together.
- 🔧 Scalable: Works well on big data or real-time systems.
- 🔄 Efficient: No need to wait for one model to finish before starting the next.

🔁 Contrast With Boosting:

Unlike bagging, boosting works sequentially:

- Each new model corrects the errors of the previous one.
- So you can't train boosting models in parallel — they depend on earlier steps.
