Bayesian Final
🔹 Gradient Boosting:
Imagine you're standing on a hill (the graph of the error) and want to reach the lowest point (the minimum
error). Each boosting step moves you a little further downhill, in the direction in which the error falls fastest.
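That downhill walk can be written as a tiny, purely illustrative gradient-descent loop; the error curve, starting point, and learning rate below are made up for the example and are not from the notes above.

```python
# Purely illustrative: error(w) = (w - 3)**2 has its lowest point at w = 3.
def gradient(w):
    return 2 * (w - 3)                 # slope of the error curve at w

w, learning_rate = 0.0, 0.1            # start somewhere on the hill, pick a step size
for _ in range(50):
    w -= learning_rate * gradient(w)   # take a step downhill, against the gradient
print(round(w, 4))                     # ends up close to 3.0, the bottom of the hill
```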
💡 Key Concepts (XGBoost):
1. Gradient Boosting:
o Builds additive models in a forward stage-wise fashion.
o Fits new models to correct residuals of previous models using the gradient of the
loss function.
2. Regularized Objective:
o Adds L1 (Lasso) and L2 (Ridge) regularization to the loss to avoid overfitting.
\text{Obj} = \sum_i l\bigl(y_i, \hat{y}_i^{(t)}\bigr) + \sum_k \Omega(f_k)
3. Second-Order Approximation:
o Uses both first and second derivatives (Hessian) of the loss function to optimize
trees (unlike traditional GBMs that use only gradients).
4. Tree Pruning:
o Grows each tree to a maximum depth, then prunes back splits with negative
gain (post-pruning), instead of greedily stopping at the first unpromising split.
5. Handling Sparse Data:
o Efficiently manages missing values and sparse data by learning a default
direction at each split (see the sketch after this list).
6. Parallelization:
o Parallel tree construction on a feature-wise basis (column block) to boost training
speed.
7. Out-of-core computation:
o Capable of handling very large datasets that do not fit in memory by using disk.
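As a hedged illustration of how these ideas surface in practice, here is a minimal sketch using the xgboost scikit-learn wrapper; the synthetic data and every parameter value are placeholders, not tuned recommendations.

```python
# Minimal XGBoost sketch (assumes `pip install xgboost scikit-learn`).
import numpy as np
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X[::50, 0] = np.nan                      # missing values follow a learned default direction at splits
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBClassifier(
    n_estimators=200,      # number of additive boosting stages
    max_depth=4,           # trees grow to this depth, then negative-gain splits are pruned
    learning_rate=0.1,
    reg_alpha=0.1,         # L1 part of Omega(f_k)
    reg_lambda=1.0,        # L2 part of Omega(f_k)
    n_jobs=-1,             # parallel, column-wise tree construction
)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```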
💡 Key Concepts (CatBoost):
1. Ordered Boosting:
o Prevents target leakage by using permutations of the dataset when computing
residuals.
o Avoids overfitting by ensuring that predictions for a row are not based on its own
target.
2. Efficient Categorical Feature Handling:
o Converts categorical values into numbers using target statistics, but in an
ordered and smoothed way to prevent overfitting.
o Avoids one-hot encoding and handles high-cardinality categorical features
efficiently.
3. Symmetric Trees (Oblivious Trees):
o Uses symmetric decision trees, where all nodes at the same depth split on the
same feature.
o Faster inference and highly optimized for CPU/GPU.
4. Minimal Data Preprocessing:
o Can be used without extensive data preprocessing, handling NaNs and
categorical features natively (see the sketch after this list).
5. Robust to Overfitting:
o Due to ordered boosting and regularization methods, it's more stable on small
datasets than LightGBM.
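A minimal CatBoost sketch along the same lines; the tiny dataset is invented and the parameter values are illustrative only. Categorical columns are passed by index via cat_features, so no one-hot encoding is required.

```python
# Minimal CatBoost sketch (assumes `pip install catboost`).
from catboost import CatBoostClassifier, Pool

# Tiny made-up dataset: one categorical column ("city") and one numeric column.
X = [["london", 25], ["paris", 31], ["london", 47], ["berlin", 19],
     ["paris", 52], ["berlin", 33], ["london", 61], ["paris", 28]]
y = [0, 1, 0, 1, 1, 0, 0, 1]

train_pool = Pool(X, y, cat_features=[0])   # column 0 is categorical: ordered target statistics, no one-hot

model = CatBoostClassifier(
    iterations=100,
    depth=4,              # oblivious (symmetric) trees: one split feature per depth level
    learning_rate=0.1,
    verbose=False,
)
model.fit(train_pool)
print(model.predict(Pool([["berlin", 40]], cat_features=[0])))
```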
L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator),
is a technique used to prevent overfitting and improve model interpretability by shrinking some
model coefficients exactly to zero. This means it effectively performs feature selection by
automatically eliminating less important variables. The penalty term added to the loss function is
the sum of the absolute values of the coefficients. When the regularization strength is
increased, more coefficients are pushed to zero, resulting in a sparse model that uses only a
subset of features. This is particularly useful in high-dimensional datasets where many features
may be irrelevant.
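A small illustrative sketch of that sparsity effect with scikit-learn's Lasso, on synthetic data where only a few features actually matter; the alpha value is arbitrary.

```python
# Lasso (L1) sketch: most coefficients are driven exactly to zero.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression

# 50 features, but only 5 influence the target.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)        # alpha is the L1 regularization strength
lasso.fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))   # far fewer than 50
```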
Elastic Net combines the strengths of both Lasso and Ridge by using a mix of L1 and L2
penalties. It adds both the absolute values and the squares of the coefficients to the loss function.
This allows Elastic Net to perform feature selection like Lasso and also stabilize the model like
Ridge. Elastic Net is particularly useful when you have many correlated features, where Lasso
might randomly pick one and ignore the rest. By blending the two approaches, Elastic Net offers
a more balanced regularization that can produce better generalization performance in many real-
world applications.
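A companion sketch with scikit-learn's ElasticNet, where l1_ratio controls the blend between the L1 and L2 penalties; the values are chosen only for illustration.

```python
# Elastic Net sketch: l1_ratio blends the L1 (Lasso) and L2 (Ridge) penalties.
from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=50, n_informative=5, noise=5.0, random_state=0)

enet = ElasticNet(alpha=1.0, l1_ratio=0.5)   # 0.5 = equal mix; 1.0 is pure Lasso, 0.0 pure Ridge
enet.fit(X, y)
print("non-zero coefficients:", sum(c != 0 for c in enet.coef_))
```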
Decision Tree
A Decision Tree is a supervised learning model used for both classification and regression tasks.
It works by splitting the data into branches based on feature values to create a tree-like structure.
It is moderately complex and highly interpretable, making it easy to visualize and understand.
Training and prediction are generally fast, but the model is prone to overfitting, especially with
deep trees. It handles non-linear relationships well and does not require feature scaling.
However, it is sensitive to outliers and missing data and performs poorly in those conditions
unless preprocessing is applied. It is suitable for tasks like medical diagnosis or loan approval,
and requires minimal hyperparameter tuning (e.g., tree depth, splitting criteria).
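A minimal scikit-learn sketch, with an arbitrary max_depth as the main overfitting control; the built-in dataset is only a convenient example.

```python
# Decision tree sketch: max_depth limits tree growth to curb overfitting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, criterion="gini", random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))
print(export_text(tree, max_depth=2))   # the fitted tree is directly readable, hence interpretable
```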
Logistic Regression
Logistic Regression is a simple and fast algorithm used primarily for binary classification,
though it can be extended to multi-class problems. Despite its name, it’s a classification model
that outputs probabilities through the sigmoid function, which are then converted to class
labels. It assumes a linear relationship between input features and the log-odds of the outcome.
It is highly interpretable and computationally efficient, making it ideal for large datasets and
real-time systems. However, it does not handle non-linear relationships unless features are
transformed (e.g., with polynomials). It is sensitive to outliers and requires careful handling of
multicollinearity. Common use cases include spam detection and customer churn prediction.
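A short sketch with scikit-learn's LogisticRegression, showing the probability outputs; the scaling step and max_iter are incidental choices, not requirements from the text.

```python
# Logistic regression sketch: predict_proba returns sigmoid-based class probabilities.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))   # probabilities per class; predict() thresholds them into labels
```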
Support Vector Machine (SVM)
SVM is a powerful supervised learning algorithm used mainly for classification but also
adaptable to regression (SVR). It works by finding the optimal hyperplane that maximizes the
margin between different classes. It is high in complexity and less interpretable compared to
simpler models. SVM performs well in high-dimensional spaces, especially with the kernel
trick, which allows it to model non-linear decision boundaries. However, it is
computationally expensive and does not scale well to very large datasets. It requires feature
scaling and careful tuning of parameters like the kernel type, regularization parameter (C), and
gamma. SVM is commonly used in applications like image classification and face recognition.
k-Nearest Neighbors (k-NN)
k-NN is a simple, instance-based algorithm used for both classification and regression. It has
no training phase — instead, it makes predictions based on the majority vote or average of the
k-nearest data points in the training set. It is easy to understand but slow at prediction time,
especially on large datasets, because it must compute distances to all training instances. It
handles non-linear patterns well but is very sensitive to irrelevant features and outliers, and it
requires feature scaling. It is best suited for small datasets and applications like
recommendation systems and handwriting recognition. Hyperparameter tuning is important
for choosing the optimal value of k and distance metric.
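A brief scikit-learn sketch; k = 5 and the Euclidean metric are arbitrary starting points that would normally be tuned.

```python
# k-NN sketch: scaling matters because predictions are distance-based; k and the metric are tunable.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5, metric="euclidean"))
knn.fit(X_train, y_train)           # "training" just stores the data
print(knn.score(X_test, y_test))    # each prediction compares against all stored points
```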
Random Forest
Random Forest is an ensemble method that combines many Decision Trees using Bagging
(Bootstrap Aggregation). It is used for both classification and regression, and is highly
accurate and robust due to the averaging (or voting) of multiple trees. This reduces overfitting,
a common problem in individual decision trees. While not as interpretable as a single tree,
Random Forests can still provide insights like feature importance. They handle non-linearity
well, are not sensitive to outliers, and require little to no feature scaling. Random Forest is
scalable, and although training may be slower due to multiple trees, predictions are fairly
efficient. It is commonly used in fields like credit scoring, fraud detection, and stock market
prediction.
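A minimal RandomForestClassifier sketch, including the feature-importance readout mentioned above; the number of trees is an arbitrary choice.

```python
# Random forest sketch: many bagged trees, with feature importances for interpretability.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))
print(rf.feature_importances_[:5])   # impurity-based importance averaged across trees
```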
Linear Regression
Linear Regression is a fundamental model used for predicting continuous values. It assumes a
linear relationship between the input features and the target variable. It is very fast to train
and predict, and offers excellent interpretability through its coefficients. However, it makes
strong assumptions: linearity, independence of errors, constant variance (homoscedasticity),
and normally distributed residuals. It is sensitive to outliers and multicollinearity, and
performs poorly on non-linear data unless transformed. It works best on small to medium-
sized datasets with a clear linear trend. Common use cases include house price prediction,
salary estimation, and sales forecasting. Feature scaling is sometimes necessary, and
overfitting may occur if irrelevant variables are included.
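A short LinearRegression sketch on synthetic data, reading off the fitted coefficients and intercept; the dataset sizes and noise level are illustrative.

```python
# Linear regression sketch: coefficients are directly interpretable as per-feature effects.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=3, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression()
lr.fit(X_train, y_train)
print(lr.coef_, lr.intercept_)       # slope per feature plus the intercept
print(lr.score(X_test, y_test))      # R^2 on held-out data
```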
Bagging
Bagging involves training multiple models (like decision trees) on different bootstrapped
datasets. Since these models are independent of each other, they can be trained at the same
time, i.e., in parallel (a short sketch appears after the summary points below).
1. Bootstrapping:
o You create several datasets by randomly sampling from the original data.
o These datasets can be generated simultaneously.
2. Model Training:
o Each model (e.g., decision tree) is trained on its own bootstrapped dataset.
o Since these models don’t depend on each other, they can be trained
simultaneously on multiple CPU cores or machines.
3. Prediction:
o Each trained model gives a prediction on test data.
o These predictions can also be generated in parallel, then aggregated (voted or
averaged).
🚀 Faster training: If you use a multi-core CPU or GPU cluster, all trees (or models)
can be trained together.
🔧 Scalable: Works well on big data or real-time systems.
🔄 Efficient: No need to wait for one model to finish before starting the next.
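A minimal bagging sketch with scikit-learn's BaggingClassifier; n_jobs=-1 is what lets the independent trees train (and predict) in parallel across CPU cores. The base estimator, tree count, and dataset are illustrative choices only.

```python
# Bagging sketch: independent trees on bootstrapped samples, trained in parallel via n_jobs.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),   # base learner; each copy sees its own bootstrap sample
    n_estimators=50,
    bootstrap=True,             # sample with replacement (bootstrapping)
    n_jobs=-1,                  # fit and predict with all estimators in parallel
    random_state=0,
)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))    # predictions are aggregated by majority vote
```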