Session 2 on Discretization - Binning Notes
1. Reduces Overfitting
By converting continuous variables into discrete bins, discretization effectively simplifies
the feature space. This simplification means the model has fewer nuances to learn from
the training data. While this might lead to a loss in detail or granularity, it also means
there's less chance for the model to learn noise or overly complex patterns that don't
generalize well to unseen data.
Discretization acts as a form of regularization, imposing a constraint on the model's
complexity. By reducing the number of unique values a feature can take, it limits the
model's ability to fit the training data too closely.
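The regularization effect described above can be sketched in NumPy. The income feature and the bin count here are invented for illustration: the point is simply that binning collapses roughly a thousand distinct raw values into five labels, so the model can no longer fit each individual value.

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.uniform(20_000, 200_000, size=1_000)  # hypothetical continuous feature

# Five equal-width bins: every raw value collapses to one of 5 labels.
edges = np.linspace(income.min(), income.max(), 6)
bins = np.digitize(income, edges[1:-1])  # bin labels 0..4

print(len(np.unique(income)), "raw values ->", len(np.unique(bins)), "binned values")
```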
2. Captures Non-linear Relationships
Linear models inherently assume a linear relationship between features and the target
variable. Discretization allows these models to approximate non-linear relationships:
one-hot encoding the bins fits a separate coefficient to each bin, producing a step
function that can collectively trace a non-linear trend.
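A minimal sketch of this idea, with a made-up sine-shaped target and an arbitrary choice of 8 bins: an ordinary least-squares fit on one-hot bin indicators learns one level per bin, and the resulting step function follows a curve that no single slope could.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 2 * np.pi, 300)
y = np.sin(x) + rng.normal(0, 0.1, 300)  # clearly non-linear relationship

# Discretize x into 8 equal-width bins and one-hot encode the bin index.
n_bins = 8
edges = np.linspace(0, 2 * np.pi, n_bins + 1)
bins = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
onehot = np.eye(n_bins)[bins]

# A plain linear model on the indicators learns one level per bin:
# a step function that tracks the sine curve.
coef, *_ = np.linalg.lstsq(onehot, y, rcond=None)
pred = onehot @ coef
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(round(r2, 3))  # high R^2 despite using a linear model
```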
3. Handling Outliers
When you discretize the data, you group continuous values, outliers included, into bins
based on their range. An outlier's impact is diluted because it's grouped with other
values in the same bin, reducing its ability to disproportionately influence the analysis. Essentially,
within each bin, the data points are treated equivalently, regardless of their specific
values.
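For instance (values invented), a wild data-entry outlier simply lands in the top bin alongside ordinary large values, instead of dominating a mean or a distance computation:

```python
import numpy as np

ages = np.array([22, 25, 31, 38, 44, 52, 58, 63, 950])  # 950 is a data-entry outlier

# Fixed edges from the bulk of the data; the outlier just falls into the
# highest bin and is treated like any other value in that bin.
edges = np.array([30, 45, 60])
bins = np.digitize(ages, edges)  # 0: <30, 1: 30-44, 2: 45-59, 3: >=60

print(bins)  # [0 0 1 1 1 2 2 3 3] -- 950 gets the same label as 63
```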
4. Better Interpretability
By grouping continuous data into bins, each bin can be treated as a distinct category with
its own effect on the model's predictions. This categorical interpretation allows for
straightforward explanations, such as "being in age group 30-40 increases the likelihood
of buying a new car compared to age group 20-30," which is more intuitive than
interpreting the effect of a one-year increase in age.
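The age-group example can be sketched with NumPy (the ages and labels are invented for illustration); a model coefficient per group then reads directly as "being in 30-40 ...":

```python
import numpy as np

ages = np.array([23, 35, 28, 41, 37])
labels = np.array(["20-30", "30-40", "40-50"])

# Map each age to a human-readable group label via its bin index.
groups = labels[np.digitize(ages, [30, 40])]
print(list(groups))  # ['20-30', '30-40', '20-30', '40-50', '30-40']
```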
5. Model Compatibility
Discretization works particularly well with certain algorithms because it transforms
continuous variables into discrete ones, which can align better with the way these
algorithms process and interpret data. The effectiveness of discretization largely depends
on the nature of the algorithm, the specific data being analysed, and the problem being
solved.
1. Decision Trees:
○ Algorithms like decision trees (and by extension, ensemble methods like Random
Forests and Gradient Boosting Machines) inherently split data into branches based
on conditions. Discretization can make these splits more meaningful, especially if the
continuous data does not have a clear linear relationship with the target variable.
Pre-discretized features can lead to simpler trees that are easier to interpret and
possibly more generalizable.
2. Naive Bayes:
○ Naive Bayes classifiers, particularly in their basic forms, assume that features are
independent and often deal better with categorical data. Discretization can help
when applying Naive Bayes to continuous data by matching its assumption of category-based
probabilities, potentially improving model performance and interpretability.
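A toy sketch of why binned features suit Naive Bayes (the data and bin labels are made up): with discrete bins, each class-conditional likelihood is just a count ratio, exactly the categorical form the model assumes. Real implementations also add Laplace smoothing, omitted here for brevity.

```python
from collections import Counter

# Hypothetical training data: (age_bin, bought_car) pairs.
data = [("20-30", 0), ("20-30", 0), ("30-40", 1), ("30-40", 1),
        ("30-40", 0), ("40-50", 1), ("20-30", 0), ("40-50", 1)]

# Count how often each bin occurs within each class.
per_class = {0: Counter(), 1: Counter()}
for age_bin, y in data:
    per_class[y][age_bin] += 1

def likelihood(age_bin, y):
    """P(age_bin | class=y), estimated directly from bin counts."""
    counts = per_class[y]
    return counts[age_bin] / sum(counts.values())

print(likelihood("30-40", 1))  # 0.5: half the buyers fall in the 30-40 bin
```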
Disadvantages of Discretization
16 February 2024 17:31
1. Loss of information: fine-grained differences between values inside the same bin are erased
2. Model Incompatibility: some models (e.g. linear regression on a genuinely linear feature) perform better on the raw continuous values
3. Difficulty in choosing bin size: too few bins discard signal, while too many bins reintroduce the overfitting that discretization was meant to reduce
Types of Discretization
16 February 2024 16:54
1. Custom Binning
15 February 2024 06:57
Custom binning, also known as domain binning, is a data pre-processing technique where the
bins are defined based on domain knowledge, specific criteria, or predefined thresholds rather
than through an automated or algorithmic process. This method allows for the creation of bins
that have meaningful interpretations in the context of the specific problem domain or analysis
goals.
Examples
1. Tax Slabs
2. Credit Score for Loan Eligibility
3. Healthcare - BMI Indexing
4. Educational Grading System
5. Air Quality Reporting
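The BMI example above can be sketched as custom binning in NumPy: the edges come from WHO-style BMI thresholds (domain knowledge), not from the data. The sample BMI values are invented.

```python
import numpy as np

bmi = np.array([17.2, 22.5, 27.8, 31.4, 24.9])

# Domain-defined thresholds: underweight <18.5, normal <25, overweight <30, obese >=30.
edges = [18.5, 25.0, 30.0]
labels = np.array(["underweight", "normal", "overweight", "obese"])

cats = labels[np.digitize(bmi, edges)]
print(list(cats))  # ['underweight', 'normal', 'overweight', 'obese', 'normal']
```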
2. Uniform Binning
15 February 2024 06:56
Advantages:
1. Simple
2. Uniform Coverage
When to use:
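Uniform (equal-width) binning can be sketched in a few lines of NumPy; the sample values and the choice of 4 bins are arbitrary. Every bin spans exactly the same width of the value range, regardless of how the data is distributed.

```python
import numpy as np

values = np.array([3, 7, 12, 18, 25, 31, 44, 50])

# Four equal-width bins over [min, max]; each bin is (50 - 3) / 4 = 11.75 wide.
n_bins = 4
edges = np.linspace(values.min(), values.max(), n_bins + 1)
bins = np.clip(np.digitize(values, edges[1:-1]), 0, n_bins - 1)

print(bins)  # [0 0 0 1 1 2 3 3]
```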
4. K-Means Binning
Advantages
1. Adaptive: bin edges follow the natural clusters in the data
2. Minimizes within-bin variance
3. You can search for the ideal number of bins (e.g. with the elbow method)
Disadvantages
1. Sensitive to initialization
2. Computationally expensive
3. Assumes clusters of similar size and density
4. Handling of outliers: extreme values can pull centroids and distort the bin edges
5. Interpretability: data-driven edges are harder to explain than fixed, round-number ranges
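A minimal 1-D k-means binning sketch in plain NumPy (the data is invented, and `kmeans_bin_edges` is a name made up here): cluster the values, then place bin edges at the midpoints between adjacent centroids. This sketch uses deterministic quantile initialization; random initialization is precisely what makes k-means binning sensitive to initialization.

```python
import numpy as np

def kmeans_bin_edges(x, k, iters=20):
    """1-D Lloyd's k-means; bin edges are midpoints between sorted centroids."""
    centers = np.quantile(x, np.linspace(0, 1, k))  # deterministic init
    for _ in range(iters):
        # Assign each value to its nearest centroid, then recompute centroids.
        assign = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = x[assign == j].mean()
    centers = np.sort(centers)
    return (centers[:-1] + centers[1:]) / 2

x = np.array([1.0, 1.1, 1.2, 5.0, 5.3, 9.8, 10.0, 10.1])
edges = kmeans_bin_edges(x, k=3)
binned = np.digitize(x, edges)
print(binned)  # [0 0 0 1 1 2 2 2]: one bin per natural cluster
```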
5. Threshold Binning (Binarization)
15 February 2024 06:57
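Threshold binning is the two-bin special case: a single cutoff turns a continuous feature into a yes/no indicator. A one-line sketch with made-up click counts:

```python
import numpy as np

clicks = np.array([0, 3, 0, 1, 7, 0])

# Binarization: one threshold, two bins (0 = no clicks, 1 = any clicks).
threshold = 0
clicked = (clicks > threshold).astype(int)
print(clicked)  # [0 1 0 1 1 0]
```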
6. Decision Tree Based Binning
15 February 2024 06:57
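A sketch of the core of decision-tree based binning (the data and the helper name `best_split` are made up here): choose the cut point on the feature that minimizes the within-group variance of the target, which is the same criterion a regression-tree stump uses to pick a threshold. Applied recursively to each side, this yields a set of supervised bin edges.

```python
import numpy as np

def best_split(x, y):
    """Return the cut on x that minimizes total within-group variance of y."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best, best_cost = None, np.inf
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical feature values
        left, right = ys[:i], ys[i:]
        cost = left.var() * len(left) + right.var() * len(right)
        if cost < best_cost:
            best, best_cost = (xs[i - 1] + xs[i]) / 2, cost
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.1, 0.2, 0.1, 5.0, 5.2, 5.1])
print(best_split(x, y))  # 6.5: the bin edge lands where the target jumps
```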