
UNIT-5

Kinds of features

There are two types of data, qualitative and quantitative, and features are accordingly classified into four categories:
• Nominal feature.
• Ordinal feature.
• Discrete feature.
• Continuous feature.
Qualitative Features
Qualitative (or categorical) data is data that cannot be measured or counted in the form of numbers. Such data is sorted by category rather than by number, which is why it is also known as categorical data.
These data consist of audio, images, symbols, or text. The gender of a person (male, female, or others) is qualitative data.
Qualitative data captures the perceptions of people; it helps market researchers understand customers' tastes and design their ideas and strategies accordingly.
Qualitative features are further classified into two parts:
Nominal Features (categorical)
Nominal features label variables without any order or quantitative value. The colour of hair is a nominal feature, as one colour cannot be compared with another.
Examples of Nominal Features:
• Colour of hair (Blonde, red, Brown, Black, etc.)
• Marital status (Single, Widowed, Married)
• Nationality (Indian, German, American)
• Gender (Male, Female, Others)
• Eye Color (Black, Brown, etc.)
Ordinal Features
Ordinal features have a natural ordering: the values occupy positions on a scale in some kind of order. These features are used for observations such as customer satisfaction or happiness, but we cannot perform arithmetic on them.
Examples of Ordinal Features:
• Letter grades in the exam (A, B, C, D, etc.)
• Ranking of people in a competition (First, Second, Third, etc.)
• Economic Status (High, Medium, and Low)
• Education Level (Higher, Secondary, Primary)
Quantitative Features
Quantitative features are expressed as numerical values, which makes them countable and amenable to statistical analysis. These features are also known as numerical features.
Examples of Quantitative Features:
• Height or weight of a person or object
• Room Temperature
• Scores and Marks (Ex: 59, 80, 60, etc.)
• Time

Quantitative features are further classified into two parts:


Discrete Features
The term discrete means distinct or separate. Discrete features take values that are integers or whole numbers.
Examples of Discrete Features:
• Total numbers of students present in a class
• Cost of a cell phone
• Numbers of employees in a company
• The total number of players who participated in a competition
• Days in a week

Continuous Features
Continuous features take fractional (real-valued) values measured on a continuous scale.
Examples of Continuous Features:
• Height of a person
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi Frequency
• Market share price

Calculations on Features:
The possible calculations on features are statistics of central tendency, statistics of dispersion
and shape statistics.
Statistics of central tendency:
● The mean or average value;
● The median, which is the middle value if we order the instances from lowest to highest feature
value;
● And the mode, which is the majority value or values.
The second kind of calculation on features is statistics of dispersion, or 'spread'.
Two well-known statistics of dispersion are the variance or average squared deviation from the
(arithmetic) mean, and its square root, the standard deviation.
Other statistics of dispersion include percentiles. The p-th percentile is the value such that p
per cent of the instances fall below it.
The skew and ‘peakedness’ of a distribution can be measured by shape statistics such as
skewness and kurtosis.
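As a brief illustration (a sketch in Python using NumPy and SciPy, which the unit does not prescribe; the data values are made up), these statistics can be computed for a single quantitative feature as follows:

    import numpy as np
    from collections import Counter
    from scipy import stats

    # Toy values of a single quantitative feature (made-up data)
    values = np.array([150, 160, 160, 165, 170, 175, 180, 195])

    # Statistics of central tendency
    mean = values.mean()
    median = np.median(values)
    mode = Counter(values.tolist()).most_common(1)[0][0]   # majority value

    # Statistics of dispersion
    variance = values.var()                        # average squared deviation from the mean
    std_dev = values.std()                         # square root of the variance
    p25, p75 = np.percentile(values, [25, 75])     # 25th and 75th percentiles

    # Shape statistics
    skewness = stats.skew(values)
    kurt = stats.kurtosis(values)

    print(mean, median, mode, variance, std_dev, p25, p75, skewness, kurt)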
Feature transformations
Feature transformations aim at improving the utility of a feature by removing, changing, or
adding information.
Binarisation transforms a categorical feature into a set of Boolean features, one for each value
of the categorical feature.
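As an illustration of binarisation, here is a minimal sketch assuming pandas (the colour values are made up):

    import pandas as pd

    # A categorical (nominal) feature with three values (made-up data)
    df = pd.DataFrame({"colour": ["red", "blonde", "black", "red"]})

    # Binarisation: one Boolean feature per value of the categorical feature
    booleans = pd.get_dummies(df["colour"], prefix="colour")
    print(booleans)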
Unordering trivially turns an ordinal feature into a categorical one by discarding the ordering
of the feature values.
Thresholding and discretisation: Thresholding transforms a quantitative or an ordinal feature
into a Boolean feature by finding a feature value to split on.
Concretely, let f : X → R be a quantitative feature and let t ∈ R be a threshold; then f_t : X → {true, false} is the Boolean feature defined by f_t(x) = true if f(x) ≥ t and f_t(x) = false if f(x) < t. Such thresholds can be chosen in an unsupervised or a supervised way.
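A minimal sketch of thresholding (the feature values and the threshold t = 18 are made-up illustrations, with the threshold chosen by hand rather than learned):

    import numpy as np

    age = np.array([12, 25, 17, 40, 18])   # quantitative feature f(x), made-up values
    t = 18                                  # threshold chosen by hand (unsupervised)

    f_t = age >= t                          # Boolean feature f_t(x) = true iff f(x) >= t
    print(f_t)                              # [False  True False  True  True]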
Discretisation transforms a quantitative feature into an ordinal feature. Each ordinal value is
referred to as a bin and corresponds to an interval of the original quantitative feature.
Unsupervised discretisation methods typically require one to decide the number of bins
beforehand. A simple method that often works reasonably well is to choose the bins so that
each bin has approximately the same number of instances: this is referred to as equal-
frequency discretisation.
Another unsupervised discretisation method is equal-width discretisation, which chooses the
bin boundaries so that each interval has the same width.
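A short sketch of both unsupervised methods, assuming pandas (cut for equal-width bins, qcut for equal-frequency bins; the scores are made up):

    import pandas as pd

    scores = pd.Series([59, 60, 61, 70, 80, 85, 90, 99])   # made-up quantitative feature

    equal_width = pd.cut(scores, bins=4)    # equal-width: every interval has the same width
    equal_freq = pd.qcut(scores, q=4)       # equal-frequency: every bin has roughly the same count

    print(equal_width.value_counts().sort_index())
    print(equal_freq.value_counts().sort_index())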
In supervised discretisation methods, we can distinguish between top–down or divisive
discretisation methods on the one hand, and bottom–up or agglomerative discretisation
methods on the other. Divisive methods work by progressively splitting bins, whereas
agglomerative methods proceed by initially assigning each instance to its own bin and
successively merging bins. In either case an important role is played by the stopping criterion,
which decides whether a further split or merge is worthwhile.
Normalization and calibration:
Thresholding and discretisation are feature transformations that remove the scale of a
quantitative feature.
Normalization and calibration involve adding a scale to an ordinal or categorical feature.
If this is done in an unsupervised fashion it is usually called normalisation, whereas calibration
refers to supervised approaches taking in the (usually binary) class labels.
Feature normalisation is often required to neutralise the effect of different quantitative features
being measured on different scales. If the features are approximately normally distributed, we
can convert them into z-scores by centring on the mean and dividing by the standard deviation.
In certain cases it is mathematically more convenient to divide by the variance instead. If we do not want to assume normality, we can centre on the median and divide by the interquartile range.
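A brief sketch of both normalisation options, assuming NumPy (the feature values are made up):

    import numpy as np

    x = np.array([150.0, 160.0, 165.0, 170.0, 190.0])   # made-up feature values

    # z-scores: centre on the mean and divide by the standard deviation
    z = (x - x.mean()) / x.std()

    # Robust alternative: centre on the median and divide by the interquartile range
    q1, q3 = np.percentile(x, [25, 75])
    robust = (x - np.median(x)) / (q3 - q1)

    print(z)
    print(robust)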
Feature calibration is understood as a supervised feature transformation adding a meaningful
scale carrying class information to arbitrary features.
The problem of feature calibration can thus be stated as follows: given a feature F : X → F, construct a calibrated feature F_c : X → [0,1] such that F_c(x) estimates the probability F_c(x) = P(⊕ | v), where v = F(x) is the value of the original feature for x.
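As a rough sketch of this idea (using empirical frequency estimates for a categorical feature; the toy data below are made up), the calibrated value for each feature value v can be estimated as the fraction of positive training instances having that value:

    import pandas as pd

    # Toy training set: a categorical feature F and a binary class label (1 = positive)
    df = pd.DataFrame({
        "colour": ["red", "red", "blue", "blue", "blue", "green"],
        "label":  [1,     0,     1,      1,      0,      0],
    })

    # F_c(x) estimates P(positive | v), where v = F(x): the empirical fraction of
    # positive instances among training instances with feature value v
    calibration_map = df.groupby("colour")["label"].mean()
    df["colour_calibrated"] = df["colour"].map(calibration_map)
    print(df)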
Feature construction and selection
Feature selection

A feature is an attribute that has an impact on a problem or is useful for the problem, and
choosing the important features for the model is known as feature selection.
Below are some benefits of using feature selection in machine learning:
o It helps in avoiding the curse of dimensionality.
o It helps in the simplification of the model so that it can be easily interpreted by the
researchers.
o It reduces the training time.
o It reduces overfitting and hence enhances generalization.
Feature Selection Techniques
There are mainly two types of Feature Selection techniques, which are:
Supervised Feature Selection technique
Supervised Feature selection techniques consider the target variable and can be used for the
labelled dataset.
Unsupervised Feature Selection technique
Unsupervised Feature selection techniques ignore the target variable and can be used for the
unlabelled dataset.

There are mainly three techniques under supervised feature selection:


1. Wrapper Methods
In the wrapper methodology, feature selection is treated as a search problem in which different combinations of features are made, evaluated, and compared with one another. A learning algorithm is trained iteratively on subsets of features: on the basis of the model's output, features are added or removed, and the model is trained again with the new feature set.
Some techniques of wrapper methods are:
o Forward selection - Forward selection is an iterative process, which begins with an
empty set of features. After each iteration, it keeps adding on a feature and evaluates
the performance to check whether it is improving the performance or not. The process
continues until the addition of a new variable/feature does not improve the performance
of the model.
o Backward elimination - Backward elimination is also an iterative approach, but it is
the opposite of forward selection. This technique begins the process by considering all
the features and removes the least significant feature. This elimination process
continues until removing the features does not improve the performance of the model.
o Exhaustive Feature Selection - Exhaustive feature selection evaluates every feature subset by brute force: it tries each possible combination of features and returns the best-performing feature set, which makes it thorough but computationally expensive.
o Recursive Feature Elimination - Recursive feature elimination is a greedy optimization approach in which features are selected by recursively considering smaller and smaller subsets of features. An estimator is trained on each set of features, and the importance of each feature is determined through the coef_ attribute or the feature_importances_ attribute.
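For example, recursive feature elimination is available in scikit-learn; a minimal sketch follows (the breast-cancer dataset and the choice of a logistic regression estimator are illustrative assumptions):

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    X, y = load_breast_cancer(return_X_y=True)

    # The estimator exposes coef_, which RFE uses to rank features and recursively
    # drop the least important ones until 10 remain
    selector = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=10)
    selector.fit(X, y)

    print(selector.support_)    # Boolean mask of the selected features
    print(selector.ranking_)    # rank 1 = selected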
2. Filter Methods
In filter methods, features are selected on the basis of statistical measures.
These methods do not depend on the learning algorithm and choose the features as a pre-processing step.
The filter method removes irrelevant features and redundant columns from the model by ranking them with different metrics.
The advantage of filter methods is that they need little computational time and do not overfit the data.
Some common techniques of Filter methods are as follows:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Information Gain: Information gain measures the reduction in entropy obtained by splitting the dataset on a feature. It can be used as a feature selection technique by calculating the information gain of each variable with respect to the target variable.
Chi-square Test: Chi-square test is a technique to determine the relationship between the
categorical variables. The chi-square value is calculated between each feature and the target
variable, and the desired number of features with the best chi-square value is selected.
Fisher's Score: Fisher's score is a popular supervised technique for feature selection. It ranks the variables by Fisher's criterion in descending order, and we can then select the variables with a large Fisher's score.
Missing Value Ratio: The missing value ratio can be used to evaluate features against a threshold. It is computed as the number of missing values in a column divided by the total number of observations; variables whose ratio exceeds the threshold can be dropped.
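A short sketch of filter-style selection with scikit-learn (the dataset and the choice of k = 5 are illustrative; chi-square scoring requires non-negative feature values, which holds here):

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

    X, y = load_breast_cancer(return_X_y=True)

    # Chi-square test: keep the 5 features with the highest chi-square statistic
    chi2_selector = SelectKBest(score_func=chi2, k=5).fit(X, y)
    print(chi2_selector.get_support(indices=True))

    # Information-gain-style ranking via mutual information with the target
    mi_selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
    print(mi_selector.get_support(indices=True))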

3. Embedded Methods
Embedded methods combined the advantages of both filter and wrapper methods by
considering the interaction of features along with low computational cost. These are fast
processing methods similar to the filter method but more accurate than the filter method.
These methods are also iterative, which evaluates each iteration, and optimally finds the most
important features that contribute the most to training in a particular iteration. Some
techniques of embedded methods are:
o Regularization - Regularization adds a penalty term to the parameters of the machine learning model to avoid overfitting. The penalty is applied to the coefficients and shrinks some of them to zero; features with zero coefficients can then be removed from the dataset. Common choices are L1 regularization (Lasso) and Elastic Net (combined L1 and L2 regularization).
o Random Forest Importance - Tree-based methods provide feature importances that give a way of selecting features: feature importance specifies which features matter most in model building or have the greatest impact on the target variable. Random Forest is such a tree-based method, a bagging algorithm that aggregates many decision trees. It automatically ranks the nodes by their performance, i.e. their decrease in (Gini) impurity over all the trees. Nodes are ordered by these impurity values, which allows the trees to be pruned below a chosen node; the remaining nodes correspond to a subset of the most important features.
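A compact sketch of both embedded approaches with scikit-learn (the dataset and hyperparameter values are illustrative; Lasso is fitted here on the 0/1 labels purely as a selection device):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import Lasso

    X, y = load_breast_cancer(return_X_y=True)

    # L1 (Lasso) regularization: features whose coefficients are shrunk to zero are dropped
    lasso_selector = SelectFromModel(Lasso(alpha=0.01, max_iter=10000)).fit(X, y)
    print(lasso_selector.get_support(indices=True))

    # Random forest importance: rank features by their mean decrease in Gini impurity
    forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
    print(forest.feature_importances_)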
How to choose a Feature Selection Method?
The choice of statistical measure depends on the data types of the input and output variables:
o Numerical input, numerical output: Pearson's correlation coefficient (linear correlation); Spearman's rank coefficient (non-linear correlation).
o Numerical input, categorical output: ANOVA correlation coefficient (linear); Kendall's rank coefficient (non-linear).
o Categorical input, numerical output: Kendall's rank coefficient (linear); ANOVA correlation coefficient (non-linear).
o Categorical input, categorical output: Chi-squared test (contingency tables); Mutual information.

Feature construction

One of the best-known algebraic feature construction methods is principal component analysis (PCA). Principal components are new features constructed as linear combinations of the given features.
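A minimal PCA sketch with scikit-learn (the dataset and the choice of two components are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, _ = load_breast_cancer(return_X_y=True)

    # PCA is scale-sensitive, so the features are standardised first
    X_scaled = StandardScaler().fit_transform(X)

    # Construct two new features as linear combinations of the original thirty
    pca = PCA(n_components=2)
    X_new = pca.fit_transform(X_scaled)

    print(X_new.shape)                      # (569, 2)
    print(pca.explained_variance_ratio_)    # fraction of variance captured by each component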
Model ensembles:
Ensemble simply means combining multiple models. Thus a collection of models is used to
make predictions rather than an individual model.
(Some background: weak learners have low prediction accuracy, similar to random guessing; they cannot classify data that varies too much from their original dataset. Strong learners have higher prediction accuracy.)
Bagging (variance-reduction technique)
Bagging, also known as Bootstrap aggregating, is an ensemble learning technique that helps to
improve the performance and accuracy of machine learning algorithms.
It is used to deal with bias-variance trade-offs and reduces the variance of a prediction model.
Bagging avoids overfitting of data and is used for both regression and classification models,
specifically for decision tree algorithms.
It consists of two steps: bootstrapping and aggregation.
Bootstrapping
Involves resampling subsets of data with replacement from an initial dataset. In other words,
subsets of data are taken from the initial dataset. These subsets of data are called bootstrapped
datasets or, simply, bootstraps. Resampled ‘with replacement’ means an individual data point
can be sampled multiple times. Each bootstrap dataset is used to train a weak learner.
Aggregating
The individual weak learners (weak models) are trained independently from each other. Each
learner makes independent predictions. The results of those predictions are aggregated at the
end to get the overall prediction. The predictions are aggregated using either max voting or
averaging.
Max Voting is commonly used for classification problems. It consists of taking the mode of
the predictions (the most occurring prediction).
Averaging is generally used for regression problems. It involves taking the average of the
predictions. The resulting average is used as the overall prediction for the combined model.
Steps of Bagging

The steps of bagging are as follows:


1. We have an initial training dataset containing n instances.
2. We create m subsets of data from the training set. Each subset is sampled from the initial dataset with replacement, which means that a specific data point can be sampled more than once.
3. For each subset of data, we train the corresponding weak learners (weak models)
independently. These models are homogeneous, meaning that they are of the same
type.
4. Each model makes a prediction.
5. The predictions are aggregated into a single prediction. For this, either max voting or
averaging is used.
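These steps are implemented directly by scikit-learn's BaggingClassifier; a brief sketch (the dataset and parameter values are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # 50 homogeneous weak learners, each trained on a bootstrap sample drawn with
    # replacement; class predictions are aggregated by majority (max) voting
    bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(),   # older scikit-learn versions call this base_estimator
        n_estimators=50,
        bootstrap=True,
        random_state=0,
    )
    bagging.fit(X_train, y_train)
    print(bagging.score(X_test, y_test))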
Random forests (train an ensemble of tree models from bootstrap samples and random
subspaces.)
Random Forest is one of the most popular and commonly used algorithms by Data Scientists.
Random forest is a Supervised Machine Learning Algorithm that is used widely in
Classification and Regression problems. It builds decision trees on different samples and takes
their majority vote for classification and average in case of regression.
One of the most important features of the Random Forest algorithm is that it can handle datasets containing continuous variables (as in regression) as well as categorical variables (as in classification), and it performs well on both kinds of task.
Steps Involved in Random Forest Algorithm

Step 1: In the random forest model, a subset of data points and a subset of features is selected for constructing each decision tree. Simply put, n random records and m features are taken from a dataset having k records.
Step 2: Individual decision trees are constructed for each sample.
Step 3: Each decision tree will generate an output.
Step 4: Final output is considered based on Majority Voting or Averaging for Classification
and regression, respectively.
For example, consider a fruit basket as the dataset. n samples are taken from the fruit basket, and an individual decision tree is constructed for each sample. Each decision tree generates an output, and the final output is decided by majority voting: if most trees predict apple rather than banana, the final output is apple.

Important Features of Random Forest


• Diversity: Not all attributes/variables/features are considered while making an
individual tree; each tree is different.
• Immune to the curse of dimensionality: Since each tree does not consider all the
features, the feature space is reduced.
• Parallelization: Each tree is created independently out of different data and attributes.
This means we can fully use the CPU to build random forests.
• Train-test split: In a random forest we do not have to set aside separate train and test data, because roughly one-third of the data (the out-of-bag samples) is never seen by a given decision tree.
• Stability: Stability arises because the result is based on majority voting/ averaging.
Important Hyperparameters in Random Forest
Hyperparameters are used in random forests to either enhance the performance and predictive
power of models or to make the model faster.
Hyperparameters to Increase the Predictive Power
n_estimators: the number of trees the algorithm builds before aggregating the predictions.
max_features: the maximum number of features the random forest considers when splitting a node.
min_samples_leaf: the minimum number of samples required at a leaf node (the related min_samples_split controls the minimum number of samples needed to split an internal node).
criterion: how to measure split quality at each node (Gini impurity, entropy, or log loss).
max_leaf_nodes: the maximum number of leaf nodes in each tree.
Hyperparameters to Increase the Speed
n_jobs: it tells the engine how many processors it is allowed to use. If the value is 1, it can use
only one processor, but if the value is -1, there is no limit.
random_state: controls randomness of the sample. The model will always produce the same
results if it has a definite value of random state and has been given the same hyperparameters
and training data.
oob_score: OOB means out-of-bag. It is a built-in validation method for random forests: roughly one-third of the samples are not used to train a given tree and are instead used to evaluate its performance. These samples are called out-of-bag samples.
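A brief sketch showing how these hyperparameters appear in scikit-learn's RandomForestClassifier (the values are arbitrary illustrations):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)

    forest = RandomForestClassifier(
        n_estimators=200,        # number of trees built before aggregating predictions
        max_features="sqrt",     # features considered when splitting a node
        min_samples_leaf=2,      # minimum samples required at a leaf node
        criterion="gini",        # split quality: "gini", "entropy" or "log_loss"
        max_leaf_nodes=None,     # no limit on leaf nodes per tree
        n_jobs=-1,               # use all available processors
        oob_score=True,          # evaluate on the out-of-bag samples
        random_state=42,         # reproducible results
    )
    forest.fit(X, y)
    print(forest.oob_score_)     # out-of-bag accuracy estimate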
Boosting (bias-reduction technique) [train an ensemble of binary classifiers from
reweighted training sets.]
Boosting is a method used in machine learning to reduce errors in predictive data analysis.
Boosting is an ensemble learning method that iteratively combines a set of weak learners into
strong learners to minimize training errors.
It is done by building a model by using weak models in series. Firstly, a model is built from
the training data. Then the second model is built which tries to correct the errors present in the
first model. This procedure is continued and models are added until either the complete training
data set is predicted correctly or the maximum number of models are added.

How is training in boosting done?


The training method varies with the particular boosting algorithm, but in general an algorithm takes the following steps to train the boosting model:
Step 1: The boosting algorithm assigns equal weight to each data sample and feeds the data to the first model, called the base learner, which makes a prediction for each sample.
Step 2: The false predictions made by the base learner are identified. In the next iteration, the data are passed to the next base learner with a higher weight on these incorrectly predicted samples.
Step 3: Repeat step 2 until the algorithm can correctly classify the output.
Therefore, the main aim of boosting is to focus more on misclassified predictions.
Types of Boosting Algorithms
1. AdaBoost (Adaptive Boosting)
2. Gradient Tree Boosting
3. XGBoost
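A minimal AdaBoost sketch with scikit-learn (the dataset and parameter values are illustrative):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Weak learners (shallow decision trees by default) are trained sequentially;
    # samples misclassified in one round receive a higher weight in the next
    booster = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
    booster.fit(X_train, y_train)
    print(booster.score(X_test, y_test))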
Benefits and Challenges of Boosting
The boosting method presents many advantages and challenges for classification or regression
problems. The benefits of boosting include:
o Ease of Implementation: Boosting can be used with several hyperparameter tuning options to improve fitting. Little data preprocessing is required, and many boosting implementations have built-in routines to handle missing data. In Python, the scikit-learn ensemble module makes it easy to implement popular boosting methods such as AdaBoost and gradient boosting, while XGBoost is available as a separate library.
o Reduction of bias: Boosting algorithms combine multiple weak learners in a
sequential method, iteratively improving upon observations. This approach can help to
reduce high bias, commonly seen in shallow decision trees and logistic regression
models.
o Computational Efficiency: Since boosting algorithms select only the features that increase their predictive power during training, they can help reduce dimensionality and increase computational efficiency.
And the challenges of boosting include:
o Overfitting: There's some dispute in the research around whether or not boosting can
help reduce overfitting or make it worse. We include it under challenges because in the
instances that it does occur, predictions cannot be generalized to new datasets.
o Intense computation: Sequential training in boosting is hard to scale up. Since each
estimator is built on its predecessors, boosting models can be computationally
expensive, although XGBoost seeks to address scalability issues in other boosting
methods. Boosting algorithms can be slower to train when compared to bagging, as a
large number of parameters can also influence the model's behaviour.
o Vulnerability to outlier data: Boosting models are vulnerable to outliers or data values
that are different from the rest of the dataset. Because each model attempts to correct
the faults of its predecessor, outliers can skew results significantly.
o Real-time implementation: You might find it challenging to use boosting for real-time
implementation because the algorithm is more complex than other processes. Boosting
methods have high adaptability, so you can use various model parameters that
immediately affect the model's performance.
Applications of Boosting
Boosting algorithms are well suited for artificial intelligence projects across a broad range of
industries, including:
o Healthcare: Boosting is used to lower errors in medical data predictions, such as
predicting cardiovascular risk factors and cancer patient survival rates. For example,
research shows that ensemble methods significantly improve the accuracy in
identifying patients who could benefit from preventive treatment of cardiovascular
disease while avoiding unnecessary treatment of others. Likewise, another study found
that applying boosting to multiple genomics platforms can improve the prediction of
cancer survival time.
o IT: Gradient boosted regression trees are used in search engines for page rankings,
while the Viola-Jones boosting algorithm is used for image retrieval. As noted by
Cornell, boosted classifiers allow the computations to be stopped sooner when it's clear
which direction a prediction is headed. A search engine can stop evaluating lower-
ranked pages, while image scanners will only consider images containing the desired
object.
o Finance: Boosting is used with deep learning models to automate critical tasks,
including fraud detection, pricing analysis, and more. For example, boosting methods in credit card fraud detection and financial product pricing analysis improve the accuracy of analyzing massive data sets and help minimize financial losses.

Differences Between Bagging and Boosting

1. Bagging is the simplest way of combining predictions that belong to the same type, whereas boosting combines predictions that belong to different types.
2. Bagging aims to decrease variance, not bias; boosting aims to decrease bias, not variance.
3. In bagging, each model receives equal weight; in boosting, models are weighted according to their performance.
4. In bagging, each model is built independently; in boosting, new models are influenced by the performance of previously built models.
5. In bagging, different training subsets are selected by random row sampling with replacement from the entire training dataset; in boosting, every new subset contains the elements that were misclassified by previous models.
6. Bagging tries to solve the over-fitting problem; boosting tries to reduce bias.
7. If the classifier is unstable (high variance), apply bagging; if the classifier is stable and simple (high bias), apply boosting.
8. In bagging, base classifiers are trained in parallel; in boosting, base classifiers are trained sequentially.
9. Example: the Random Forest model uses bagging; AdaBoost uses boosting.
