Unit 3 ML
In statistical learning theory, there are two main types of variables:
Dependent Variable — a variable (y) whose values depend on the values of other variables (a dependent variable is sometimes also referred to as a target variable)
Independent Variable — a variable (x) whose value does not depend on the values of other variables (independent variables are sometimes also referred to as predictor variables, input variables, explanatory variables, or features)
A common example of an independent variable is age: there is nothing one can do to increase or decrease age, so this variable is independent.
Weight — a person’s weight is dependent on his or her age, diet, and activity levels (as
well as other factors)
In this example, which shows how the price of a home is affected by the size of the home,
sq. ft is the independent variable while price of the home is the dependent variable.
Statistical Model:
y = mx + c
where m represents the gradient and c is the intercept. Another way this equation can be expressed is in standard regression notation, which looks something like:
y = β0 + β1x
If we suppose that the size of the home is not the only independent variable when determining the price and that the number of bathrooms is also an independent variable, the equation would look like:
y = β0 + β1x1 + β2x2
where x1 is the size of the home and x2 is the number of bathrooms.
Model Generalization:
To build an effective model, the available data needs to be used in a way that makes the model generalize to unseen situations. Two common problems that occur when building models are that the model under-fits or over-fits the data.
Under-fitting — when a statistical model does not adequately capture the underlying structure of the data and, therefore, does not include some parameters that would appear in a correctly specified model.
Over-fitting — when a statistical model captures the noise in the training data along with its underlying structure, so it fits the training data very well but generalizes poorly to unseen data.
Feature Selection:
Feature selection is one of the important concepts of machine learning and highly impacts the performance of a model. Because machine learning works on the principle of "Garbage In, Garbage Out", we always need to feed the most appropriate and relevant features to the model in order to get better results.
In this topic, we will discuss different feature selection techniques for machine
learning. But before that, let's first understand some basics of feature selection.
1. Wrapper Methods
In the wrapper methodology, feature selection is treated as a search problem in which different combinations of features are made, evaluated, and compared with other combinations. The algorithm is trained iteratively on a subset of features; on the basis of the model's output, features are added or removed, and the model is trained again with the new feature set.
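As a minimal sketch of the wrapper approach, and only as an illustration, the snippet below uses scikit-learn's RFE (Recursive Feature Elimination), which repeatedly refits an estimator and drops the weakest features; the breast cancer dataset, the logistic regression estimator, and the choice of 10 features are assumptions made for this example.

```python
# Illustrative wrapper-style feature selection with Recursive Feature Elimination (RFE)
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# RFE refits the estimator on smaller and smaller feature subsets,
# removing the least important feature at each iteration.
selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=10, step=1)
selector.fit(X, y)

print("Selected feature mask:", selector.support_)
print("Feature ranking (1 = kept):", selector.ranking_)
```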
2. Filter Methods
In the filter method, features are selected on the basis of statistical measures. This method does not depend on the learning algorithm and chooses the features as a pre-processing step.
The filter method filters out the irrelevant and redundant columns by ranking the features with different metrics.
The advantage of filter methods is that they need little computational time and do not overfit the data. Some common filter techniques are:
o Information Gain
o Chi-square Test
o Fisher's Score
o Missing Value Ratio
Missing Value Ratio:
The missing value ratio can be used for evaluating a feature against a threshold value. It is obtained by dividing the number of missing values in a column by the total number of observations. A variable whose ratio is greater than the threshold value can be dropped.
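As a rough sketch of the filter approach, the snippet below computes the missing value ratio described above on a small made-up DataFrame and then applies a chi-square test with scikit-learn's SelectKBest; the toy data, the 20% threshold, and the choice of keeping two features are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# --- Missing value ratio: missing values per column / total observations ---
df = pd.DataFrame({"age": [25, None, 40, 35],
                   "salary": [50, 60, None, None],
                   "city": ["A", "B", "A", "C"]})
missing_ratio = df.isnull().sum() / len(df)
threshold = 0.20                                  # illustrative threshold
print("Columns to drop:", list(missing_ratio[missing_ratio > threshold].index))

# --- Chi-square test: keep the k features most dependent on the target ---
X, y = load_iris(return_X_y=True)                 # iris features are non-negative, as chi2 requires
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print("Chi-square scores per feature:", selector.scores_)
```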
3. Embedded Methods
Embedded methods combine the advantages of both the filter and wrapper methods.
They are fast, like the filter method, but more accurate than the filter method.
These methods are also iterative: each iteration is evaluated to find the features that contribute the most to the training in that particular iteration. Some techniques of embedded methods are:
What is L1 Regularization?
L1 (Lasso) regularization adds the sum of the absolute values of the weights to the loss function, for example Loss = Error(y, ŷ) + λ Σ |W|, where W denotes the weights and B the bias of the model, and λ controls the strength of the penalty. Because the penalty can shrink weights to exactly zero, L1 regularization performs feature selection as a side effect.
What is L2 Regularization?
L2 (Ridge) regularization adds the sum of the squared weights to the loss function, Loss = Error(y, ŷ) + λ Σ W², which shrinks the weights towards zero but rarely makes them exactly zero.
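As an illustrative sketch of an embedded method, the snippet below fits an L1-regularized (Lasso) regression inside scikit-learn's SelectFromModel so that feature selection happens as part of model training; the diabetes dataset and the alpha value are assumptions made for this example.

```python
# Illustrative embedded feature selection: L1 (Lasso) regularization + SelectFromModel
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# The L1 penalty drives the weights of unimportant features to exactly zero,
# so the fitted model itself tells us which features to keep.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)

print("Learned weights:", selector.estimator_.coef_)
print("Kept feature indices:", selector.get_support(indices=True))
```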
Model Evaluation and Selection:
Model evaluation is a method of assessing the correctness of models on test data.
The test data consists of data points that have not been seen by the model before.
Model selection is a technique for selecting the best model after the individual
models are evaluated based on the required criteria.
Classification Metrics:
1) Confusion Matrix
For every classification model prediction, a matrix called the confusion matrix can
be constructed which demonstrates the number of test cases correctly and
incorrectly classified.
It looks something like this (considering 1 as the Positive class and 0 as the Negative class):

              Actual 0               Actual 1
Predicted 0   True Negative (TN)     False Negative (FN)
Predicted 1   False Positive (FP)    True Positive (TP)
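A minimal sketch of building a confusion matrix with scikit-learn; the label vectors below are invented purely to show the layout, and note that scikit-learn places actual classes on the rows and predicted classes on the columns.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = positive class, 0 = negative class
y_actual    = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
y_predicted = [0, 1, 1, 0, 0, 1, 0, 1, 0, 1]

# For binary labels {0, 1} the result is laid out as [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_actual, y_predicted)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```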
2)Accuracy
Accuracy is the simplest metric and can be defined as the number of test cases
correctly classified divided by the total number of test cases.
It can be applied to most generic problems but is not very useful when it comes to
unbalanced datasets.
For instance, if we are detecting frauds in bank data, the ratio of fraud to non-fraud cases can be 1:99. In such cases, if accuracy is used, the model will turn out to be 99% accurate by predicting all test cases as non-fraud. The 99% accurate model will be completely useless.
Therefore, for such a case, a metric is required that can focus on the ten fraud data
points which were completely missed by the model.
3.Precision
Precision is the ratio of correct positive classifications to the total number of predicted positive classifications: Precision = TP / (TP + FP). The greater this fraction, the higher the precision, which means the better the ability of the model to correctly classify the positive class.
4.Recall
Recall tells us the number of positive cases correctly identified out of the total number of positive cases: Recall = TP / (TP + FN).
Going back to the fraud problem, the recall value will be very useful in fraud cases
because a high recall value will indicate that a lot of fraud cases were identified out
of the total number of frauds.
5.F1 Score
The F1 score is the harmonic mean of precision and recall: F1 = 2 × (Precision × Recall) / (Precision + Recall). Consider, for example, a model that predicts whether an aeroplane part is faulty. Here, precision will be required to save on the company's cost (because plane parts are extremely expensive), while recall will be required to ensure that the machinery is stable and not a threat to human lives.
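The short sketch below computes precision, recall, and F1 with scikit-learn on the same invented labels used in the confusion matrix example above; the numbers are illustrative only.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_actual    = [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
y_predicted = [0, 1, 1, 0, 0, 1, 0, 1, 0, 1]

# Precision = TP / (TP + FP), Recall = TP / (TP + FN),
# F1 = 2 * (Precision * Recall) / (Precision + Recall)
print("Precision:", precision_score(y_actual, y_predicted))
print("Recall:   ", recall_score(y_actual, y_predicted))
print("F1 score: ", f1_score(y_actual, y_predicted))
```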
6.AUC-ROC
The ROC curve is a plot of the true positive rate (recall, TP / (TP + FN)) against the false positive rate (FP / (FP + TN)). AUC-ROC stands for Area Under the Receiver Operating Characteristic curve, and the higher the area, the better the model performance.
If the curve is somewhere near the 50% diagonal line, it suggests that the model predicts the output variable randomly.
7.Log Loss
Log loss is a very effective classification metric and is equivalent to -1* log
(likelihood function) where the likelihood function suggests how likely the model
thinks the observed set of outcomes was.
Since the likelihood function provides very small values, a better way to interpret them is to take the log, and the negative sign is added to reverse the order of the metric so that a lower loss score suggests a better model.
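A small sketch of computing log loss with scikit-learn; the predicted probabilities are invented values chosen to show that confident, mostly correct predictions give a low loss while uncertain or wrong ones give a high loss.

```python
from sklearn.metrics import log_loss

y_actual = [1, 0, 1, 1, 0]
# Invented predicted probabilities of the positive class for each observation
p_good = [0.9, 0.1, 0.8, 0.7, 0.2]   # confident and mostly correct -> low loss
p_poor = [0.4, 0.6, 0.5, 0.3, 0.9]   # uncertain or wrong           -> high loss

print("Log loss (better model):", log_loss(y_actual, p_good))
print("Log loss (worse model): ", log_loss(y_actual, p_poor))
```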
8. Gain and Lift Charts
Gain and lift charts are tools that evaluate model performance just like the confusion
matrix but with a subtle, yet significant difference. The confusion matrix determines
the performance of the model on the whole population or the entire test set, whereas
the gain and lift charts evaluate the model on portions of the whole population.
Therefore, we have a score (y-axis) for every % of the population (x-axis).
Lift charts measure the improvement that a model brings in compared to random
predictions. The improvement is referred to as the ‘lift’.
9. K-S Chart
The K-S (Kolmogorov-Smirnov) chart measures the degree of separation between the distributions of the positive and the negative class; the greater the separation, the better the model distinguishes between the two classes.
Regression Metrics:
1. Mean Squared Error (MSE)
MSE is a simple metric that calculates the difference between the actual value and the predicted value (the error), squares it, and then provides the mean of all the errors.
MSE is very sensitive to outliers and will show a very high error value even if a few outliers are present in the otherwise well-fitted model predictions.
2. Root Mean Squared Error (RMSE)
RMSE is the root of MSE and is beneficial because it helps to bring down the scale
of the errors closer to the actual values, making it more interpretable.
3. Mean Absolute Error (MAE)
If one wants to ignore the outlier values to a certain degree, MAE is the choice since
it reduces the penalty of the outliers significantly with the removal of the square
terms.
4. Root Mean Squared Log Error (RMSLE)
In RMSLE, the same equation as that of RMSE is followed, except that a log function is applied to the actual and predicted values: RMSLE = sqrt(mean((log(x + 1) - log(y + 1))²)), where x is the actual value and y is the predicted value. This helps to scale down the effect of the outliers by downplaying the higher error rates with the log function. RMSLE also helps to capture a relative error (by comparing all the error values) through the use of logs.
5. R-Squared (R²)
R-Squared helps to identify the proportion of the variance of the target variable that can be captured with the help of the independent variables or predictors.
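A brief sketch of the regression metrics above using NumPy and scikit-learn; the actual and predicted values are invented for illustration.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error, r2_score)

y_actual    = np.array([3.0, 5.0, 2.5, 7.0, 4.5])     # invented values
y_predicted = np.array([2.8, 5.4, 2.9, 6.1, 4.6])

mse   = mean_squared_error(y_actual, y_predicted)
rmse  = np.sqrt(mse)                                   # root of MSE, back on the original scale
mae   = mean_absolute_error(y_actual, y_predicted)     # no squaring, so outliers are penalized less
rmsle = np.sqrt(mean_squared_log_error(y_actual, y_predicted))
r2    = r2_score(y_actual, y_predicted)                # proportion of target variance explained

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  RMSLE={rmsle:.3f}  R^2={r2:.3f}")
```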
Clustering Metrics:
1. Dunn Index
The Dunn Index focuses on identifying clusters that have low variance (among all members in the cluster) and are compact. The mean values of the different clusters also need to be far apart.
Dunn Index = min δ(Xi, Xj) / max ∆(Xk)
where δ(Xi, Xj) is the intercluster distance, i.e. the distance between clusters Xi and Xj, and ∆(Xk) is the intracluster distance of cluster Xk, i.e. the distance within cluster Xk.
However, the disadvantage of the Dunn Index is that the computation cost increases with a higher number of clusters and more dimensions.
2. Silhouette Coefficient
The Silhouette Coefficient tracks how close every point in one cluster is to the points in the other clusters, and it takes values in the range of -1 to +1.
3. Elbow Method
The elbow method plots the within-cluster sum of squares (inertia) against the number of clusters; the point where the curve bends, the "elbow", indicates a suitable number of clusters.
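The sketch below illustrates both the elbow method and the silhouette coefficient with scikit-learn's KMeans; the synthetic blobs and the range of k values are arbitrary choices made for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 well-separated clusters (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_ is the within-cluster sum of squares used by the elbow method;
    # silhouette_score ranges from -1 (poor) to +1 (well-separated clusters)
    print(f"k={k}  inertia={km.inertia_:.1f}  silhouette={silhouette_score(X, km.labels_):.3f}")
```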
Data Splitting Methods:
1) Random Split
Random splits are used to randomly sample a percentage of the data into training, testing, and preferably validation sets. The advantage of this method is that there is a good chance that the original population is well represented in all three sets. In more formal terms, random splitting will prevent a biased sampling of data.
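A quick sketch of a random split into training, validation, and test sets using scikit-learn's train_test_split; the iris dataset and the roughly 70/15/15 proportions are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First carve off 30% of the data, then split that portion half-and-half
# into validation and test sets (roughly 70/15/15 overall).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```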
2)Time-Based Split
There are some types of data where random splits are not possible. For
example, if we have to train a model for weather forecasting, we cannot randomly
divide the data into training and testing sets. This will jumble up the seasonal pattern!
Such data is often referred to by the term – Time Series.
In such cases, a time-wise split is used. The training set can have data for the last
three years and 10 months of the present year. The last two months can be reserved
for the testing or validation set.
3)K-Fold Cross-Validation
The cross-validation technique works by randomly shuffling the dataset and then
splitting it into k groups. Thereafter, on iterating over each group, the group needs
to be considered as a test set while all other groups are clubbed together into the
training set. The model is tested on the test group and the process continues for k
groups.
Thus, by the end of the process, one has k different results on k different test groups.
The best model can then be selected easily by choosing the one with the highest
score.
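As a minimal sketch, the snippet below runs 5-fold cross-validation for two candidate models and compares their mean scores; the dataset and the two models are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Each model is trained and tested k = 5 times, once per fold,
# and the mean accuracy across folds is used to compare the models.
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```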
4.Bootstrap
Bootstrap is one of the most powerful ways to obtain a stabilized model. It is close to
the random splitting technique since it follows the concept of random sampling.
The first step is to select a sample size (which is usually equal to the size of the original dataset). Thereafter, a data point is randomly selected from the original dataset and added to the bootstrap sample; after the addition, the point is put back into the original dataset (sampling with replacement). This process is repeated N times, where N is the sample size.
The model is trained on the bootstrap sample and then evaluated on all those data
points that did not make it to the bootstrapped sample. These are called the out-of-
bag samples.
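The sketch below builds a single bootstrap sample by sampling with replacement, trains on it, and evaluates on the out-of-bag points; the dataset and the decision tree model are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n = len(X)                               # bootstrap sample size = size of the original dataset

rng = np.random.default_rng(42)
idx = rng.integers(0, n, size=n)         # draw n indices with replacement
oob = np.setdiff1d(np.arange(n), idx)    # out-of-bag points: never drawn into the sample

model = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
print("Out-of-bag accuracy:", model.score(X[oob], y[oob]))
```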
Probabilistic Measures:
An important point to note here is that the model performance taken into account in probabilistic measures is calculated from the training set only; a hold-out test set is typically not required.
A notable disadvantage, however, is that probabilistic measures do not consider the uncertainty of the models and have a chance of selecting simpler models over complex models.
1) Akaike Information Criterion (AIC)
AIC scores a model using the number of its parameters and the maximum value of its likelihood function; the lower the AIC, the better the model. The limitation of AIC is that it is not very good at generalizing, as it tends to select complex models that lose less training information.
2) Bayesian Information Criterion (BIC)
BIC was derived from the Bayesian probability concept and is suited for models that
are trained under the maximum likelihood estimation.
BIC penalizes the model for its complexity and is preferably used when the size of
the dataset is not very small (otherwise it tends to settle on very simple models).
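As a rough sketch only, the snippet below computes AIC and BIC for an ordinary least-squares fit from its Gaussian log-likelihood, using the standard formulas AIC = 2k - 2·ln(L) and BIC = k·ln(n) - 2·ln(L); the diabetes dataset and the linear model are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
model = LinearRegression().fit(X, y)
resid = y - model.predict(X)

n = len(y)
k = X.shape[1] + 1                           # estimated parameters: one weight per feature + intercept
sigma2 = np.mean(resid ** 2)                 # maximum-likelihood estimate of the error variance
log_likelihood = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

aic = 2 * k - 2 * log_likelihood             # penalizes complexity lightly
bic = k * np.log(n) - 2 * log_likelihood     # penalty grows with the dataset size n

print(f"AIC = {aic:.1f}, BIC = {bic:.1f}")
```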
3) Minimum Description Length (MDL)
MDL = L(h) + L(D | h)
where:
h = the model
D = the predictions made by the model
L(h) = the number of bits required to represent the model
L(D | h) = the number of bits required to represent the predictions from the model
The model with the lowest MDL is preferred.