Dependent Independent Variable (S) : Regression: What Is Regression
What is Regression?
Regression is a set of techniques for estimating the relationship between a dependent (target) variable and one or more independent variables (predictors). These techniques are mostly driven by three factors: the number of independent variables, the type of dependent variable, and the shape of the regression line.
Common regression techniques:
1. Linear Regression
2. Logistic Regression
3. Polynomial Regression
4. Ridge Regression
5. Lasso Regression
6. ElasticNet Regression
7. SVM (Support Vector Regression)
8. Decision Tree Regression
9. Random Forest Regression
10. Naïve Bayes Regression
Multiple linear regression: two or more independent variables are used to predict the value of a dependent variable. The difference between simple and multiple linear regression is the number of independent variables.
y = b + m1*X1 + m2*X2 + m3*X3
Note: (Follow-up question) Some interviewers expect this when we talk about Linear Regression.
This task can be easily accomplished by the Least Squares Method. It is the most common method used for fitting a regression line. It calculates the best-fit line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line. Because the deviations are squared before being added, positive and negative deviations do not cancel out.
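To make this concrete, here is a minimal sketch (my own illustration, not from these notes or the notebook) of fitting y = b + m1*X1 + m2*X2 + m3*X3 by least squares, assuming scikit-learn and some made-up data:

# A minimal sketch: ordinary least squares fit of a multiple linear regression.
# The data below is made up purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                      # three independent variables X1, X2, X3
y = 2.0 + 1.5 * X[:, 0] - 3.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.1, 100)

ols = LinearRegression()                  # fits by minimizing the sum of squared residuals
ols.fit(X, y)
print("intercept (b):", ols.intercept_)
print("coefficients (m1, m2, m3):", ols.coef_)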
I think this is enough – no need for us to talk about other metrics (RMSE, R2, Adjusted R2) when we talk about Linear Regression – those will be covered when we talk about Model Performance.
Look at the below points and answer them if they ask something about assumptions (not required, but if still needed wait for Part 2 😊)
Important Points:
Polynomial Regression:
y = b + m1*x + m2*x^2 (a second-degree example; higher degrees add further powers of x)
https://stats.stackexchange.com/questions/92065/why-is-polynomial-regression-considered-a-special-case-of-multiple-linear-regres
In this regression technique, the best-fit line is not a straight line. It is rather a curve that fits the data points.
Note: The comparison with the SVM regressor for polynomial regression is a very important concept to understand – I will try to put some notes in the notebook which I create.
Important Points:
While there might be a temptation to fit a higher-degree polynomial to get lower error, this can result in over-fitting. Always plot the relationships to see the fit and focus on making sure that the curve fits the nature of the problem.
Especially look out for the curve towards the ends and see whether those shapes and trends make sense. Higher-degree polynomials can end up producing weird results on extrapolation. (I will add more points on this while comparing with the SVM regressor in the notebook.)
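A small sketch of the above (toy data I made up, assuming scikit-learn): polynomial regression is just linear regression on polynomial features, which is why the stackexchange link calls it a special case of multiple linear regression, and a high degree can go wild on extrapolation:

# Polynomial regression as linear regression on polynomial features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(-3, 3, 60)).reshape(-1, 1)
y = 1.0 + 0.5 * x.ravel() + 2.0 * x.ravel() ** 2 + rng.normal(0, 1.0, 60)

degree2 = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)
degree10 = make_pipeline(PolynomialFeatures(degree=10), LinearRegression()).fit(x, y)

# The degree-10 model usually fits the training points more closely but can
# behave strangely outside the training range (extrapolation) – plot both to see.
x_new = np.array([[5.0]])
print("degree 2 prediction at x=5 :", degree2.predict(x_new))
print("degree 10 prediction at x=5:", degree10.predict(x_new))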
Ridge Regression: Ridge Regression is a technique used when the data suffers from multicollinearity (independent variables are highly correlated). In multicollinearity, even though the least squares estimates (OLS) are unbiased, their variances are large, which deviates the observed value far from the true value. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors.
y = a + b*x
This equation also has an error term. The complete equation becomes:
y = a + b*x + e, where e is the error term (the value needed to correct for the difference between the predicted and the observed value).
In a linear equation, prediction errors can be decomposed into two sub-components: the first is due to bias and the second is due to variance. Prediction error can occur due to either of these components or both. Here, we'll discuss the error caused due to variance.
Ridge regression solves the multicollinearity problem through the shrinkage parameter λ (lambda). Look at the equation below:
minimize Σ(y - ŷ)^2 + λ * Σ β^2
In this equation, we have two components. The first one is the least squares term and the other one is lambda times the summation of β^2 (beta squared), where β is the coefficient. This is added to the least squares term in order to shrink the parameters so that they have very low variance.
y = w0*X0 + w1*X1 + w2*X2 + ... + b
Ridge regression is also a linear model for regression, so the formula it uses to make predictions is the same as the one used for OLS (shown above). In ridge regression the coefficients (w) are chosen not only so that they predict well on the training set but also so that they satisfy an additional constraint: we want the magnitude of the coefficients to be as small as possible, in other words all entries of w should be close to zero. Intuitively this means each feature should have as little effect on the outcome as possible (this means having a small slope). This constraint we put on the model is what we call regularization.
Regularization means explicitly restricting a model to avoid overfitting; the kind used by Ridge regression is called L2 regularization. Below is the objective sklearn uses:
minimize ||Xw - y||^2 + alpha * ||w||^2
Here, alpha is a complexity parameter that controls the amount of shrinkage: the larger the value of alpha, the greater the amount of shrinkage and thus the coefficients become more robust to collinearity.
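A small sketch of the shrinkage in practice (my own toy data with two nearly identical features, assuming scikit-learn; alpha plays the role of lambda above):

# Ridge vs. OLS on deliberately multicollinear data.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
x1 = rng.rand(100)
x2 = x1 + rng.normal(0, 0.01, 100)         # almost a copy of x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(0, 0.1, 100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)          # alpha is the shrinkage parameter

print("OLS coefficients  :", ols.coef_)     # typically large and unstable
print("Ridge coefficients:", ridge.coef_)   # smaller, shared across the correlated features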
Lasso Regression
Similar to Ridge Regression, Lasso (Least Absolute Shrinkage and Selection Operator) also penalizes the size of the regression coefficients, but it penalizes their absolute values. In addition, it is capable of reducing the variability and improving the accuracy of linear regression models. Look at the objective:
minimize Σ(y - ŷ)^2 + λ * Σ |β|
Lasso regression differs from ridge regression in that it uses absolute values in the penalty function instead of squares. This leads to penalizing (or equivalently constraining) the sum of the absolute values of the estimates, which causes some of the parameter estimates to turn out exactly zero. The larger the penalty applied, the further the estimates get shrunk towards zero. This results in variable selection out of the given n variables.
Intuitively
An alternative to Ridge for regularizing linear regression is Lasso. As with ridge regression, using the lasso also restricts coefficients to be close to zero, but in a slightly different way, called L1 regularization. The consequence of L1 regularization is that when using the lasso, some coefficients are exactly zero. This means some features are entirely ignored by the model. This can be seen as a form of automatic feature selection. Having some coefficients be exactly zero often makes a model easier to interpret, and can reveal the most important features of your model.
This is how sklearn expresses the above objective: (1 / (2 * n_samples)) * ||y - Xw||^2 + alpha * ||w||_1. When we are building our model we need to tune this alpha (this will be discussed in the notebook).
Note: The notebook will talk about tuning the parameter alpha and I will try to plot how it works.
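A quick sketch of the automatic feature selection (toy data of my own, assuming scikit-learn): with a large enough alpha, Lasso drives some coefficients to exactly zero.

# Lasso zeroing out irrelevant features.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                        # five candidate features
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 200)   # only two are useful

lasso = Lasso(alpha=0.05)
lasso.fit(X, y)
print("coefficients:", lasso.coef_)         # the irrelevant features end up at (or near) 0
print("non-zero features:", np.flatnonzero(lasso.coef_))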
Elastic Net :
In Elastic Net regularization, a linear combination of both penalties (L1 and L2) is added. Hence, the objective function becomes:
minimize Σ(y - ŷ)^2 + λ1 * Σ |β| + λ2 * Σ β^2
Note that L1 and L2 regularization are special cases of Elastic Net regularization.
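A minimal sketch (same toy data idea as above, assuming scikit-learn): in sklearn's ElasticNet, alpha scales the overall penalty and l1_ratio moves it between pure L2 (0) and pure L1 (1).

# Elastic Net: a mix of the L1 and L2 penalties.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.1, 200)

enet = ElasticNet(alpha=0.05, l1_ratio=0.5)   # halfway between L1 and L2
enet.fit(X, y)
print("coefficients:", enet.coef_)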
DT for Regression: It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node has two or more branches, each representing values for the attribute tested. A leaf node (e.g., Hours Played) represents a decision on the numerical target. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node.
Step 1: Standard Deviation
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). We use standard deviation to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero.
a) First we take the standard deviation of the target variable alone, without considering any features (SD of the target variable).
b) Then we take the standard deviation of the target for each value of a predictor (think of it as a group-by on the predictor followed by a standard deviation of the target within each group).
STEP 2: Standard Deviation Reduction
The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).
Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.
Step 4: The dataset is divided based on the values of the selected attribute. This process is run recursively on the non-leaf branches, until all data is processed.
In practice, we need some termination criteria. For example, we stop when the coefficient of variation (CV) for a branch becomes smaller than a certain threshold (e.g., 10%) and/or when too few instances (n) remain in the branch --- you can compare this with the DT classification problem and when to stop there: we normally stop when there is no further gain.
Repeat step 4 until you cannot split further or it is not worth splitting further.
When we reach a leaf node we calculate the average as the final value for that leaf node.
Remember this is regression, so we need to calculate errors like MSE, MAE, etc. to figure out how well the model is performing (see the small worked sketch of standard deviation reduction right after these steps).
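Here is a small worked sketch of steps 1 and 2 (toy numbers and column names of my own, loosely in the spirit of the saedsayad example below): SDR(attribute) = SD(target) minus the weighted SD of the target within each group of the attribute.

# Standard deviation reduction for one candidate attribute, using pandas.
import pandas as pd

df = pd.DataFrame({
    "Outlook": ["Sunny", "Sunny", "Rainy", "Rainy", "Overcast", "Overcast"],
    "HoursPlayed": [25, 30, 46, 45, 52, 23],
})

sd_target = df["HoursPlayed"].std(ddof=0)            # step 1a: SD of the target alone

grouped = df.groupby("Outlook")["HoursPlayed"]       # step 1b: group-by, then SD per group
weighted_sd = (grouped.std(ddof=0) * grouped.count() / len(df)).sum()

sdr = sd_target - weighted_sd                        # step 2: standard deviation reduction
print("SD(target)          :", round(sd_target, 3))
print("weighted SD by group:", round(weighted_sd, 3))
print("SDR for Outlook     :", round(sdr, 3))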
(http://www.saedsayad.com/decision_tree_reg.htm) refer to these links to get a better understanding
https://www.youtube.com/watch?v=nWuUahhK3Oc
https://www.youtube.com/watch?v=IQe2Icb1WKE
Important Note: Standard deviation is the square root of variance, so if someone talks about variance they are talking about the same thing.
sklearn uses variance in its implementation:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx
cdef class RegressionCriterion(Criterion):
    r"""Abstract regression criterion.

    This handles cases where the target is a continuous value, and is
    evaluated by computing the variance of the target values left and right
    of the split point. The computation takes linear time with `n_samples`
    by using ::

        var = \sum_i^n (y_i - y_bar) ** 2
            = (\sum_i^n y_i ** 2) - n_samples * y_bar ** 2
    """
Note: Explained Variance and all other model performance will be talked in notebook and
separate prep guide
Note: Here we are not using a majority vote, instead we are taking the average.
Note: Random Forest is a bagging technique. It only reduces variance.
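A minimal sketch of that note (toy data of my own, assuming scikit-learn): the random forest averages the predictions of many trees (bagging), which reduces variance compared to a single deep tree.

# Single tree vs. averaged forest for regression.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(300, 1)
y = np.sin(6 * X.ravel()) + rng.normal(0, 0.3, 300)

tree = DecisionTreeRegressor().fit(X, y)                         # low bias, high variance
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

x_new = np.array([[0.5]])
print("single tree :", tree.predict(x_new))
print("forest (avg):", forest.predict(x_new))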
SVM Regression :
A popular question if you are interviewing at companies like Amazon, MSFT, GOOG, etc. 😊
Intuitively, like all regressors it tries to fit a line to the data by minimizing a cost function. However, the interesting part about SVR is that you can deploy a non-linear kernel. In that case you end up doing non-linear regression, i.e. fitting a curve rather than a line.
This process is based on the kernel trick and the representation of the solution/model in the dual rather than in the primal. That is, the model is represented as combinations of the training points rather than a function of the features and some weights. At the same time the basic algorithm remains the same: the only real change in the process of going non-linear is the kernel function, which changes from a simple inner product to some non-linear function.
Support Vector Regression (SVR) uses the same principles as SVM for classification, with only a few minor differences. First of all, because the output is a real number with infinitely many possible values, we cannot demand exact predictions; instead, a margin of tolerance (epsilon) is set, and predictions that fall within epsilon of the true value are not penalized. This makes the algorithm somewhat more involved than the classification case, but the main idea is always the same: minimize the error by finding the hyperplane that maximizes the margin, keeping in mind that part of the error is tolerated.
Linear SVR
Non-linear SVR
The kernel functions transform the data into a higher dimensional feature space to make it possible to perform
linear separation.
Kernel functions
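A minimal sketch of linear vs. non-linear SVR (toy data of my own, assuming scikit-learn): the linear kernel fits a straight line, the RBF kernel fits a curve, and epsilon is the width of the tolerance tube.

# Linear-kernel vs. RBF-kernel support vector regression.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 100)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 100)

svr_linear = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)
svr_rbf = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)

x_new = np.array([[2.5]])
print("linear kernel:", svr_linear.predict(x_new))
print("rbf kernel   :", svr_rbf.predict(x_new))   # follows the sine curve much more closely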
Note: The SVM regressor can be both linear and non-linear – when the interviewer asks about Linear Regression, be careful.
SVM is a type of linear classifier when you use a linear kernel, that is, as long as you are not playing with other kernels (for example, as long as you are using liblinear).
https://link.springer.com/content/pdf/10.1023%2FA%3A1007670802811.pdf
Even if we force naive Bayes and tweak it a little bit for regression, the result is disappointing; a team experimented with this and achieved not-so-good results.
Relation to logistic regression: a naive Bayes classifier can be considered a way of fitting a probability model that optimizes the joint likelihood p(C, x), while logistic regression fits the same probability model to optimize the conditional p(C | x).
So now you have two choices: tweak the naive Bayes formula or use logistic regression.
https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Relation_to_logistic_regression
Don't get confused with Bayesian Ridge Regression: please update the sheet if someone asks about it in an interview.
Classification Techniques – I will put logistic regression there, but interviewers tend to ask about Logistic Regression as a regression technique to confuse you and then talk about linear models, so be cautious while answering.
Model evaluation Prep guide will cover all errors and Notebook will compare them
Happy Regression