
UNIT-III

Regression: Concepts, Blue property assumptions, Least Square Estimation, Variable Rationalization,
and Model Building etc.
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics applications
to various Business Domains etc.

Regression- Concepts:

• Regression analysis is a predictive modelling technique that investigates the relationship between a dependent variable and one or more independent variables.
• Linear regression models the linear relationship between the independent (predictor) variable, plotted on the X-axis, and the dependent (output) variable, plotted on the Y-axis.
• If there is a single input variable X (independent variable), such linear regression is called simple linear regression.

• A scatter plot of the data presents the linear relationship between the output (y) variable and the predictor (X) variable.
• The fitted line is referred to as the best-fit straight line.
• Based on the given data points, we attempt to plot a line that fits the points best.
• To calculate the best-fit line, linear regression uses the traditional slope-intercept form, which is given below:

Yi = β0 + β1Xi
where Yi = Dependent variable, β0 = constant/Intercept, β1 = Slope/Coefficient of X, Xi = Independent variable.

This algorithm explains the linear relationship between the dependent(output) variable y and the
independent(predictor) variable X using a straight line Y= B0 + B1 X.
• The goal of the linear regression algorithm is to get the best values for B0 and B1 to find the best
fit line.
• The best fit line is a line that has the least error which means the error between predicted values
and actual values should be minimum.

Random Error(Residuals):

• In regression, the difference between the observed value of the dependent variable (yi) and the predicted value (ypredicted) is called the residual: εi = yi – ypredicted

where ypredicted = B0 + B1Xi
• Mathematically, the best fit line is obtained by minimizing the Residual Sum of Squares (RSS): RSS = Σ εi² = Σ (yi – ypredicted)². A small NumPy sketch of this idea follows.
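The sketch below is not part of the original notes: it is a minimal, assumed Python/NumPy example (with made-up data) of computing B0 and B1 from the standard closed-form least-squares estimates and checking the residuals and RSS.

```python
# Minimal sketch: closed-form least-squares fit of a simple linear regression.
# The data values are hypothetical and used only for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# B1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2), B0 = mean(y) - B1 * mean(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_pred = b0 + b1 * x           # points on the best-fit line
residuals = y - y_pred         # εi = yi - ypredicted
rss = np.sum(residuals ** 2)   # the quantity least squares minimizes
print(b0, b1, rss)
```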

Cost function for Linear Regression:

• The cost function helps to work out the optimal values for B0 and B1, which provides the best fit
line for the data points.
• In Linear Regression, the Mean Squared Error (MSE) cost function is generally used, which is the average of the squared errors between ypredicted and yi.
• For the simple linear equation ypredicted = B0 + B1x, the MSE is:

MSE = (1/n) Σ (yi – (B0 + B1xi))²

• Using the MSE function, we’ll update the values of B0 and B1 such that the MSE value settles
at the minima.
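As an assumed illustration (not from the original notes), the sketch below updates B0 and B1 by gradient descent on the MSE until it settles near its minimum; the data values and learning rate are made up.

```python
# Minimal sketch: gradient descent on the MSE cost for simple linear regression.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b0, b1 = 0.0, 0.0   # arbitrary starting values
lr = 0.01           # learning rate (assumed)

for _ in range(5000):
    error = (b0 + b1 * x) - y          # ypredicted - yi for each point
    b0 -= lr * 2 * error.mean()        # gradient of MSE with respect to B0
    b1 -= lr * 2 * (error * x).mean()  # gradient of MSE with respect to B1

print(b0, b1)  # approaches the closed-form least-squares estimates
```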

Evaluation Metrics for Linear Regression:

1. Coefficient of Determination or R-Squared (R2)


2. Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)
Coefficient of Determination or R-Squared (R2):

• R-Squared is a number that explains the amount of variation that is explained/captured by the
developed model.
• It always ranges between 0 & 1.
• Overall, the higher the value of R-squared, the better the model fits the data.
• Mathematically it can be represented as,

R2 = 1 – ( RSS/TSS )

• Residual Sum of Squares (RSS) is defined as the sum of the squares of the residuals over all data points in the plot/data: RSS = Σ (yi – ypredicted,i)².
• It is a measure of the difference between the expected and the actual observed output.

• Total Sum of Squares (TSS) is defined as the sum of the squared deviations of the data points from the mean of the response variable.
• Mathematically, TSS = Σ (yi – ȳ)²,

where ȳ is the mean of the observed response values.
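A minimal illustrative sketch (with assumed observed and predicted values) of computing R-squared as 1 – (RSS/TSS):

```python
# Minimal sketch: R-squared from RSS and TSS for hypothetical values.
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])        # observed values (assumed)
y_pred = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # model predictions (assumed)

rss = np.sum((y - y_pred) ** 2)      # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)    # total sum of squares
r_squared = 1 - rss / tss
print(r_squared)
```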

• The significance of R-squared can be seen by comparing fitted lines with low and high R-squared values: the higher the R-squared, the more closely the fitted line follows the data.

Root Mean Squared Error (RMSE) and Residual Standard Error (RSE):

• The Root Mean Squared Error (RMSE) is the square root of the average of the squared residuals.
• It specifies the absolute fit of the model to the data, i.e. how close the observed data points are to the predicted values.
• Mathematically, RMSE = sqrt( Σ (yi – ypredicted,i)² / n ).
• To make this estimate unbiased, one has to divide the sum of the squared residuals by the degrees of freedom rather than the total number of data points in the model.
• This term is then called the Residual Standard Error (RSE).
• Mathematically, RSE = sqrt( Σ (yi – ypredicted,i)² / df ), where df = n – 2 for simple linear regression (more generally n – p – 1 for a model with p predictors).

• R-squared is a better comparative measure than RMSE, because the value of RMSE depends on the units of the variables (i.e. it is not a normalized measure) and can change when the units of the variables change.
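A minimal sketch (same assumed values as above) of computing RMSE and RSE:

```python
# Minimal sketch: RMSE and RSE for a simple linear regression (2 estimated
# parameters, so the degrees of freedom are n - 2). Values are hypothetical.
import numpy as np

y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
y_pred = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

n = len(y)
rss = np.sum((y - y_pred) ** 2)
rmse = np.sqrt(rss / n)        # divide by the number of data points
rse = np.sqrt(rss / (n - 2))   # divide by the degrees of freedom
print(rmse, rse)
```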

Assumptions of Linear Regression:

• Regression is a parametric approach, which means that it makes assumptions about the data for
the purpose of analysis.
• For successful regression analysis, it’s essential to validate the following assumptions.

1. Linearity of residuals: There needs to be a linear relationship between the dependent variable
and independent variable(s).

2. Independence of residuals:
• The error terms should not be dependent on one another (like in time-series data wherein
the next value is dependent on the previous one).
• There should be no correlation between the residual terms.
• The presence of such correlation between residuals is known as autocorrelation.
• There should not be any visible patterns in the error terms.
3. Normal distribution of residuals:
• The residuals should follow a normal distribution with a mean equal to zero or close to zero.
• This is done in order to check whether the selected line is actually the line of best fit or
not.
• If the error terms are non-normally distributed, it suggests that there are a few unusual data points that must be studied closely to build a better model.

4. Equal variance of residuals:
• The error terms must have constant variance. This phenomenon is known as Homoscedasticity.
• The presence of non-constant variance in the error terms is referred to as Heteroscedasticity.
• Generally, non-constant variance arises in the presence of outliers or extreme leverage values.
Multiple Linear Regression:

• Multiple linear regression is a technique to understand the relationship between a single dependent variable and multiple independent variables.
• The formulation for multiple linear regression is similar to simple linear regression, with the small change that instead of one beta coefficient, you now have betas for all the variables used. The formula is given as: Y = B0 + B1X1 + B2X2 + … + BpXp + ε
Considerations of Multiple Linear Regression:

All the four assumptions made for Simple Linear Regression still hold true for Multiple Linear
Regression along with a few new additional assumptions.

1. Overfitting: When more and more variables are added to a model, the model may become far
too complex and usually ends up memorizing all the data points in the training set. This
phenomenon is known as the overfitting of a model. This usually leads to high training accuracy
and very low test accuracy.
2. Multicollinearity: It is the phenomenon where, in a model with several independent variables, some of those variables are correlated with one another (a VIF check sketch follows this list).
3. Feature Selection: With more variables present, selecting the optimal set of predictors from the
pool of given features (many of which might be redundant) becomes an important task for
building a relevant and better model.
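The sketch below is an assumed illustration (not from the original notes) of one common multicollinearity diagnostic, the Variance Inflation Factor (VIF) from statsmodels; the feature values and the rule of thumb (a VIF above roughly 5 to 10 signals a problem) are given purely for illustration.

```python
# Minimal sketch: checking multicollinearity with Variance Inflation Factors.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical predictors: x2 is almost exactly 2 * x1, so they are collinear.
X = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [2.1, 4.0, 6.1, 8.2, 9.9, 12.1],
    "x3": [5.0, 3.0, 6.0, 2.0, 7.0, 4.0],
})

X_const = sm.add_constant(X)   # add an intercept column before computing VIFs
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns) if col != "const"}
print(vifs)   # large VIFs for x1 and x2 flag the interrelated variables
```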

Overfitting and Underfitting in Linear Regression:

• There have always been situations where a model performs well on training data but not on the
test data.
• While training models on a dataset, overfitting and underfitting are the most common problems people face.
• Before understanding overfitting and underfitting one must know about bias and variance.

Bias:

• Bias is a measure of how accurate the model is likely to be on future, unseen data.
• Complex models, assuming there is enough training data available, can make predictions accurately.
• Models that are too naive are very likely to perform badly in their predictions.
• Put simply, bias is the error arising from the simplifying assumptions the model makes about the training data.
• Generally, linear algorithms have a high bias, which makes them fast to learn and easier to understand, but in general less flexible.
• This implies lower predictive performance on complex problems, where the simple model fails to capture the expected outcomes.

Variance:

• Variance is the sensitivity of the model towards training data, that is it quantifies how much the
model will react when input data is changed.
• Ideally, the model shouldn’t change too much from one training dataset to the next, which means the algorithm is good at picking out the hidden underlying patterns between the inputs and the output variables.
• Ideally, a model should have lower variance which means that the model doesn’t change
drastically after changing the training data(it is generalizable).
• Having higher variance will make a model change drastically even on a small change in the
training dataset.

Bias Variance Tradeoff:

• The aim of any supervised machine learning algorithm is to achieve low bias and low variance, so that the model is more robust and achieves better performance.
• There is no escape from the relationship between bias and variance in machine learning.
• There is no escape from the relationship between bias and variance in machine learning.

• There is an inverse relationship between bias and variance:
o An increase in bias will decrease the variance.
o An increase in the variance will decrease the bias.
• There is a trade-off that plays between these two concepts and the algorithms must find a balance
between bias and variance.

Overfitting:

• When a model learns each and every pattern and the noise in the data to such an extent that it affects the performance of the model on the unseen future dataset, it is referred to as overfitting.
• The model fits the data so well that it interprets noise as patterns in the data.
• When a model has low bias and higher variance it ends up memorizing the data and causing
overfitting.
• Overfitting causes the model to become specific rather than generic.
• This usually leads to high training accuracy and very low test accuracy.
• Detecting overfitting is useful, but it doesn’t solve the actual problem. There are several ways to prevent overfitting, which are stated below (a brief sketch follows the list):

• Cross-validation
• If the training data is too small to train on, add more relevant and clean data.
• If the training data is too large, do some feature selection and remove unnecessary features.
• Regularization
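As an assumed illustration (not from the notes), the sketch below combines two of the ideas above, k-fold cross-validation and regularization, using scikit-learn's Ridge regression on made-up data.

```python
# Minimal sketch: cross-validated score of an L2-regularized (Ridge) model.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))              # hypothetical features
y = 3.0 * X[:, 0] + rng.normal(size=100)    # only the first feature matters

model = Ridge(alpha=1.0)                    # alpha controls regularization strength
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(scores.mean())                        # average R-squared across the 5 folds
```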

Underfitting:  Underfitting is not discussed as often as overfitting.

• When the model fails to learn from the training dataset and is also not able to generalize to the test dataset, it is referred to as underfitting.
• This type of problem can be very easily detected by the performance metrics.
• When a model has high bias and low variance it ends up not generalizing the data and causing
underfitting.
• It is unable to find the hidden underlying patterns from the data.
• This usually leads to low training accuracy and very low test accuracy.
• The ways to prevent underfitting are stated below,

• Increase the model complexity
• Increase the number of features in the training data
• Remove noise from the data.

Best Linear Unbiased Estimator (BLUE) Property Assumptions:

In simple linear regression or multiple linear regression, we make some basic assumptions on the error term. When these assumptions hold, the Gauss–Markov theorem guarantees that the least squares estimator is the Best Linear Unbiased Estimator (BLUE) of the regression coefficients.

Assumptions:

1. Errors have zero mean
2. Errors have constant variance
3. Errors are uncorrelated
4. Errors are normally distributed

Least Squares Estimation:

• In practice, of course, we have a collection of observations but we do not know the values of the
coefficients β0,β1,…,βk
• These need to be estimated from the data.
• The least squares principle provides a way of choosing the coefficients effectively by minimising
the sum of the squared errors.
• That is, we choose the values of β0, β1, …, βk that minimise the sum of squared errors: Σ εi² = Σ (yi – β0 – β1x1,i – … – βkxk,i)²

• This is called least squares estimation because it gives the least value for the sum of squared
errors.
• Finding the best estimates of the coefficients is often called “fitting” the model to the data.
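A minimal assumed sketch (hypothetical data) of least squares estimation for a model with two regressors, using NumPy's built-in least-squares solver:

```python
# Minimal sketch: least squares estimates of β0, β1, β2 via np.linalg.lstsq.
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)
y = 1.5 + 2.0 * x1 - 0.7 * x2 + rng.normal(scale=0.1, size=50)

# Design matrix: a column of ones for the intercept, then the regressors.
X = np.column_stack([np.ones_like(x1), x1, x2])
coeffs, rss, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(coeffs)   # values of β0, β1, β2 minimising the sum of squared errors
```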
Variable Rationalization and Model Building:
• In most practical problems, the analyst has a rather large pool of possible candidate
regressors(variables), of which only a few are likely to be important.
• Finding an appropriate subset of regressors(variables) for the model is often called the variable
selection problem or Variable Rationalization.
• The basic steps for variable selection are as follows:
1. Specify the maximum model to be considered.
2. Specify a criterion for selecting a model.
3. Specify a strategy for selecting variables.
4. Conduct the specified analysis.
5. Evaluate the Validity of the model chosen

Step 1: Specifying the maximum Model: The maximum model is defined to be the largest model (the
one having the most predictor variables) considered at any point in the process of model selection.
Step 2: Specifying a Criterion for Selecting a Model: There are several criteria that can be used to evaluate subset regression models, such as the F-test statistic, the coefficient of determination, the residual mean square, and Mallow's Cp statistic.
Step 3: Specifying a Strategy for Selecting Variables:

(a) All possible regression procedure: The all possible regression procedure requires that we fit
each possible regression equation associated with each possible combination of the k independent
variables.

(b) Backward Elimination Procedure: Backward elimination is one of several computer-based iterative variable-selection procedures. It begins with a model containing all the independent variables of interest. Then, at each step, the variable with the smallest F-statistic is deleted (if its F is not higher than the chosen cutoff level).

(c)Forward Selection Procedure: The procedure begins with the assumption that there are no
regressors in the model other than the intercept. Forward selection is a type of stepwise regression
which begins with an empty model and adds in variables one by one. In each forward step, you add the
one variable that gives the single best improvement to your model.

(d) Stepwise Regression Procedure: Stepwise regression is a modified version of forward regression that permits re-examination, at every step, of the variables incorporated in the model in previous steps. A variable that entered at an early stage may become superfluous at a later stage because of its relationship with other variables subsequently added to the model.
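As an assumed practical illustration (not part of the original notes), scikit-learn's SequentialFeatureSelector can approximate forward and backward selection using a scoring metric rather than F-statistics; the data below are made up.

```python
# Minimal sketch: forward and backward variable selection with scikit-learn.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 6))                      # six candidate regressors
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + rng.normal(size=80)

forward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward").fit(X, y)
backward = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="backward").fit(X, y)

print(forward.get_support())    # boolean mask of variables kept by forward selection
print(backward.get_support())   # boolean mask kept by backward elimination
```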

Step 4: Conduct the specified analysis:

Conduct the analysis using the strategies and criteria specified above, and choose the best model that fits the data.

Step 5: Evaluate the Validity of the model chosen:

Check the validity of the chosen model by using evaluation metrics such as R-squared, RMSE, RSS, and so on.
To select the best regression equation, carry out the following steps:

1) Fit the largest model possible to the data.

2) Perform a thorough analysis of this model.

3) Determine if a transformation of the response or some of the regressors is necessary.

4) Determine if all possible regression is feasible.

5) Compare and contrast the best models recommended by each criterion.

6) Perform thorough analyses of the “best models” (usually three to five models).

7) Explore the need for further transformations.

8) Discuss with the subject-matter experts the relative advantages and disadvantages of the final set of

models.

Logistic Regression- Model Theory:

• Logistic Regression is one of the most popular supervised learning techniques.


• It is used for predicting the categorical dependent variable using a given set of independent
variables.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value.
• It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values that lie between 0 and 1.
• Logistic Regression is similar to Linear Regression except in how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, a mouse is obese or not based on its weight, etc.

Logistic Function (Sigmoid Function):


• The sigmoid function is a mathematical function used to map the predicted values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit,
so it forms a curve like the "S" form. The S-form curve is called the Sigmoid function or the
logistic function.
• In logistic regression, we use the concept of a threshold value, which defines the boundary between predicting 0 or 1: values above the threshold tend toward 1, and values below the threshold tend toward 0.
Assumptions for Logistic Regression:

• The dependent variable must be categorical in nature.


• The independent variables should not exhibit multicollinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression equation. The
mathematical steps to get Logistic Regression equations are given below:

• We know the equation of the straight line can be written as: y = b0 + b1*x1 + b2*x2 + … + bn*xn

• In Logistic Regression y can be between 0 and 1 only, so let's divide the above equation by (1-y): y/(1-y), which is 0 for y = 0 and infinity for y = 1.

• But we need a range between -[infinity] and +[infinity], so taking the logarithm of the equation, it becomes: log(y/(1-y)) = b0 + b1*x1 + b2*x2 + … + bn*xn

• The standard logistic function is simply the inverse of the logit equation above. If we solve for
y from the logit equation, the formula of the logistic function is below:

y = 1/(1 + e^(-(b0 + b1*x1 + b2*x2 + … + bn*xn))) where e is the base of the natural
logarithms

• The logistic function is a type of sigmoid function.


sigmoid(h) = 1/(1 + e^(-h))
where h = b0 + b1*x1 + b2*x2 + … + bn*xn for logistic function.
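A minimal sketch of the sigmoid/logistic function described above; the coefficient and input values are assumed purely for illustration.

```python
# Minimal sketch: the sigmoid function mapping any real value to (0, 1).
import numpy as np

def sigmoid(h):
    # Map any real value h to a probability between 0 and 1.
    return 1.0 / (1.0 + np.exp(-h))

b = np.array([0.5, 2.0, -1.0])    # b0, b1, b2 (assumed coefficients)
x = np.array([1.0, 0.8, 0.3])     # 1 for the intercept, then x1, x2
h = np.dot(b, x)                  # h = b0 + b1*x1 + b2*x2
print(sigmoid(h))                 # predicted probability of the positive class
```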

Model Fit Statistic- Maximum Likelihood Estimation (The Best Fit Model):
• Like linear regression, the logistic regression algorithm finds the best values of coefficients (b0,
b1, …,bn) to fit the training dataset.
• The standard way to determine the best fit for logistic regression is maximum likelihood
estimation (MLE)
• In this estimation method, we use a likelihood function that measures how well a set of parameters
fit a sample of data.
• The parameter values that maximize the likelihood function are the maximum likelihood
estimates.
Model Construction:
Let’s see a simple example with the following dataset:
Observation #    Input x1    Binary Output y
0                0.5         0
1                1.0         0
2                0.65        0
3                0.75        1
4                1.2         1
With one input variable x1, the logistic regression formula becomes:
log(p/(1-p)) = w0 + w1*x1
or p = 1/(1 + e^(-(w0 + w1*x1)))

Since y is binary with values 0 or 1, a Bernoulli random variable can be used to model its probability:

P(y=1) = p P(y=0) = 1 – p

Or:

P(y) = (p^y)*(1-p)^(1-y) with y being either 0 or 1

This distribution formula is only for a single observation.

How do we model the distribution of multiple observations like P(y0, y1, y2, y3, y4)?

Let’s assume these observations are mutually independent from each other. Then we can write the
joint distribution of the training dataset as:

P(y0, y1, y2, y3, y4) = P(y0) * P(y1) * P(y2) * P(y3) * P(y4)

To make it more specific, each observed y has a different probability of being 1.


Let’s assume P(yi = 1) = pi for i = 0, 1, 2, 3, 4. Then we can rewrite the formula as below:

P(y0) * P(y1) * P(y2) * P(y3) * P(y4) = p0^(y0)*(1-p0)^(1-y0) * p1^(y1)*(1-p1)^(1-y1) * … * p4^(y4)*(1-p4)^(1-y4)

We can calculate the p estimate for each observation based on the logistic function formula:

• p0 = 1/(1 + e^(-(w0 + w1*0.5)))


• p1 = 1/(1 + e^(-(w0 + w1*1.0)))
• p2 = 1/(1 + e^(-(w0 + w1*0.65)))
• p3 = 1/(1 + e^(-(w0 + w1*0.75)))
• p4 = 1/(1 + e^(-(w0 + w1*1.2)))

We also have the values of the output variable y:

• y0 = 0
• y1 = 0
• y2 = 0
• y3 = 1
• y4 = 1
Log Likelihood Function in statistics:

So we have all the p0 – p4 and y0 – y4 values from the training dataset.

Our likelihood becomes a function of the parameters w0 and w1:

L(w0, w1) = p0^(y0)*(1-p0)^(1-y0) * p1^(y1)*(1-p1)^(1-y1) * … * p4^(y4)*(1-p4)^(1-y4)

The goal is to choose the values of w0 and w1 that result in the maximum likelihood based on the
training dataset.

Note that it’s computationally more convenient to optimize the log-likelihood function. Since the
natural logarithm is a strictly increasing function, the same w0 and w1 values that maximize L would
also maximize l = log(L).

So in statistics, we often try to maximize the function below:

l(w0, w1) = log(L(w0, w1)) = y0*log(p0) + (1-y0)*log(1-p0) + y1*log(p1) + (1-y1)*log(1-p1) + … + y4*log(p4) + (1-y4)*log(1-p4)

Cost Function (Cross Entropy Loss) :

In machine learning, we prefer the idea of minimizing cost/loss functions, so we often define the cost function as the negative of the average log-likelihood.

cost function = – avg(l(w0, w1)) = – 1/5 * l(w0, w1) = – 1/5 * (y0*log(p0) + (1-y0)*log(1-p0) +
y1*log(p1) + (1-y1)*log(1-p1) + … + (1-y4)*log(1-p4))
This is also called the average of the cross entropy loss.

Maximizing the (log) likelihood is the same as minimizing the cross entropy loss function.
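A minimal sketch (using the small example dataset above) of evaluating this average cross-entropy loss for trial values of w0 and w1:

```python
# Minimal sketch: average cross-entropy loss (negative average log-likelihood)
# for the 5-observation example dataset.
import numpy as np

x1 = np.array([0.5, 1.0, 0.65, 0.75, 1.2])
y = np.array([0, 0, 0, 1, 1])

def cross_entropy(w0, w1):
    p = 1.0 / (1.0 + np.exp(-(w0 + w1 * x1)))   # pi for each observation
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return -log_lik / len(y)                    # negative average log-likelihood

print(cross_entropy(0.0, 0.0))        # loss at an uninformative starting point
print(cross_entropy(-4.411, 4.759))   # loss at the estimates reported below
```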

Optimization Methods:

• Unlike OLS estimation for the linear regression, we don’t have a closed-form solution for the
MLE. But we do know that the cost function is convex, which means a local minimum is also the
global minimum.
• To minimize this cost function, Python libraries such as scikit-learn (sklearn) use numerical
methods similar to Gradient Descent. And since sklearn uses gradients to minimize the cost
function, it’s better to scale the input variables and/or use regularization to make the algorithm
more stable.

Model Interpretations:

By using the Logistic Regression algorithm in Python's sklearn, we find that the best estimates are w0 = -4.411 and w1 = 4.759 for our example dataset (a short sketch of this fit follows below).
We can plot the logistic regression with the sample dataset. As you can see, the output y only has two
values of 0 and 1, while the logistic function has an S shape.
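The sketch below shows one assumed way to reproduce such a fit with scikit-learn. Note that sklearn's LogisticRegression applies L2 regularization by default, so a very large C (i.e. weak regularization) is used to get close to the unregularized maximum likelihood estimates; the exact numbers may differ slightly from those quoted above.

```python
# Minimal sketch: fitting the example dataset with scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [0.65], [0.75], [1.2]])   # input x1
y = np.array([0, 0, 0, 1, 1])                         # binary output

model = LogisticRegression(C=1e6).fit(X, y)   # large C -> near-unregularized MLE
print(model.intercept_, model.coef_)          # estimates of w0 and w1
print(model.predict_proba([[0.9]]))           # [P(y=0), P(y=1)] for x1 = 0.9
```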

We can also make some interpretations with the parameter w1.

Recall that we have:

log(odds of y=1) = log(p/(1-p)) = w0 + w1*x1 where p = P(y = 1)

Since w1 = 4.759, with a one-unit increase of x1, the log odds is expected to increase by 4.759 as well.
How to use Logistic Regression Models to Predict?:

As mentioned earlier, we often use logistic regression models for predictions.

Given a new observation, how would we predict which class y = 0 or 1 it belongs to?

For example, say a new observation has input variable x1 = 0.9. Using the logistic regression equation estimated from MLE, we can calculate the probability p that it belongs to y = 1.

p = 1/(1 + e^(-(-4.411 + 4.759*0.9))) = 46.8%

If we use 50% as the threshold, we would predict that this observation is in class 0, since p < 50%.

Since the logistic regression has an S shape, the larger x1, the more likely the observation has class y =
1.
What’s the threshold of x1 for us to classify the observation as y = 1?

At the threshold of probability p=50%, the odds are p/(1-p) = 50%/50% = 1. So the log(odds) = log(1)
= 0.

Since log(odds) follows the linear regression equation, we have:

log(odds) = 0 = -4.411 + 4.759*x1

Solving for x1, we get 0.927. That’s the threshold of x1 for prediction, i.e., when x1 > 0.927, the
observation will be classified as y = 1.
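A minimal sketch verifying the prediction arithmetic above with the estimated coefficients:

```python
# Minimal sketch: checking the predicted probability at x1 = 0.9 and the
# decision threshold on x1 implied by w0 = -4.411 and w1 = 4.759.
import math

w0, w1 = -4.411, 4.759

def prob_y1(x1):
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x1)))

print(prob_y1(0.9))          # ~0.468 -> below 50%, so predict class 0
print(-w0 / w1)              # ~0.927, the x1 value where log(odds) = 0
print(prob_y1(0.95) > 0.5)   # True: beyond the threshold, class 1 is predicted
```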

Types of Logistic Regression: Logistic Regression can be classified into three types:

• Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
• Multinomial: In multinomial Logistic Regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep".
• Ordinal: In ordinal Logistic Regression, there can be 3 or more possible ordered types of the dependent variable, such as "Low", "Medium", or "High".

Applications of Logistic Regression:

• Fraud Detection
• Customer Churn Prediction
• Cancer diagnosis
Analytics applications to various Business Domains:

• Finance
BA is of utmost importance to the finance sector. Data Scientists are in high demand in investment
banking, portfolio management, financial planning, budgeting, forecasting, etc.
For example: Companies these days have a large amount of financial data. Use of intelligent
Business Analytics tools can help use this data to determine the products’ prices. Also, on the
basis of historical information Business Analysts can study the trends on the performance of a
particular stock and advise the client on whether to retain it or sell it.

• Marketing
Business Analytics helps in studying consumer buying patterns and behaviour, analysing trends, identifying the target audience, employing advertising techniques that appeal to consumers, forecasting supply requirements, etc.
For example: Use Business Analytics to gauge the effectiveness and impact of a marketing
strategy on the customers. Data can be used to build loyal customers by giving them exactly
what they want as per their specifications.
• HR Professionals
HR professionals can make use of data to find information about educational background of
high performing candidates, employee attrition rate, number of years of service of employees,
age, gender, etc. This information can play a pivotal role in the selection procedure of a
candidate.
For example: HR manager can predict the employee retention rate on the basis of data given by
Business Analytics.
• CRM
Business Analytics helps one analyse the key performance indicators, which further helps in decision making and in devising strategies to boost the relationship with the consumers. The
demographics, and data about other socio-economic factors, purchasing patterns, lifestyle, etc.,
are of prime importance to the CRM department.
For example: The company wants to improve its service in a particular geographical segment.
With data analytics, one can predict the customer’s preferences in that particular segment, what
appeals to them, and accordingly improve relations with customers.
• Manufacturing
Business Analytics can help you in supply chain management, inventory management, measuring performance against targets, risk mitigation plans, improving efficiency on the basis of product data, etc.
For example: The manager wants information on the performance of machinery that has been in use for the past 10 years. The historical data will help evaluate the performance of the machinery and decide whether the cost of maintaining it will exceed the cost of buying new machinery.
• Credit Card Companies
Credit card transactions of a customer can determine many factors: financial health, life style,
preferences of purchases, behavioral trends, etc.
For example: Credit card companies can help the retail sector by locating the target audience. Based on transaction reports, retail companies can predict consumers' choices, their spending patterns, and their preference for competitors' products.
