Unit 3 Regression
Contents
Concepts, BLUE property assumptions, Linear regression, Logistic regression, Least Squares Estimation, Variable
Rationalization, Model Building, etc.
Logistic Regression: Binary, Multinomial regression, Model Theory, Model Fit Statistics, Maximum Likelihood
Estimation (MLE), Model Construction, Analytics applications to various Business Domains: Finance, marketing,
credit card companies
Regression Analysis
Suppose a company has recorded its advertising spend and the corresponding sales for previous
years. The company now wants to spend $200 on advertising in the year 2023 and wants a
prediction of its sales for this year. To solve such prediction problems in machine learning,
we need regression analysis.
• Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict a continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time-series modeling, and
determining cause-and-effect relationships between variables.
• In regression, we plot a graph between the variables that best fits the given datapoints;
using this plot, the machine learning model can make predictions about the data. In simple
words, "Regression shows a line or curve that passes through the datapoints on the target-
predictor graph in such a way that the vertical distance between the datapoints and the
regression line is minimum." The distance between the datapoints and the line tells whether
the model has captured a strong relationship or not.
Some examples of regression are:
• Prediction of rain using temperature and other factors
• Determining Market trends
• Prediction of road accidents due to rash driving.
Y = aX + b
Here,
Y = dependent variable (target variable), X = independent variable (predictor variable),
and a and b are the linear coefficients (a is the slope and b is the intercept).
• Simple linear regression is an approach for predicting a response using a single feature.
• It is assumed that the two variables are linearly related. Hence, we try to find a linear
function that predicts the response value (y) as accurately as possible as a function of the
feature or independent variable (x).
Let us consider a dataset where we have a value of the response y for every feature x.
• To determine how well our regression line fits the data, we calculate the
correlation coefficient, commonly referred to simply as R, and the coefficient of
determination, otherwise known as R² (R squared).
• The least-squares estimates of the coefficients in Y = aX + b are:
a = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²,  b = ȳ − a·x̄
Here x̄ is the mean of all the values in the input X and ȳ is the mean of all the
values in the desired output Y. This is the Least Squares method.
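A quick sketch of these formulas in code, assuming NumPy is available; the advertising-spend and sales figures below are hypothetical:

```python
# A minimal sketch of simple linear regression by least squares.
# The advertising spend (X) and sales (Y) figures are made up.
import numpy as np

X = np.array([100.0, 120.0, 150.0, 170.0, 180.0])        # advertising spend
Y = np.array([1000.0, 1150.0, 1400.0, 1550.0, 1650.0])   # sales

x_mean, y_mean = X.mean(), Y.mean()

# Least-squares estimates for Y = aX + b
a = np.sum((X - x_mean) * (Y - y_mean)) / np.sum((X - x_mean) ** 2)
b = y_mean - a * x_mean

# Coefficient of determination: R^2 = 1 - SSE / SST
Y_pred = a * X + b
sse = np.sum((Y - Y_pred) ** 2)   # sum of squared errors
sst = np.sum((Y - y_mean) ** 2)   # total sum of squares
r_squared = 1 - sse / sst

print(f"sales = {a:.2f} * spend + {b:.2f}, R^2 = {r_squared:.3f}")
print("Predicted sales for a $200 spend:", a * 200 + b)
```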
Summary
• The least-squares method is used to predict the behavior of the dependent variable with
respect to the independent variable.
• The sum of the squares of the errors is called the sum of squared errors (SSE); dividing it
by the number of observations gives the error variance.
• The main aim of the least-squares method is to minimize the sum of the squared errors.
Model Building Life Cycle in Data Analytics:
• The data science model-building life cycle consists of a sequence of problem-solving
steps. Let's understand each model-building step in depth. The following are the steps
to follow to build a data model:
1. Problem Definition
• The first step in constructing a model is to understand the industrial problem in a more
comprehensive way. To identify the purpose of the problem and the prediction target, we
must define the project objectives appropriately.
• Therefore, to proceed with an analytical approach, we have to recognize the obstacles
first. Remember, excellent results always depend on a better understanding of the
problem.
2. Hypothesis Generation
• Hypothesis generation is an educated-guessing approach through which we derive the essential
data parameters that have a significant correlation with the prediction target.
• Your hypothesis research must be in-depth, taking the perspectives of all
stakeholders into account. We search for every suitable factor that can influence the
outcome.
• Hypothesis generation focuses on what you can create rather than what is available in
the dataset.
3. Data Collection
• Data collection is the process of gathering data from relevant sources for the analytical
problem; we then extract meaningful insights from the data for prediction.
4. Data Exploration
• The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary
features, null values, unanticipated small values, or immense values. So, before applying
any algorithmic model to data, we have to explore it first.
• By inspecting the data, we get to understand the explicit and hidden trends in data. We
find the relation between data features and the target variable.
• Usually, a data scientist invests 60–70% of project time in data exploration
alone.
• There are several sub-steps involved in data exploration:
o Feature Identification:
• You need to analyze which data features are available and which ones are not.
• Identify independent and target variables.
• Identify data types and categories of these variables.
o Univariate Analysis:
• We inspect each variable one by one. The kind of analysis depends on whether the variable
type is categorical or continuous.
• Continuous variable: We mainly look for statistical trends like mean, median, standard
deviation, skewness, and many more in the dataset.
• Categorical variable: We use a frequency table to understand the spread of data for each
category. We can measure the counts and the frequency of occurrence of values.
o Bi-variate / Multi-variate Analysis:
• Bi-variate (and multi-variate) analysis helps to discover the relation between two or more variables.
• For continuous variables we can find the correlation; in the case of categorical
variables, we look for association and dissociation between them.
o Filling Null Values:
• Usually, the dataset contains null values, which lower the potential of the
model.
• For a continuous variable, we fill these null values using the mean or median of
that specific column.
• For the null values present in a categorical column, we replace them with the
most frequently occurring categorical value.
• Remember, don’t delete those rows because you may lose the information.
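A minimal sketch of this null-filling step, assuming pandas is available; the column names and values are hypothetical:

```python
# Fill null values: mean/median for continuous columns, mode for
# categorical columns, so that no rows (information) are lost.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [45000, 52000, np.nan, 61000, np.nan],       # continuous
    "city":   ["Pune", np.nan, "Mumbai", "Pune", "Pune"],  # categorical
})

# Continuous column: fill with the column mean (the median is a
# common alternative when the data are skewed)
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical column: fill with the most frequent value (the mode)
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```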
5. Predictive Modeling
• Algorithm Selection:
o When we have a structured dataset, and we want to estimate the continuous or
categorical outcome then we use supervised machine learning methodologies like
regression and classification techniques. When we have unstructured data and want to
predict the clusters of items to which a particular input test sample belongs, we use
unsupervised algorithms. In practice, a data scientist applies multiple algorithms to obtain a more
accurate model.
Train Model:
After selecting the algorithm and getting the data ready, we train our model by
applying the chosen algorithm to the input data. This step determines the
correspondence between the independent variables and the prediction targets.
Model Prediction
• We make predictions by giving the input test data to the trained model. We
measure performance using a cross-validation strategy or an ROC curve, both of
which work well for assessing model output on test data.
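A minimal sketch of this train/predict/evaluate loop, assuming scikit-learn is installed; the synthetic dataset stands in for real project data:

```python
# Train a model on training data, estimate generalization with
# cross-validation, then score it on held-out test data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression()

# Cross-validation on the training split (default scoring is R^2)
print("CV R^2:", cross_val_score(model, X_train, y_train, cv=5).mean())

# Train on the full training split, then predict on unseen test data
model.fit(X_train, y_train)
print("Test R^2:", r2_score(y_test, model.predict(X_test)))
```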
6. Model Deployment
• There is nothing better than deploying the model in a real-time environment. It helps
us to gain analytical insights into the decision-making procedure. You constantly
need to update the model with additional features for customer satisfaction.
• To predict business decisions, plan market strategies, and create personalized
customer interests, we integrate the machine learning model into the existing
production domain.
• When you browse the Amazon website, you notice product recommendations based
entirely on your interests. Customer engagement visibly increases through such
services. That is how a deployed model changes the mindset of the customer and
convinces them to purchase the product.
BLUE Property Assumptions
In data analytics, the term "BLUE" stands for "Best Linear Unbiased Estimator." BLUE properties
refer to the desirable characteristics of an estimator that make it optimal for estimating parameters
in a linear regression model. These properties are fundamental for statistical inference and
regression analysis. Here's a breakdown of what each component means:
• Best: The estimator is considered "best" because it has the smallest variance among all unbiased
estimators. In other words, it minimizes the spread or uncertainty of the estimated parameter
values.
• Linear: The estimator is a linear function of the observed data. This means that it can be
expressed as a linear combination of the independent variables.
• Unbiased: The estimator is unbiased if, on average, it produces estimates that are equal to the
true parameter values. In other words, there is no systematic overestimation or underestimation.
• Logistic regression is the appropriate regression analysis to conduct when the dependent
variable is dichotomous (binary). Like all regression analyses, logistic regression is a
predictive analysis. It is used to describe data and to explain the relationship between
one dependent binary variable and one or more nominal, ordinal, interval or ratio-level
independent variables.
1. Prepare the data: The data should be in a format where each row represents a single
observation and each column represents a different variable. The target variable (the
variable you want to predict) should be binary (yes/no, true/false, 0/1).
2. Train the model: We teach the model by showing it the training data. This involves
finding the values of the model parameters that minimize the error in the training data.
3. Evaluate the model: The model is evaluated on the held-out test data to assess its
performance on unseen data.
4. Use the model to make predictions: After the model has been trained and assessed, it
can be used to forecast outcomes on new data.
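The four steps above can be sketched as follows, assuming scikit-learn; the binary dataset here is synthetic and hypothetical:

```python
# Steps 1-4 for a binary logistic regression model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Prepare the data: rows are observations, the target is 0/1
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# 2. Train the model: fit() finds the parameter values that minimize
#    the error on the training data
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 3. Evaluate the model on the held-out test data
probs = clf.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print("ROC AUC :", roc_auc_score(y_test, probs))

# 4. Use the model to make predictions on new data
print("Predicted class for one new row:", clf.predict(X_test[:1])[0])
```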
Logistic Regression
• Logistic regression uses a sigmoid function, or logistic function, to map the output of
the linear model to a probability. This sigmoid function is used to model the data in
logistic regression. The function can be represented as:
f(x) = 1 / (1 + e⁻ˣ)
Logistic Regression: Model Theory
• Logistic regression is a technique used when the dependent variable is categorical (or
nominal). Examples: 1) Consumers make a decision to buy or not to buy, 2) a product may
pass or fail quality control, 3) there are good or poor credit risks, and 4) an employee may be
promoted or not.
• Binary logistic regression - determines the impact of multiple independent variables
presented simultaneously to predict membership of one or other of the two dependent
variable categories.
• Since the dependent variable is dichotomous we cannot predict a numerical value for it using
logistic regression so the usual regression least squares deviations criteria for the best fit
approach of minimizing error around the line of best fit is inappropriate (It’s impossible to
calculate deviations using binary variables!).
• Instead, logistic regression employs binomial probability theory in which there are only two
values to predict: that probability (p) is 1 rather than 0, i.e. the event/person belongs to one
group rather than the other.
• Logistic regression forms a best fitting equation or function using the maximum
likelihood (ML) method, which maximizes the probability of classifying the observed data
into the appropriate category given the regression coefficients.
• Like multiple regression, logistic regression provides a coefficient ‘b’, which measures
each independent variable’s partial contribution to variations in the dependent variable.
• The goal is to correctly predict the category of outcome for individual cases using the
most parsimonious model.
• To accomplish this goal, a model (i.e. an equation) is created that includes all predictor
variables that are useful in predicting the response variable.
The Purpose of Binary Logistic Regression
• The sigmoid function is a mathematical function that maps the predicted values to
probabilities.
• The sigmoid function maps any real value into another value within the range of 0 and 1,
forming an S-shaped curve.
• The output of logistic regression must lie between 0 and 1; since it cannot go beyond
this limit, it forms a curve like the letter "S".
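A small sketch of the sigmoid function, assuming NumPy:

```python
# The sigmoid squashes any real value into the open interval (0, 1),
# producing the characteristic S-shaped curve.
import numpy as np

def sigmoid(x):
    """Logistic function: f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # ~ [0.00005, 0.2689, 0.5, 0.7311, 0.99995]
```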
Hypothesis Test
• In logistic regression, we test the alternate hypothesis that the model currently under
consideration is accurate and differs significantly from the null model of zero coefficients,
i.e. that it gives a significantly better prediction than the chance or random prediction
level of the null hypothesis.
• The null hypothesis in logistic regression states that there is no relationship between the
independent variables and the outcome variable. In other words, it suggests that the
coefficients of the independent variables in the regression equation are all equal to zero.
This implies that the independent variables do not have any effect on predicting the
outcome variable, and any observed relationship is due to random chance.
• On the other hand, the alternate hypothesis in logistic regression asserts that the model
being evaluated is accurate and significantly different from the null hypothesis. This
means that the independent variables included in the model are meaningful predictors of
the outcome variable, and the model provides a better fit to the data than would be
expected by chance alone. In essence, the alternate hypothesis suggests that the
regression model has predictive power beyond random chance and is capable of making
meaningful predictions about the outcome variable based on the values of the
independent variables.
Model Statistics
Likelihood Ratio Test: This test compares how well our full model with all predictors fits the
data compared to a simpler model with fewer predictors. It helps us see if adding more
predictors significantly improves our model's fit.
Example: Let's say we have a logistic regression model predicting whether customers will
purchase a product based on their age, income, and education level. We compare this full
model to a simpler model that only includes age and income. If the likelihood ratio test
shows a significant improvement in fit for the full model over the reduced model, it
suggests that including education level improves our ability to predict purchases.
Deviance: Deviance tells us how far our model's predictions are from a perfect fit. A lower
deviance means our model fits the data better.
Example: Suppose we have a logistic regression model predicting whether patients will
develop a certain disease based on their health indicators. Lower deviance indicates that
our model's predicted probabilities are closer to the actual outcomes. For instance, a
deviance value of 1000 suggests better fit compared to a deviance value of 1500.
AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion): These are
measures that balance how well our model fits the data with how complex it is. Smaller AIC
and BIC values suggest better models.
Example: Continuing with the disease prediction example, if we have two competing
logistic regression models, one with five predictors and another with ten predictors, we can
compare their AIC and BIC values. If the model with five predictors has lower AIC and BIC
values compared to the ten-predictor model, it suggests that the simpler model is
preferable as it balances goodness of fit with model complexity.
Pseudo R-squared: Instead of using the traditional R-squared from linear regression,
logistic regression uses pseudo R-squared values like McFadden's R-squared or
Nagelkerke's R-squared. These values help us understand how much of the variability in the
data our model explains.
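A sketch tying these fit statistics together, assuming the statsmodels and SciPy packages; the purchase data is synthetic, and the predictor names (age, income, education) simply mirror the examples above:

```python
# Compare a full logistic model against a reduced one and report
# the LR test, deviance, AIC/BIC, and McFadden's pseudo R^2.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
n = 500
age, income, education = rng.normal(size=(3, n))
p = 1 / (1 + np.exp(-(0.8 * age + 0.5 * income + 0.6 * education)))
purchased = rng.binomial(1, p)   # synthetic binary outcome

full = sm.Logit(purchased, sm.add_constant(
    np.column_stack([age, income, education]))).fit(disp=0)
reduced = sm.Logit(purchased, sm.add_constant(
    np.column_stack([age, income]))).fit(disp=0)

# Likelihood ratio test: does adding education improve the fit?
lr_stat = 2 * (full.llf - reduced.llf)
print("LR stat:", lr_stat, "p-value:", stats.chi2.sf(lr_stat, df=1))

print("Deviance (-2 log-likelihood):", -2 * full.llf)
print("AIC:", full.aic, "BIC:", full.bic)
print("McFadden pseudo R^2:", full.prsquared)
```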
Model Construction
1. Data Collection and Preparation: Gather and preprocess your data, ensuring that
independent and dependent variables are correctly formatted.
2. Model Specification: Choose the appropriate independent variables based on domain
knowledge and exploratory data analysis.
3. Model Estimation: Use an algorithm (often maximum likelihood estimation; see the
sketch after this list) to estimate the coefficients of the logistic regression model.
4. Model Evaluation: Evaluate the model using fit statistics, cross-validation, or other
techniques to assess its performance and generalization ability.
5. Model Interpretation: Interpret the coefficients of the model to understand the
relationship between the independent variables and the log odds of the outcome.
6. Model Deployment: Deploy the model for making predictions on new data or for use in
decision-making processes.
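To make step 3 concrete, here is a from-scratch sketch of maximum likelihood estimation for logistic regression, assuming NumPy and SciPy; the data and the "true" coefficients are synthetic:

```python
# MLE: choose the coefficients that maximize the log-likelihood,
# i.e. minimize the negative log-likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(42)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_beta = np.array([-0.5, 1.0, -2.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))

def neg_log_likelihood(beta):
    """-log L = sum[log(1 + e^z) - y*z], where z = X @ beta."""
    z = X @ beta
    return np.sum(np.log1p(np.exp(z)) - y * z)

result = minimize(neg_log_likelihood, x0=np.zeros(3), method="BFGS")
print("MLE coefficients:", result.x)   # should be close to true_beta
```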
Multinomial Logistic Regression
• A multinomial logistic regression (or multinomial regression for short) is used when the
outcome variable being predicted is nominal and has more than two categories that do
not have a given rank or order.
• This model can be used with any number of independent variables that are categorical or
continuous.
What is MNLR?
• When it comes to multinomial logistic regression, the idea is to use the logistic
regression technique to predict the target class for more than 2 target classes.
• The underlying technique is the same as logistic regression for binary
classification, up to calculating the probabilities for each target class. Once the
probabilities are calculated, we transform the targets into one-hot encodings and use
the cross-entropy method in the training process to learn the proper weights.
Assumptions
1. Independence of observations
2. Categories of the outcome variable must be mutually exclusive and exhaustive
3. No multicollinearity between independent variables
4. Linear relationship between continuous variables and the logit transformation of
the outcome variable
5. No outliers or highly influential points
1. Mutually Exclusive: The categories or groups into which the variable is divided
should not overlap. In other words, each observation should only fall into one
and only one category. There should be no ambiguity or possibility of an
observation belonging to multiple categories simultaneously.
For example, if you're categorizing people by their education level into "High School
Graduate," "College Graduate," and "Postgraduate," these categories should be
mutually exclusive. A person cannot be both a "College Graduate" and a
"Postgraduate" simultaneously; they should fit into only one category.
2. Exhaustive: This implies that all possible outcomes or scenarios should be
covered by the categories defined within the variable. No residual category should
be needed to account for observations that do not fit into any defined categories.
For instance, if you're categorizing people by their employment status into
"Employed," "Unemployed," and "Student," these categories should be exhaustive.
Every person's employment status should fit into one of these categories, leaving
no one unaccounted for. There should not be any other potential employment
status that is not covered by these categories.
Multinomial Logistic Regression Workflow/ Stages:
• Inputs
• Linear model
• Logits
• Softmax Function
• Cross Entropy
• One-Hot-Encoding
Inputs
• The inputs to multinomial logistic regression are the features we have in the
dataset. Suppose we are going to predict the Iris flower species: the sepal length,
sepal width, petal length, and petal width will be our features. These features are
treated as the inputs to the multinomial logistic regression.
• The key point to remember here is that feature values must always be numerical. If
the features are not numerical, we need to convert them into numerical values using
proper categorical data analysis techniques.
• A simple example: if the feature is color, with attributes RED, BLUE, YELLOW, and
ORANGE, then we can assign an integer value to each attribute, such as 1 for RED and
2 for BLUE, and likewise for the other attributes. Later we can use the numerically
converted values as the inputs for the classifier, as shown in the sketch below.
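A minimal sketch of that color-encoding example, assuming pandas; the integer mapping mirrors the RED = 1, BLUE = 2 idea above (in practice, one-hot encoding is often preferred for nominal features, since integer codes imply an ordering):

```python
# Convert a categorical color feature into numerical values.
import pandas as pd

colors = pd.Series(["RED", "BLUE", "YELLOW", "ORANGE", "RED"])

# Simple integer (label) encoding, as described in the text
mapping = {"RED": 1, "BLUE": 2, "YELLOW": 3, "ORANGE": 4}
print(colors.map(mapping).tolist())   # [1, 2, 3, 4, 1]

# One-hot encoding: one binary column per color value
print(pd.get_dummies(colors))
```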
Linear Model
• The linear model equation is the same as the linear equation in the linear regression
model:
z = w1·x1 + w2·x2 + w3·x3
where X is the set of inputs, a matrix containing all the feature (numerical) values,
X = [x1, x2, x3], and W is another matrix containing the same number of weights,
W = [w1, w2, w3].
• In this example, the linear model output will be w1·x1 + w2·x2 + w3·x3.
• The weights w1, w2, w3 are updated in the training phase; this happens during
parameter optimization.
Logits
• The logits are also called scores. They are just the outputs of the linear
model. The logits change as the calculated weights change.
Softmax function
• The softmax function is a probabilistic function that calculates a probability for each
given score:
softmax(zᵢ) = e^(zᵢ) / Σⱼ e^(zⱼ)
The softmax function returns a high probability for the high scores and lower
probabilities for the remaining scores. For the logits 0.5, 1.5, 0.1, the probabilities
calculated using the softmax function are approximately 0.23, 0.62, 0.15.
• For the logit 1.5, we get a high probability value of about 0.62, and much lower
probability values for the remaining logits 0.5 and 0.1.
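A sketch of the linear-model-to-softmax pipeline, assuming NumPy; the weight matrix here is hypothetical and chosen so the logits reproduce the 0.5, 1.5, 0.1 example (note that with more than two classes, W holds one row of weights per class):

```python
# Linear model -> logits -> softmax probabilities.
import numpy as np

def softmax(z):
    """softmax(z_i) = e^(z_i) / sum_j e^(z_j)."""
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

x = np.array([1.0, 2.0, 0.5])        # features x1, x2, x3
W = np.array([[0.2, 0.1, 0.2],       # hypothetical weights,
              [0.5, 0.4, 0.4],       # one row per target class
              [0.0, 0.1, -0.2]])

logits = W @ x                        # scores from the linear model
print(logits)                         # [0.5, 1.5, 0.1]
print(softmax(logits))                # ~ [0.23, 0.62, 0.15]
```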
Cross Entropy
• Cross entropy is the last stage of multinomial logistic regression. The cross-
entropy function measures the distance between the probabilities calculated
from the softmax function and the target one-hot-encoded vector.
• Before we learn more about cross entropy, let's understand what is meant by a one-hot-
encoding matrix.
One hot encoding
• One-hot encoding is a method to represent the target values or categorical attributes in a
binary representation. For example, if the input is an image of a dog and the target has 3
possible outcomes (bird, dog, cat), the one-hot-encoded vector is [0, 1, 0].
• The one-hot-encoding matrix is simple to create: for each observation it contains the
value 1 at the position of the target class and 0 everywhere else. The total number of
values in the one-hot-encoded vector equals the number of unique target classes.
• Suppose we have 3 input features x1, x2, and x3 and one target variable with 3 target
classes. Then the one-hot-encoded vector will have 3 values; out of the 3 values, one will
be 1 and all others will be 0s.
• You know where to place the 1 and where to place the 0s from the training dataset. Take
one observation from the training dataset, with its values for x1, x2, x3 and its target
class: the one-hot-encoded vector has a 1 for that observation's target class and 0s for
the others.
Cross-entropy
• For a one-hot target vector y and softmax probabilities p, the cross-entropy loss is
H(y, p) = −Σᵢ yᵢ log(pᵢ). Minimizing it during training pushes the predicted probability
of the correct class toward 1.
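A minimal sketch of one-hot encoding together with the cross-entropy loss, assuming NumPy; the bird/dog/cat example mirrors the text above:

```python
# Cross-entropy between the one-hot target and softmax probabilities.
import numpy as np

classes = ["bird", "dog", "cat"]
target = "dog"

# One-hot vector: 1 at the target class position, 0 elsewhere
one_hot = np.array([1.0 if c == target else 0.0 for c in classes])
print(one_hot)                         # [0. 1. 0.]

# Hypothetical softmax probabilities predicted by the model
probs = np.array([0.23, 0.62, 0.15])

# H(y, p) = -sum_i y_i * log(p_i); lower means a better prediction
loss = -np.sum(one_hot * np.log(probs))
print("Cross-entropy loss:", loss)     # -log(0.62) ~ 0.478
```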