MODULE-3
Regression
Covariance
• Covariance is a measure of how much two random variables vary together. It’s similar to
variance, but where variance tells you how a single variable varies, covariance tells you
how two variables vary together.
• The sample covariance is Cov(x, y) = Σ (xi − x̄)(yi − ȳ) / (n − 1), where x̄ and ȳ are the sample means of x and y and n is the number of observations.
Correlation
• Correlation is a measure of association between two variables. Correlations can be positive or negative, ranging between +1 and −1.
• It describes the degree and type of relationship between any two or more quantities (variables) in which they vary together over a period;
• for example, variation in the level of expenditure or savings with variation in the level of
income.
• A positive correlation exists where the high values of one variable are associated with
the high values of the other variable(s).
• A negative correlation means that high values of one variable are associated with low values of the other(s).
• Values close to +1 indicate a high degree of positive correlation, and values close to -1
indicate a high degree of negative correlation.
• Values close to zero indicate poor correlation of either kind, and 0 indicates no
correlation at all.
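• As a quick illustration, the following R sketch (with assumed example vectors for income and savings, not real data) shows how covariance and correlation can be computed:
income  <- c(20, 30, 40, 50, 60)   # hypothetical income levels
savings <- c(2, 4, 5, 7, 8)        # hypothetical savings at those income levels
cov(income, savings)    # positive covariance: the two variables move together
cor(income, savings)    # close to +1: high degree of positive correlation
cor(income, -savings)   # close to -1: high degree of negative correlation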
Regression – Concepts
Introduction:
• The term regression is used to indicate the estimation or prediction of the average value
of one variable for a specified value of another variable.
• Regression analysis is a very widely used statistical tool to establish a relationship model
between two variables.
Regression Analysis:
– Linear Regression
– Logistic Regression
– Ridge Regression
– Lasso Regression
– Polynomial Regression
– Bayesian Linear Regression
1. Linear Regression
• Purpose: Predict a continuous dependent variable based on the linear relationship with
one or more independent variables.
• Key Assumption: There is a linear relationship between the dependent and independent
variables.
2. Logistic Regression
• Purpose: Predict the probability of a binary (yes/no) outcome from one or more independent variables.
• Key Assumption: The dependent variable is binary, and the log-odds of the outcome are linearly related to the independent variables.
3. Ridge Regression
• Purpose: A variant of linear regression that adds a regularization term to penalize large coefficients, reducing overfitting.
• Key Assumption: The predictors may be highly correlated (multicollinearity), and the
model is prone to overfitting.
• Example: Predicting car prices with many correlated variables like mileage, age, and
horsepower.
4. Lasso Regression
• Purpose: Similar to Ridge Regression but capable of driving some coefficients to zero, performing automatic feature selection.
• Example: Predicting sales with hundreds of potential features, where some are not
useful.
5. Polynomial Regression
• Purpose: Extends linear regression by fitting a polynomial equation (curved line) to the
data to model non-linear relationships.
• Key Assumption: The relationship between the dependent and independent variables is
non-linear.
6. Bayesian Linear Regression
• Purpose: Applies Bayesian inference to linear regression, estimating a distribution over the coefficients rather than single point estimates.
• Key Assumption: Parameters are treated as random variables with prior distributions, and predictions are made using a posterior distribution.
• Example: Predicting stock prices while incorporating prior beliefs and uncertainties
about the coefficients.
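• The following is a minimal R sketch of how some of these regression types might be fitted, using the built-in mtcars dataset; the choice of variables is purely illustrative:
data(mtcars)
lin_fit  <- lm(mpg ~ wt, data = mtcars)                   # 1. linear regression
log_fit  <- glm(am ~ wt + hp, data = mtcars,
                family = binomial)                        # 2. logistic regression (binary outcome am)
poly_fit <- lm(mpg ~ poly(wt, 2), data = mtcars)          # 5. polynomial (degree-2) regression
summary(lin_fit)$coefficients                             # estimated intercept and slope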
The simple linear regression model is `Y = β0 + β1X + ε`
Where Variables:
• Y = dependent (response) variable; X = independent (predictor) variable
Parameters:
β0= Intercept
• The y-intercept of a line is the point at which the line crosses the y axis. ( i.e. where the x
value equals 0)
β1= Slope
ε = residuals
• The residual value is a discrepancy between the actual and the predicted value. The
distance of the plotted points from the line gives the residual value.
[Scatter plots illustrating a positive relation and a negative relation between the variables]
Constructing a regression model for the following samples:
OLS Regression: Linear regression using Ordinary Least Squares approximation/estimation, based on the Gauss-Markov theorem.
• We can start off by estimating the value for B1 as B1 = Σ(xi − mean(x)) · (yi − mean(y)) / Σ(xi − mean(x))².
If we had multiple input attributes (e.g., x1, x2, x3, etc.), this would be called multiple linear regression. The procedure for simple linear regression is different from, and simpler than, that for multiple linear regression.
Calculating B1 and B0 using correlations and standard deviations:
B1 = cor(x, y) * sd(y) / sd(x)
B0 = mean(y) - B1 * mean(x)
Where cor(x, y) is the correlation between x and y, and sd(x), sd(y) are their standard deviations.
x <- c(1, 2, 4, 3, 5)
y <- c(1, 3, 3, 2, 5)
x; y
# [1] 1 2 4 3 5
# [1] 1 3 3 2 5
B1 <- cor(x, y) * sd(y) / sd(x)   # B1 = 0.8
B0 <- mean(y) - B1 * mean(x)      # B0 = 0.4
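• As a cross-check, the same coefficients can be obtained with R's built-in lm() function (a minimal sketch):
x <- c(1, 2, 4, 3, 5)
y <- c(1, 3, 3, 2, 5)
fit <- lm(y ~ x)                   # ordinary least squares fit
coef(fit)                          # intercept = 0.4, slope = 0.8
predict(fit, data.frame(x = 6))    # forecast y for a new value of x: 0.4 + 0.8*6 = 5.2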
• Linear regressions can be used in business to evaluate trends and make estimates or
forecasts.
• For example, if a company's sales have increased steadily every month for the past few years, conducting a linear analysis on the sales data, with monthly sales on the y-axis and time on the x-axis, would produce a line that depicts the upward trend in sales.
• After creating the trend line, the company could use the slope of the line to forecast sales
in future months.
• Linear regression can also be used to analyze the effect of pricing on consumer behavior.
• For example, if a company changes the price of a certain product several times, it can record the quantity it sells at each price level and then perform a linear regression with quantity sold as the dependent variable and price as the explanatory variable.
• The result would be a line that depicts the extent to which consumers reduce their
consumption of the product as prices increase, which could help guide future pricing
decisions.
• The results of such an analysis might guide important business decisions made to account
for risk.
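• As a concrete illustration of the trend-forecasting idea above, here is a minimal R sketch with made-up monthly sales figures (not real data):
month <- 1:12
sales <- c(100, 104, 109, 115, 118, 124, 130, 133, 139, 145, 149, 155)
trend <- lm(sales ~ month)                   # fit the upward sales trend
coef(trend)["month"]                         # slope = average increase in sales per month
predict(trend, data.frame(month = 13:15))    # forecast sales for the next three months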
• However, before we conduct linear regression, we must first make sure that four assumptions are met (they can be checked with the diagnostic plots sketched below):
– Linear relationship: the relationship between the independent and dependent variables is linear.
– Independence: the residuals are independent of one another.
– Homoscedasticity: the residuals have constant variance at every level of the independent variable(s).
– Normality: the residuals are approximately normally distributed.
• If one or more of these assumptions are violated, then the results of our linear regression
may be unreliable or even misleading.
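• A quick way to check these assumptions in R is to inspect the standard diagnostic plots of a fitted model (a sketch, reusing the illustrative `trend` fit from above):
par(mfrow = c(2, 2))   # 2x2 grid of diagnostic plots
plot(trend)            # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage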
– Unbiasedness:
– The bias of an estimator is the difference between its expected value and the true value of the parameter, i.e., Bias = E(estimate) − true value; an unbiased estimator has zero bias.
– The least-variance property matters most when it is combined with small bias.
– Efficient estimator:
• An estimator is efficient if, among all unbiased estimators, it has the smallest variance.
– Sufficient Estimator:
• An estimator is sufficient if it utilizes all the information of a sample about the True
parameter.
• Assumptions about the disturbance (error) term µ:
– Randomness of µ
– Mean of µ is zero
– Variance of µ is constant
• R-squared (the coefficient of determination) measures how much of the variability of the response data around its mean is explained by the model:
• 0% indicates that the model explains none of the variability of the response data around its mean.
• 100% indicates that the model explains all the variability of the response data around its
mean.
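• In R, this value is reported by summary(); a one-line sketch using the illustrative `trend` fit from above:
summary(trend)$r.squared   # proportion of the variability in sales explained by the model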
Variable Rationalization:
• The data set may have a large number of attributes. But some of those attributes can be
irrelevant or redundant.
• The goal of Variable Rationalization is to improve the Data Processing in an optimal way
through attribute subset selection.
• This process is to find a minimum set of attributes such that dropping of those irrelevant
attributes does not much affect the utility of data and the cost of data analysis could be
reduced.
• Mining on a reduced data set also makes the discovered pattern easier to understand.
• As part of data processing, we use the below methods of attribute subset selection, described one by one after this list:
– Stepwise Forward Selection
– Stepwise Backward Elimination
– Combination of Forward Selection and Backward Elimination
– Decision Tree Induction
• All the above methods are greedy approaches for attribute subset selection.
• Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set.
• The most relevant attributes are chosen (having the minimum p-value) and are added to the minimal set. In each iteration, one attribute is added to the reduced set.
• Stepwise Backward Elimination: Here all the attributes are considered in the initial set of attributes.
• In each iteration, one attribute is eliminated from the set of attributes whose p-value is higher than the significance level.
• Combination of Forward Selection and Backward Elimination: The stepwise forward selection and backward elimination methods are combined so as to select the relevant attributes most efficiently.
• Decision Tree Induction: This is the most common technique generally used for attribute selection.
• This approach uses a decision tree for attribute selection. It constructs a flow-chart-like structure with nodes denoting a test on an attribute.
• Each branch corresponds to the outcome of a test, and each leaf node denotes a class prediction.
• Any attribute that is not part of the tree is considered irrelevant and hence discarded.
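• A hedged R sketch of attribute subset selection using the built-in step() function on the mtcars data; note that step() selects by AIC rather than by p-values, so it only approximates the p-value-based procedures described above:
full  <- lm(mpg ~ ., data = mtcars)                       # start with all attributes
back  <- step(full, direction = "backward", trace = 0)    # stepwise backward elimination
empty <- lm(mpg ~ 1, data = mtcars)                       # intercept-only model
fwd   <- step(empty, scope = formula(full),
              direction = "forward", trace = 0)           # stepwise forward selection
both  <- step(empty, scope = formula(full),
              direction = "both", trace = 0)              # combined forward/backward
formula(back); formula(fwd); formula(both)                # the selected attribute subsets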
• Models are used to make predictions, uncover patterns, and derive insights from data.
Types of Models
Predictive Models:
• Purpose: To predict future outcomes based on historical data.
• Examples:
– Regression models: Estimate a continuous outcome (e.g., sales, prices).
– Classification models: Predict a categorical outcome (e.g., churn vs. no churn).
Descriptive Models:
• Purpose: To summarize historical data and describe what has happened.
• Examples:
– Clustering models: Group similar observations together.
– Association rule models: Discover relationships among items.
Prescriptive Models:
• Purpose: To recommend actions that lead to desired outcomes.
• Examples:
– Optimization models: Find the best decision under given constraints.
– Simulation Models: Analyze complex systems and their behaviors under various
scenarios.
Causal Models:
• Purpose: To identify cause-and-effect relationships between variables.
• Examples:
– A/B testing and experimental designs: Measure the effect of an intervention.
• Data Preprocessing: Clean and prepare the data for analysis, including handling missing
values, normalization, and encoding categorical variables.
• Model Selection: Choose the appropriate modeling technique based on the problem type.
• Model Evaluation: Assess model performance using relevant metrics (e.g., accuracy,
precision, recall, R-squared).
• Model Deployment: Implement the model in a production environment for practical use.
• The aim is to predict outcomes before adverse events actually occur, so that decisions can be taken in advance.
• The data science model-building life cycle includes the following important problem-solving steps:
– Problem Definition
– Hypothesis Generation
– Data Collection
– Data Exploration/Transformation
– Predictive Modelling
– Model Deployment
1. Problem Definition
• The first step in constructing a model is to understand the industrial problem in a more
comprehensive way.
• To identify the purpose of the problem and the prediction target, we must define the
project objectives appropriately.
2. Hypothesis Generation
• Hypothesis generation is the guessing approach through which we derive some essential
data parameters that have a significant correlation with the prediction target.
• Your hypothesis research must be in-depth, taking the perspective of every stakeholder into account.
• We search for every suitable factor that can influence the outcome.
• Hypothesis generation focuses on what you can create rather than what is available in the
dataset.
3. Data Collection
• Data collection is the process of gathering data from relevant sources for the analytical problem; we then extract meaningful insights from the data for prediction.
• The data gathered must have:
4. Data Exploration/Transformation
• It may contain unnecessary features, null values, unanticipated small values, or immense
values.
• So, before applying any algorithmic model to data, we have to explore it first.
• By inspecting the data, we get to understand the explicit and hidden trends in data.
• We find the relation between data features and the target variable.
• Usually, a data scientist invests 60–70% of project time dealing with data exploration alone.
1. Feature Identification:
– You need to analyze which data features are available and which ones are not.
2. Univariate Analysis:
• This kind of analysis depends on the variable type, whether it is categorical or continuous.
– Continuous variable: We mainly look for statistical trends like mean, median, standard deviation, skewness, and many more in the dataset.
– Categorical variable: We mainly look at the frequency distribution of each category.
3. Bi-variate / Multi-variate Analysis:
• This analysis helps to discover the relation between two or more variables.
• We can find the correlation in the case of continuous variables; in the case of categorical variables, we look for association and dissociation between them.
4. Missing Value Treatment:
• Usually, the dataset contains null values, which lower the potential of the model.
• For a continuous variable, we fill these null values using the mean or median of that specific column.
• For the null values present in a categorical column, we replace them with the most frequently occurring categorical value.
• Remember, don't delete those rows, because you may lose information; a simple imputation sketch follows below.
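• A minimal R sketch of this simple imputation, using a small made-up data frame (the column names are hypothetical):
df <- data.frame(
  age  = c(25, NA, 31, 40, NA),                      # continuous column with nulls
  city = factor(c("Hyd", "Pune", NA, "Hyd", "Hyd"))  # categorical column with a null
)
df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)   # fill with the column median
mode_level <- names(which.max(table(df$city)))          # most frequently occurring category
df$city[is.na(df$city)] <- mode_level
df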
5. Predictive Modeling
• Predictive modeling is a mathematical approach to create a statistical model to forecast
future behavior based on input test data.
Algorithm Selection:
• When we have a structured dataset and want to estimate a continuous or categorical outcome, we use supervised machine learning methodologies like regression and classification techniques.
• When we have unstructured data and want to predict the clusters of items to which a
particular input test sample belongs, we use unsupervised algorithms.
• In practice, a data scientist applies multiple algorithms to get a more accurate model.
Train Model:
• After assigning the algorithm and getting the data handy, we train our model using the
input data applying the preferred algorithm.
Model Prediction:
• We make predictions by giving the input test data to the trained model.
6. Model Deployment
• You constantly need to update the model with additional features for customer
satisfaction.
• To predict business decisions, plan market strategies, and create personalized customer
interests, we integrate the machine learning model into the existing production domain.
• For example, when you go through the Amazon website, you notice product recommendations based entirely on your interests.
• You can see the increase in customer involvement that results from such services.
• That is how a deployed model changes the mindset of the customer and convinces them to purchase the product.
Logistic Regression
• Regression models traditionally work with continuous numeric value data for dependent
and independent variables.
• Logistic regression models can, however, work with dependent variables with binary
values, such as whether a loan is approved (yes or no).
• For example, Logistic regression might be used to predict whether a patient has a given
disease (e.g. diabetes), based on observed characteristics of the patient (age, gender, body
mass index, results of blood tests, etc.).
• Logistic regression models use probability scores as the predicted values of the dependent variable.
• Logistic regression takes the natural logarithm of the odds of the dependent variable
being a case (referred to as the logit) to create a continuous criterion as a transformed
version of the dependent variable.
• Thus the logit transformation is used in logistic regression as the dependent variable.
• The net effect is that although the dependent variable in logistic regression is binomial
(or categorical, i.e. has only two possible values), the logit is the continuous function
upon which linear regression is conducted.
• Logistic Function: The key component in logistic regression is the logistic (or sigmoid)
function, which maps any real-valued input to a probability value between 0 and 1. The
sigmoid function is given by σ(z) = 1 / (1 + e^(−z)),
• where z = β0 + β1x1 + β2x2 + ⋯ + βnxn represents a linear combination of the predictor
variables (x1,x2,…,xn) and the model parameters (coefficients) β.
• S-Shape: The sigmoid function has an S-shaped curve, which smoothly transforms input
values into a range between 0 and 1.
• Logistic regression does not directly predict classes. Instead, it predicts the probability of
belonging to a particular class.
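• The S-shaped curve can be visualised directly in R, where plogis() is the built-in logistic (sigmoid) function 1 / (1 + exp(−z)):
curve(plogis(x), from = -6, to = 6,
      xlab = "z = b0 + b1*x1 + ... + bn*xn", ylab = "P(Y = 1)")   # S-shaped sigmoid curve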
`Y = βo + β1X + ∈`
• In Logistic Regression, we use the same equation but with some modifications made to Y, because the output must be a probability: it must always be positive and it must always be less than or equal to 1.
• Using the exponential form meets the above two criteria: the exponential of any value is always a positive number, and any number divided by that number + 1 will always be lower than 1. Implementing these two findings gives
`p = exp(βo + β1X) / (1 + exp(βo + β1X))`
• Now we are convinced that the probability value will always lie between 0 and 1.
• To determine the link function, follow the algebraic rearrangement: P(Y=1|X) can be read as "the probability that Y = 1 given some value for X". Solving for the linear term gives `log(p / (1 − p)) = βo + β1X`.
• The left side is known as the log-odds or odds ratio or logit function and is the link function for Logistic Regression.
• This link function corresponds to the sigmoid function shown earlier, which limits the range of probabilities between 0 and 1.
• In Multiple Regression, we use the Ordinary Least Square (OLS) method to determine
the best coefficients to attain good model fit.
• In Logistic Regression, we use maximum likelihood method to determine the best
coefficients and eventually a good model fit.
• Maximum likelihood works like this: It tries to find the value of coefficients (βo,β1) such
that the predicted probabilities are as close to the observed probabilities as possible.
• In other words, for a binary classification (1/0), maximum likelihood will try to find
values of βo and β1 such that the resultant probabilities are closest to either 1 or 0.
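• In R, maximum likelihood estimation is what glm() performs with family = binomial; a minimal sketch on the mtcars data (the choice of predictors is illustrative only):
log_fit <- glm(am ~ mpg + wt, data = mtcars, family = binomial)   # fitted by maximum likelihood
summary(log_fit)                                                  # estimated coefficients
head(predict(log_fit, type = "response"))                         # predicted probabilities in (0, 1)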
How can you evaluate Logistic Regression model fit and accuracy?
• In Linear Regression, we check adjusted R², F statistics, MAE, and RMSE to evaluate model fit and accuracy.
1. Akaike Information Criterion (AIC)
• In Logistic Regression, the analogous metric of adjusted R² is AIC: a measure of fit that penalizes the model for the number of coefficients, so we prefer the model with the minimum AIC value.
• In other words, merely adding more variables to the model wouldn't be rewarded by AIC unless they genuinely improve the fit.
2. Null and Residual Deviance
• The importance of deviance can be further understood using its types: null and residual deviance.
• Null deviance is calculated from the model with no features, i.e., only the intercept.
• Residual deviance is calculated from the model having all the features.
• The larger the difference between the null and residual deviance, the better the model.
• Also, you can use these metrics to compare multiple models: whichever model has a lower residual deviance explains the deviance better and is the better model.
• Practically, AIC is always given preference over deviance to evaluate model fit.
3. Confusion Matrix
• Confusion matrix is the most crucial metric commonly used to evaluate classification
models.
Accuracy –
It indicates, out of all the predictions, how many values have been correctly predicted. The formula is (TP + TN)/(TP + TN + FP + FN).
True Positive Rate (Sensitivity / Recall) –
It indicates how many positive values, out of all the positive values, have been correctly predicted.
• The formula to calculate the true positive rate is TP/(TP + FN). Also, TPR = 1 − False Negative Rate.
False Positive Rate –
It indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula is FP/(FP + TN).
True Negative Rate (Specificity) –
It indicates how many negative values, out of all the negative values, have been correctly predicted. The formula is TN/(TN + FP).
False Negative Rate –
It indicates how many positive values, out of all the positive values, have been incorrectly predicted. The formula is FN/(FN + TP).
Precision –
It indicates how many values, out of all the predicted positive values, are actually positive. The formula is TP/(TP + FP).
F Score:
The F score is the harmonic mean of precision and recall: F = 2 · (precision · recall) / (precision + recall). It lies between 0 and 1, with higher values indicating a better model.
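• A sketch of computing the confusion matrix and the above metrics in base R, continuing the illustrative mtcars fit from earlier and using an arbitrary 0.5 cutoff:
prob <- predict(log_fit, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
cm   <- table(Predicted = pred, Actual = mtcars$am)   # confusion matrix
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["1", "0"]; FN <- cm["0", "1"]
accuracy  <- (TP + TN) / sum(cm)
tpr       <- TP / (TP + FN)            # true positive rate (sensitivity / recall)
tnr       <- TN / (TN + FP)            # true negative rate (specificity)
precision <- TP / (TP + FP)
f_score   <- 2 * precision * tpr / (precision + tpr)
c(accuracy = accuracy, TPR = tpr, TNR = tnr, precision = precision, F = f_score)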
• ROC determines the accuracy of a classification model at a user defined threshold value.
• The area under the curve (AUC), also referred to as index of accuracy (A) or concordant
index, represents the performance of the ROC curve.
• ROC is plotted between True Positive Rate (Y axis) and False Positive Rate (X Axis).
• In this plot, our aim is to push the ROC curve toward the top-left corner and maximize the area under the curve.
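• A sketch of plotting the ROC curve and computing AUC with the pROC package (assumed to be installed), reusing the predicted probabilities from the sketch above:
library(pROC)
roc_obj <- roc(mtcars$am, prob)   # actual binary labels vs. predicted probabilities
auc(roc_obj)                      # area under the ROC curve
plot(roc_obj)                     # the ROC curve itself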
Data Collection:
• Gather and prepare the relevant dataset that includes the target variable and the predictor
variables.
Data Preprocessing:
• Encoding: Convert categorical variables into a suitable format (e.g., one-hot encoding).
• Feature Scaling: Although not always necessary for logistic regression, scaling can help
in interpreting coefficients.
Model Training:
• Fit the logistic regression model on the training data; the coefficients are estimated by maximum likelihood.
Model Evaluation:
– ROC Curve and AUC: To assess the trade-off between true positive rate and
false positive rate.
Model Interpretation:
• Interpret the coefficients to understand the impact of each predictor on the odds of the
outcome.
• A positive coefficient increases the odds, while a negative coefficient decreases them.
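• In R, exponentiating the coefficients turns them into odds ratios, which are often easier to explain (a sketch, reusing the illustrative fit from earlier):
exp(coef(log_fit))   # odds ratios: values > 1 increase the odds, values < 1 decrease them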
Model Validation:
• Validate the model on a hold-out test set or via cross-validation to check that it generalizes to unseen data.
Deployment:
• Deploy the model for real-time predictions or batch processing, integrating it into
existing systems.
• Regularly monitor the model’s performance and update it as needed to ensure it remains
accurate over time.
• Here are some key applications of logistic regression across various domains:
• Disease Prediction: Logistic regression is often used to assess the likelihood of diseases,
such as heart disease, diabetes, or cancer, based on patient data. For example, logistic
regression can analyze factors like age, blood pressure, cholesterol levels, and more to
predict if a patient is at risk.
• Patient Outcome Prediction: It can predict patient outcomes like survival rates or the
probability of hospital readmission based on previous history, demographics, and clinical
variables.
• Loan Default Prediction: Banks and financial institutions use logistic regression to
predict the likelihood of a customer defaulting on a loan. This model can use features like
credit score, income level, and employment history to estimate risk.
• Credit Scoring: Logistic regression helps classify individuals into different credit risk
categories (e.g., low-risk or high-risk borrowers) based on financial behavior and
historical data.
• Predicting Click-Through Rates (CTR): Logistic regression can predict the probability
of a user clicking on an ad based on features like time, device type, and past behavior.
This enables effective ad placements and marketing spend optimization.
5. Telecommunications
• Service Upgrade Prediction: Telecoms can use logistic regression to predict the
likelihood of a customer opting for service upgrades or additional features, allowing for
targeted offers and personalized recommendations.
• Sentiment Classification: Logistic regression is used to classify text data (like social
media posts or customer reviews) into sentiment categories, such as positive, neutral, or
negative. This is valuable for brand sentiment analysis.
• Fake News Detection: Logistic regression can classify articles as real or fake based on
language patterns, sources, and other textual features.
• User Behavior Prediction: Predicting the likelihood of a user engaging with or sharing
certain content can be helpful for optimizing content strategy and social media targeting.
9. Insurance Industry
• Spam Detection: Email providers use logistic regression to classify emails as spam or
not based on features like sender details, frequency of specific keywords, and attachment
types.
• Although logistic regression is widely used for solving various types of problems, its performance can fall short due to its various limitations, and other predictive models can provide better predictive results.
Pros
• The logistic regression model not only acts as a classification model, but also gives you
probabilities.
• This is a big advantage over other models where they can only provide the final
classification.
• Knowing that an instance has a 99% probability for a class compared to 51% makes a big
difference.
• We see that logistic regression is easy to implement and interpret, and very efficient to train.
Cons
• If there is a feature that would perfectly separate the two classes, the logistic regression
model can no longer be trained.
• This is because the weight for that feature would not converge, because the optimal
weight would be infinite.
• This is really a bit unfortunate, because such a feature is really very useful.
• But you do not need machine learning if you have a simple rule that separates both
classes.
• Logistic regression is less prone to overfitting, but it can overfit in high-dimensional datasets; in that case, regularization techniques should be considered to avoid overfitting, as in the sketch below.
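• A sketch of regularized logistic regression with the glmnet package (assumed to be installed); alpha = 0 gives a ridge (L2) penalty and alpha = 1 a lasso (L1) penalty:
library(glmnet)
x <- as.matrix(mtcars[, c("mpg", "wt", "hp", "disp")])    # predictors as a numeric matrix
y <- mtcars$am                                            # binary outcome
ridge <- glmnet(x, y, family = "binomial", alpha = 0)     # L2-penalized logistic regression
lasso <- glmnet(x, y, family = "binomial", alpha = 1)     # L1 penalty drives some coefficients to 0
cv    <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # cross-validation to pick the penalty
coef(cv, s = "lambda.min")                                # coefficients at the selected lambda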