Unit 5: Business Analytics
Applications:
Marketing: Customer behavior prediction, lead segmentation and targeting of high-value prospects.
Retail: Personalized shopping, pricing optimization, inventory planning.
Manufacturing: Machine performance monitoring, equipment failure prevention and streamlined
logistics.
Finance: Fraud detection, credit scoring, risk assessment.
Healthcare: Patient care personalization, resource allocation and identification of high-risk patients
for timely interventions.
Simple Linear Regression
Simple linear regression is a statistical learning method used to examine or predict the
quantitative relationship between two continuous variables: an independent variable called the
predictor (X) and a dependent variable called the response (Y).
This method helps us model the linear relationship between the variables and make predictions,
assuming that the relationship between the independent variable X and the dependent variable Y is
approximately linear. Mathematically, we can write this linear relationship as:
y = β₀ + β₁x + ε
• y: Dependent (response) variable
• x: Independent (predictor) variable
• β₀: Intercept (value of y when x = 0)
• β₁: Slope (change in y for a one-unit change in x)
• ε: Error term (variation in y not explained by the model)
Key Use:
• Predict how Y changes with changes in X
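As a minimal sketch, such a model can be fitted in R with lm(); the data frame ads and the variables ad_spend and sales below are hypothetical:

# Hypothetical data: advertising spend (X) and sales (Y)
ads <- data.frame(
  ad_spend = c(10, 20, 30, 40, 50, 60),
  sales    = c(25, 38, 52, 61, 78, 86)
)

# Fit y = β₀ + β₁x + ε by ordinary least squares
fit <- lm(sales ~ ad_spend, data = ads)
summary(fit)   # estimated intercept (β₀) and slope (β₁), with p-values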
CONFIDENCE AND PREDICTION INTERVALS
In predictive analysis, confidence intervals and prediction intervals are two critical tools used
to quantify the uncertainty surrounding statistical estimates or predictions.
Both give a quantitative indication of the range in which the true value is expected to lie, yet
they serve different purposes. They play an important role in interpreting and evaluating a
regression model, providing insight into the accuracy of parameter estimates and the range within
which individual predictions are likely to fall.
Confidence Intervals (CI): Estimate the range within which the true mean of the dependent variable
(y) lies for a given value of the independent variable (x), at a given confidence level (typically 95%).
Key Points:
• Reflects uncertainty in the mean prediction.
• CI is narrower than Prediction Interval (PI), indicating greater precision.
Factors affecting CI:
• Sample size, Confidence level, Data variability, Model fit
Uses:
• Estimating precision
• Inferring population parameters
• Model validation
• Decision-making
Prediction Intervals: Predict the range within which an individual value of the dependent variable
(y) is likely to fall for a given x, at a certain confidence level (e.g., 95%).
Key Differences from CI:
• Includes residual error (ε)
• Wider than CI (due to individual variability)
Uses:
• Forecasting individual outcomes
• Quantifying uncertainty
• Informed decision-making
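To make the difference concrete, here is a minimal sketch in R, reusing the hypothetical fit model from the earlier regression example; predict() returns both interval types:

new_x <- data.frame(ad_spend = 35)

# CI: range for the MEAN of y at ad_spend = 35
predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# PI: range for an INDIVIDUAL y at ad_spend = 35 (wider, since it adds the residual error ε)
predict(fit, newdata = new_x, interval = "prediction", level = 0.95)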
Multiple Linear Regression
Multiple Linear Regression (MLR) is an extension of simple linear regression that models the
relationship between two or more independent variables and a dependent variable. In MLR, the
dependent variable is predicted using a linear combination of multiple independent variables. This
method is very helpful when we want to understand the influence of several independent factors on
a single outcome or target variable. The mathematical equation for MLR is:
y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε
Where:
• y: Dependent variable
• X₁, X₂, ..., Xₙ: Independent variables
• β₀: Intercept
• β₁, β₂, ..., βₙ: Coefficients
• ε: Error term
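As a sketch, an MLR model is fitted in R the same way as the simple case, with extra terms in the formula; the data frame d and its variables here are hypothetical:

# Hypothetical data: price and advertising spend jointly predicting sales
d <- data.frame(
  sales    = c(25, 38, 52, 61, 78, 86),
  price    = c(9.9, 9.5, 9.1, 8.8, 8.2, 8.0),
  ad_spend = c(10, 20, 30, 40, 50, 60)
)

mfit <- lm(sales ~ price + ad_spend, data = d)
summary(mfit)   # one coefficient per predictor, holding the others fixed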
Assumptions for a Valid MLR Model:
1. Linearity:
Relationship between dependent and independent variables must be linear.
2. Independence:
Observations must be independent of each other.
3. Homoscedasticity:
Constant variance of error terms (no heteroscedasticity).
4. Normality of Errors:
Residuals (errors) should be normally distributed.
5. No Multicollinearity:
Independent variables should not be highly correlated with each other.
Interpretation of Regression Coefficients
• Regression coefficients describe how each independent variable (predictor) affects the
dependent variable (outcome).
• The intercept (β₀) is the expected value of the dependent variable when all predictors are zero.
It may not always make real-world sense but mathematically defines the model's baseline.
• In simple linear regression, the coefficient (β₁) is the slope, showing how much the dependent
variable changes for a one-unit change in the independent variable.
o A negative coefficient indicates an inverse relationship.
o A positive coefficient shows a positive relationship.
o The magnitude of the coefficient indicates the strength of the relationship.
o A larger positive value means a stronger effect of x on y.
• In multiple linear regression (MLR), interpretation is more complex: each coefficient reflects
the effect of its variable after controlling for the other variables in the model.
Statistical Significance of Coefficients (P-value)
• Coefficients must be assessed in the context of statistical tests such as p-values to determine
how reliable they are.
• P-value indicates whether an independent variable (x) has a statistically significant relationship
with the dependent variable (y).
• Significance level is typically 0.05 (5%):
o p < 0.05 → Statistically significant: Strong evidence that x influences y.
o p ≥ 0.05 → Not statistically significant: Insufficient evidence that x affects y.
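In R, these p-values appear in the Pr(>|t|) column of the coefficient table; a small sketch, reusing the hypothetical mfit model from the MLR example:

summary(mfit)$coefficients                        # Estimate, Std. Error, t value, Pr(>|t|)
pvals <- summary(mfit)$coefficients[, "Pr(>|t|)"]
pvals < 0.05                                      # TRUE where evidence of influence is strong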
Heteroscedasticity
Heteroscedasticity refers to the situation in regression analysis where the variance of the residuals or
errors (i.e. the differences between observed and predicted values) is not constant across all levels of
the independent variable(s). In other words, as the value of the independent variable changes, the
spread or dispersion of the residuals also changes.
In a properly specified regression model, the residuals are expected to have constant variance, a
condition called homoscedasticity. When this condition is violated, heteroscedasticity occurs, which
interferes with the estimation of the standard errors of the coefficients, potentially impacting the
reliability of the model’s results. It can lead to incorrect conclusions about a predictor’s
significance, undermining the regression model’s validity.
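A common way to check for it, sketched below with the hypothetical mfit model from earlier (and assuming the lmtest package is installed), is to plot residuals against fitted values and run a Breusch-Pagan test:

plot(fitted(mfit), resid(mfit),
     xlab = "Fitted values", ylab = "Residuals")  # a funnel shape suggests heteroscedasticity
abline(h = 0, lty = 2)

library(lmtest)
bptest(mfit)   # Breusch-Pagan test: a small p-value indicates heteroscedasticity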
Multicollinearity
Multicollinearity occurs when two or more independent variables are highly correlated. This gives
redundant information which makes it difficult to determine each predictor’s unique effect on the
dependent variable, reducing their statistical significance and leading to unstable coefficient
estimates.
It can cause large variations in coefficient estimates with small changes in the data, making the
model less reliable.
The Variance Inflation Factor (VIF) is a diagnostic measure for multicollinearity. High VIF values,
usually above 10, indicate strong collinearity between independent variables.
• VIF = 1: No multicollinearity.
• 1 < VIF ≤ 5: Moderate multicollinearity (acceptable in most cases).
• VIF > 5: High multicollinearity, which may distort the model.
• VIF > 10: Extreme multicollinearity, requiring corrective measures.
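In R, VIF values can be computed with vif() from the car package (assuming it is installed), sketched here on the hypothetical mfit model from earlier:

library(car)
vif(mfit)   # one VIF per predictor; values above 5 (or 10) flag multicollinearity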
Reducing Multicollinearity
Reducing multicollinearity in a regression model is essential to improve the stability and
interpretability of the coefficients.
Here are some common strategies to address multicollinearity:
1. Remove Highly Correlated Predictors:
Identify pairs of predictors with high correlation (using a correlation matrix). Remove one of the
correlated variables.
2. Combine Predictors:
Combine correlated variables into a single predictor using techniques like principal component
analysis (PCA) or by creating an index (see the sketch after this list).
3. Centering Variables:
Subtract the mean from each predictor to create mean-centered variables.
4. Increase Sample Size:
Multicollinearity effects are less pronounced in larger datasets because coefficients stabilize
with more observations.
5. Variance Inflation Factor (VIF):
Compute the VIF for each predictor. Remove or adjust variables with high VIF values (>5 or >10).
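As an illustration of strategy 2, base R’s prcomp() can combine correlated predictors into principal components; a sketch reusing the hypothetical d data frame from the MLR example:

# Replace two correlated predictors with their first principal component
pcs   <- prcomp(d[, c("price", "ad_spend")], center = TRUE, scale. = TRUE)
d$pc1 <- pcs$x[, 1]                  # scores on the first component
pfit  <- lm(sales ~ pc1, data = d)   # refit using the combined predictor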
Textual Analysis
Textual Analysis refers to the process of extracting useful information and patterns from text
data like product reviews, social media posts, emails, or documents. It is commonly applied in
tasks such as sentiment analysis, keyword extraction, and text classification.
Since text is unstructured, it first needs to be cleaned and converted into a structured form so
that statistical or machine learning techniques can be applied.
Text Mining
Text mining is the process of extracting useful information and knowledge from unstructured text
data. It helps to uncover patterns, trends, and relationships in large text collections like books,
reviews, articles, or social media.
Key Steps:
• Text Preprocessing: Cleaning the text (normalizing case, removing punctuation, stop words,
numbers, special characters, etc.)
• Tokenization: Breaking text into words or phrases.
• Word Frequency Analysis: Identifying most common words.
• Advanced Methods: Like topic modeling and clustering to group and understand content better.
It is useful to draw insights and understand deeper meanings from text data (e.g., what customers
commonly complain about).
R has an extensive list of libraries such as tm and tidytext that make the process of text mining easier.
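A minimal sketch of these steps with tidytext and dplyr (both assumed installed; the reviews data frame is hypothetical):

library(dplyr)
library(tidytext)

reviews <- data.frame(
  id   = 1:3,
  text = c("Great battery life", "Battery drains too fast",
           "Fast delivery and a great price")
)

reviews %>%
  unnest_tokens(word, text) %>%            # tokenization: one lowercased word per row
  anti_join(stop_words, by = "word") %>%   # remove common stop words
  count(word, sort = TRUE)                 # word frequency analysis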
Categorization
It refers to the process of assigning text to predefined categories or labels based on its content.
This method is applied in many applications, including email filtering (spam vs. non-spam), document
classification (business, sports, tech), and sentiment analysis (positive, negative, neutral).
Categorization techniques involve supervised learning models such as Naive Bayes, Support Vector
Machines (SVM), and Logistic Regression. Such models require labeled training data to learn how to
classify new, unseen data.
Once trained, the model can predict the category of a new document based on the patterns learned
from the training set. In R, one can perform text categorization by creating a Document-Term Matrix
(DTM) and using classification models such as Naive Bayes.
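A hedged sketch of that workflow with the tm and e1071 packages (both assumed installed; the tiny labelled corpus is hypothetical):

library(tm)
library(e1071)

docs   <- c("win cash prize now", "meeting agenda attached",
            "claim your free prize", "project status report")
labels <- factor(c("spam", "ham", "spam", "ham"))

dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))  # rows = documents, columns = terms

# Recode term counts as present/absent factors so Naive Bayes treats them as categorical
X   <- as.data.frame(as.matrix(dtm) > 0)
X[] <- lapply(X, factor, levels = c(FALSE, TRUE))

model <- naiveBayes(X, labels, laplace = 1)   # train on the labelled documents
predict(model, X)                             # classify (here, the training docs themselves)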
Sentiment Analysis
Sentiment analysis is the process of determining the emotional tone or sentiment behind a piece of
text. The aim is to classify text as expressing a positive, negative, or neutral sentiment.
This technique is widely used for analyzing customer feedback, product reviews, social media posts,
and other forms of text to measure public opinion or sentiment about a particular topic.
There are two main approaches in sentiment analysis:
• Lexicon-based Methods: These use pre-defined dictionaries of words with positive, negative, or
neutral sentiments.
Example: "happy" = positive, "angry" = negative.
The text is scanned for these words and the overall sentiment is calculated.
• Machine Learning-based Approaches: These involve training a model on labelled text data where
the sentiment is known in advance, and then applying that model to classify new text.
Techniques like Naive Bayes, Support Vector Machines, and deep learning can be applied here.
R provides libraries like syuzhet, tidytext, and sentimentr to perform sentiment analysis.
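For instance, a lexicon-based score can be computed with syuzhet’s get_sentiment() (assuming the package is installed; the example sentences are hypothetical):

library(syuzhet)

texts <- c("I am very happy with this product",
           "This was a terrible, disappointing purchase")

get_sentiment(texts, method = "syuzhet")   # positive score = positive tone, negative = negative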