Unit 5: Predictive and Textual Analytics

• Simple Linear Regression models
• Confidence & Prediction intervals
• Multiple Linear Regression
• Interpretation of Regression Coefficients; heteroscedasticity; multicollinearity
• Basics of textual data analysis: significance, applications, and challenges
• Introduction to Textual Analysis using R
• Methods and Techniques of textual analysis: Text Mining, Categorization, and Sentiment Analysis
Predictive Analytics
Predictive analytics is the process of making informed predictions about future events based on historical data. It uses techniques such as data mining, statistical modeling, machine learning, and artificial intelligence. It helps businesses make smarter, data-driven decisions and prepare for future risks and opportunities.

Applications:
Marketing: Customer behavior prediction, lead segmentation and targeting of high-value prospects.
Retail: Personalized shopping, pricing optimization, inventory planning.
Manufacturing: Machine performance monitoring, equipment failure prevention and smooth
logistics.
Finance: Fraud detection, credit scoring, risk assessment.
Healthcare: Patient care personalization, resource allocation and identification of high-risk patients
for timely interventions.
Simple Linear Regression
Simple linear regression is a statistical learning method used to examine or predict the quantitative relationship between two continuous variables: an independent variable called the predictor (X) and a dependent variable called the response (Y).
This method models the linear relationship between the variables and makes predictions, assuming that the relationship between the independent variable X and the dependent variable Y is approximately linear. Mathematically, this linear relationship is written as:
y = β₀ + β₁x + ε
• y: Dependent (response) variable
• x: Independent (predictor) variable
• β₀: Intercept (value of y when x = 0)
• β₁: Slope (change in y for a one-unit change in x)
• ε: Error term (variation in y not explained by the model)
Key Use:
• Predict how Y changes with changes in X
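As an illustrative sketch, such a model can be fitted in R with the built-in lm() function; the example below uses R's built-in cars dataset (an assumed, illustrative choice) to predict stopping distance from speed.

# Fit a simple linear regression on R's built-in 'cars' dataset:
# response y = stopping distance (dist), predictor x = speed
model <- lm(dist ~ speed, data = cars)

# Estimated intercept (β₀) and slope (β₁)
coef(model)

# Full output: estimates, standard errors, t-statistics, R-squared
summary(model)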
Confidence and Prediction Intervals
In predictive analysis, confidence intervals and prediction intervals are two critical tools used to quantify the uncertainty surrounding statistical estimates or predictions.
Both give a quantitative indication of the range in which the true value is expected to lie, yet they serve different purposes. They play an important role in interpreting and evaluating the regression model, providing insight into the accuracy of parameter estimates and the range within which individual predictions are likely to fall.
Confidence Intervals (CI): Estimate the range within which the true mean of the dependent variable (y) lies for a given value of the independent variable (x), at a given confidence level (typically 95%).
Key Points:
• Reflects uncertainty in the mean prediction.
• CI is narrower than Prediction Interval (PI), indicating greater precision.
Factors affecting CI:
• Sample size, confidence level, data variability, and model fit
Uses:
• Estimating precision
• Inferring population parameters
• Model validation
• Decision-making

Prediction Intervals (PI): Predict the range within which an individual value of the dependent variable (y) is likely to fall for a given x, at a certain confidence level (e.g., 95%).
Key Differences from CI:
• Includes residual error (ε)
• Wider than CI (due to individual variability)
Uses:
• Forecasting individual outcomes
• Quantifying uncertainty and informing decision-making
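As a sketch of the difference, R's predict() can return either interval for a fitted lm model; the example continues with the illustrative cars model from above.

# Model fitted on R's built-in 'cars' dataset
model <- lm(dist ~ speed, data = cars)
new_x <- data.frame(speed = 15)

# Confidence interval: range for the MEAN stopping distance at speed = 15
predict(model, newdata = new_x, interval = "confidence", level = 0.95)

# Prediction interval: range for an INDIVIDUAL car's stopping distance;
# wider than the CI because it also includes the residual error (ε)
predict(model, newdata = new_x, interval = "prediction", level = 0.95)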
Multiple Linear Regression
Multiple Linear Regression (MLR) is an extension of simple linear regression that models the relationship between two or more independent variables and a dependent variable. In MLR, the dependent variable is predicted using a linear combination of multiple independent variables. This method is helpful when we want to understand the influence of several independent factors on a single outcome or target variable. The mathematical equation for MLR is:

y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε
Where:
• y: Dependent variable
• X₁, X₂, ..., Xₙ: Independent variables
• β₀: Intercept
• β₁, β₂, ..., βₙ: Coefficients
• ε: Error term
Assumptions for a Valid MLR Model:
1. Linearity:
Relationship between dependent and independent variables must be linear.
2. Independence:
Observations must be independent of each other.
3. Homoscedasticity:
Constant variance of error terms (no heteroscedasticity).
4. Normality of Errors:
Residuals (errors) should be normally distributed.
5. No Multicollinearity:
Independent variables should not be highly correlated with each other.
Interpretation of Regression Coefficients
• Regression coefficients describe how each independent variable (predictor) affects the
dependent variable (outcome).
• The intercept (β₀) is the expected value of the dependent variable when all predictors are zero.
It may not always make real-world sense but mathematically defines the model's baseline.
• In simple linear regression, the coefficient (β₁) is the slope, showing how much the dependent
variable changes for a one-unit change in the independent variable.
o A negative coefficient indicates an inverse relationship.
o A positive coefficient shows a positive relationship.
o The magnitude of the coefficient indicates the strength of the relationship.
o A larger positive value means a stronger effect of x on y.
• In multiple linear regression (MLR), interpretation is more complex: each coefficient reflects the effect of its variable after controlling for the other variables in the model.
Statistical Significance of Coefficients (P-value)
• Coefficients must be assessed alongside statistical tests such as p-values to determine how reliable they are.
• P-value indicates whether an independent variable (x) has a statistically significant relationship
with the dependent variable (y).
• Significance level is typically 0.05 (5%):
o p < 0.05 → Statistically significant: Strong evidence that x influences y.
o p ≥ 0.05 → Not statistically significant: Insufficient evidence that x affects y.
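As a sketch, an MLR model fitted with lm() in R reports each coefficient together with its p-value; the example uses R's built-in mtcars dataset (an illustrative choice).

# Model fuel efficiency (mpg) from weight (wt) and horsepower (hp)
model <- lm(mpg ~ wt + hp, data = mtcars)

# The Coefficients table shows each estimate with its standard error,
# t-statistic, and p-value (Pr(>|t|)); p < 0.05 marks a predictor as
# statistically significant after controlling for the other predictor
summary(model)

# 95% confidence intervals for the coefficient estimates themselves
confint(model)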

Heteroscedasticity
Heteroscedasticity refers to the situation in regression analysis where the variance of the residuals or errors (i.e., the differences between observed and predicted values) is not constant across all levels of the independent variable(s). In other words, as the value of the independent variable changes, the spread or dispersion of the residuals also changes.
In a properly specified regression model, the residuals are expected to have constant variance, a condition called homoscedasticity. When this condition is violated, heteroscedasticity occurs. It interferes with the estimation of the standard errors of the coefficients, potentially impacting the reliability of the model's results, and can lead to incorrect conclusions about a predictor's significance, undermining the regression model's validity.

Visual detection of heteroscedasticity
Heteroscedasticity can be detected visually by examining the residuals of a regression model, since patterns in residual plots may indicate it. A scatter plot of residuals versus predicted values often reveals heteroscedasticity: if the residuals form a funnel-shaped pattern (narrow at one end and wider at the other), heteroscedasticity is likely present.
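A minimal sketch of such a residual plot in base R, using the illustrative cars model from earlier:

model <- lm(dist ~ speed, data = cars)

# Residuals vs fitted values: a funnel shape (spread growing with the
# fitted value) suggests heteroscedasticity
plot(fitted(model), resid(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, lty = 2)    # reference line at zero residual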
Method to address heteroscedasticity in regression models:
One common method to address heteroscedasticity in regression models is to apply a logarithmic
transformation or other similar transformations to the dependent variable, the independent
variable(s), or both.
The logarithmic transformation is a simple and effective method to address heteroscedasticity,
helping stabilize variance and improve the reliability of regression results. It is particularly useful
when dealing with variables that have large ranges or exponential growth patterns.
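A minimal sketch of the transformation in R, assuming the dependent variable is strictly positive (as stopping distance is in the illustrative cars data):

# Refit with a log-transformed response
model_log <- lm(log(dist) ~ speed, data = cars)

# Re-check the residual spread after the transformation
plot(fitted(model_log), resid(model_log),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)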

Multicollinearity
Multi-collinearity occurs when two or more independent variables are highly correlated. This gives
redundant information which makes it difficult to determine each predictor’s unique effect on the
dependent variable, reducing their statistical significance and leading to unstable coefficient
estimates.
It can cause large variations in coefficient estimates with small changes in the data, making the
model less reliable.
The Variance Inflation Factor (VIF) is a diagnostic measure for multi-collinearity. High VIF values, usually above 10, indicate strong collinearity between independent variables.
• VIF = 1: No multi-collinearity.
• 1 < VIF ≤ 5: Moderate multi-collinearity (acceptable in most cases).
• VIF > 5: High multi-collinearity, which may distort the model.
• VIF > 10: Extreme multi-collinearity, requiring corrective measures.
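As a sketch, VIF values can be computed with the vif() function from the car package (assumed to be installed via install.packages("car")); the mtcars dataset is again an illustrative choice.

library(car)

# Deliberately include correlated predictors from the mtcars dataset:
# weight (wt) and displacement (disp) tend to move together
model <- lm(mpg ~ wt + hp + disp, data = mtcars)

# One VIF per predictor; values above 5 (or 10) flag collinearity
vif(model)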

Reducing Multi-collinearity
Reducing multi-collinearity in a regression model is essential to improve the stability and
interpretability of the coefficients.
Here are some common strategies to address multicollinearity:
1. Remove Highly Correlated Predictors:
Identify pairs of predictors with high correlation (using a correlation matrix) and remove one of each correlated pair (see the sketch after this list).
2. Combine Predictors:
Combine correlated variables into a single predictor using techniques like principal component
analysis (PCA) or by creating an index.
3. Centering Variables:
Subtract the mean from each predictor to create mean-centered variables.
4. Increase Sample Size:
Multi-collinearity effects are less pronounced in larger datasets because coefficients stabilize
with more observations.
5. Variance Inflation Factor (VIF):
Compute the VIF for each predictor. Remove or adjust variables with high VIF values (>5 or >10).
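As a sketch of strategies 1 and 2 above, base R's cor() and prcomp() can be used (mtcars is again an illustrative dataset choice):

# Strategy 1: inspect pairwise correlations among candidate predictors
predictors <- mtcars[, c("wt", "hp", "disp")]
round(cor(predictors), 2)    # |r| near 1 flags redundant pairs

# Strategy 2: combine correlated predictors via principal components
pca <- prcomp(predictors, scale. = TRUE)
summary(pca)                 # variance explained per component

# Use the first component as a single composite predictor
mtcars$pc1 <- pca$x[, 1]
model <- lm(mpg ~ pc1, data = mtcars)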

Textual Analysis
Textual Analysis refers to the process of extracting useful information and patterns from text
data like product reviews, social media posts, emails, or documents. It is commonly applied in
tasks such as sentiment analysis, keyword extraction, and text classification.
Since text is unstructured, it first needs to be cleaned and converted into a structured form so
that statistical or machine learning techniques can be applied.

Basic steps of text analysis are:


1. Text Preprocessing
Cleaning and preparing the text for analysis, for example converting it to lowercase and removing stop words (such as "the" and "is"), punctuation, and special characters.
2. Tokenization
Dividing the text into smaller tokens or units such as words, phrases, etc.
3. Text Representation
Converting the text into a form that is ready to be analysed, for instance a frequency count of words or a bag-of-words representation (a minimal sketch of these steps follows).
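A minimal base-R sketch of these three steps on a single made-up sentence (the stop-word list here is a tiny illustrative subset):

text <- "The product is GREAT, but the delivery was slow!"

# 1. Preprocessing: lowercase and strip punctuation
clean <- tolower(text)
clean <- gsub("[[:punct:]]", "", clean)

# 2. Tokenization: split into word tokens, then drop stop words
tokens <- unlist(strsplit(clean, "\\s+"))
stop_words <- c("the", "is", "but", "was")
tokens <- tokens[!tokens %in% stop_words]

# 3. Representation: a simple word-frequency count
table(tokens)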
Significance/Importance of Textual Analysis
Textual analysis is becoming increasingly important in business because it enables organizations to
extract valuable insights from the vast amounts of unstructured text data generated daily. With the
growing volume of data, businesses need efficient ways to process and understand this information.
1. Understanding Customer Sentiment: Helps analyze emotions and opinions from reviews,
feedback, etc. Businesses can adjust products, services, and marketing based on what
customers feel.
2. Market Research & Trend Analysis: Identifies trends and consumer needs through social media,
news, and reviews. Supports better product development and strategic planning.
3. Improved Customer Service: Finds common complaints and feedback in emails or support chats.
Enables faster replies and personalized support.
4. Brand Monitoring: Tracks what people say about a brand online to help protect and manage
brand reputation.
5. Competitive Advantage: Analyzes competitor feedback and public presence to identify
strengths and weaknesses.
6. Better Decision-Making: Converts large volumes of text into clear, actionable insights to
support smarter decisions in marketing, operations, and customer engagement.

Applications of Textual Analysis


Textual data analysis aims to transform raw text into structured information that can inform
decisions, improve processes, and generate actionable outcomes.
1. Information Extraction: Extract relevant information (names, dates, entities) from large text
volumes to create structured datasets.
2. Sentiment Analysis: Classify emotional tone behind a text (positive, negative, neutral) to
understand customer opinions, social media reactions, etc.
3. Topic Modeling: Discover hidden themes or topics in large text datasets using models like LDA.
4. Text Classification: Categorize text into predefined groups or classes (e.g., news types, feedback
categories).
5. Improved Decision-Making: Use insights to make informed, data-driven decisions that enhance
business or organizational outcomes.
6. Spam Detection: Classify messages as spam or not spam.

Challenges in Textual Analysis


1. Unstructured Format: Text doesn’t follow a fixed schema, making it hard to process.
2. Noise in Text: Includes spelling mistakes, abbreviations, emojis, and irrelevant information.
3. Language Ambiguity: Words may have multiple meanings based on context (e.g., “bank”).
4. Slang and Informal Language: Common in social media; hard for models to interpret.
5. Context Dependence: Meaning can change based on surrounding words or sentences.
6. Multilingual and Mixed-Language Text: Mixed or different languages add complexity.
Key Steps in Textual Analysis using R:
1. Text Collection: Gathering text data from sources such as CSV files, websites, or manually
entered documents.
2. Text Cleaning and Preprocessing: Converting to lowercase, Removing punctuation and
stopwords, Stemming and lemmatization, Tokenization.
3. Document-Term Matrix (DTM): The cleaned data is converted into a matrix format where rows
are documents and columns are terms. This helps in quantifying the presence of each term.
4. Visualization: Word clouds to show most frequent words, Bar plots for top terms, Sentiment
plots using polarity scores.
5. Sentiment Analysis: R packages like syuzhet and tidytext are used to classify emotions and
polarity (positive/negative) in text using built-in lexicons (e.g., NRC, Bing).
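A condensed sketch of steps 2–4 using the tm and wordcloud packages (both assumed to be installed; the three documents are made up for illustration):

library(tm)
library(wordcloud)

docs <- c("great product and fast delivery",
          "terrible product, slow delivery",
          "great service, will buy again")

# Step 2: clean the corpus (lowercase, punctuation, stop words)
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Step 3: Document-Term Matrix (rows = documents, columns = terms)
dtm <- DocumentTermMatrix(corpus)
freq <- colSums(as.matrix(dtm))

# Step 4: word cloud of the most frequent terms
wordcloud(names(freq), freq, min.freq = 1)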
R packages commonly used for text analysis are:
• tm (Text Mining) Package: One of the most commonly used packages for text preprocessing and text mining tasks in R. It provides a rich set of functions for transformations such as converting to lowercase, removing punctuation, and removing numbers.
• tidytext Package: Integrates text mining with tidyverse principles, allowing for easier
manipulation of text data using tidy data structures. It’s particularly useful for sentiment
analysis, word frequencies, and visualization.
• dplyr Package: Provides powerful tools for manipulating data (e.g., filtering, grouping,
summarizing).
• wordcloud Package: Used to visually display the most frequent words in a text corpus.
• syuzhet Package: Performs sentiment analysis by extracting emotions and sentiment scores
from text.
• ggplot2: Powerful data visualization package for creating elegant and customizable graphics.
Methods and Techniques of Textual Analysis
Three key methods of textual analysis are text mining, categorization, and sentiment analysis.
• Text mining gives the fundamental tools for cleaning and extracting features from raw text.
• Categorization helps classify text into predefined categories.
• Sentiment analysis gives insights into the emotional tone of the text.
Together, these methods unlock valuable insights from large volumes of unstructured text data,
making it more useful for business, research, and decision-making purposes.

Text Mining
Text mining is the process of extracting useful information and knowledge from unstructured text
data. It helps to uncover patterns, trends, and relationships in large text collections like books,
reviews, articles, or social media.
Key Steps:
• Text Preprocessing: Cleaning the text (Normalizing, removing punctuation, stop words,
numbers, and special characters, etc.)
• Tokenization: Breaking text into words or phrases.
• Word Frequency Analysis: Identifying most common words.
• Advanced Methods: Like topic modeling and clustering to group and understand content better.
It is useful to draw insights and understand deeper meanings from text data (e.g., what customers
commonly complain about).
R has an extensive list of libraries such as tm and tidytext that make the process of text mining easier.
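As a sketch, a word-frequency analysis with tidytext and dplyr (both assumed to be installed; the two reviews are made up for illustration):

library(dplyr)
library(tidytext)

reviews <- data.frame(id = 1:2,
                      text = c("Battery life is poor, very poor",
                               "Screen is great but battery drains fast"))

reviews %>%
  unnest_tokens(word, text) %>%            # one word token per row
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  count(word, sort = TRUE)                 # most frequent words first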

Categorization
It refers to the process of assigning text into predefined categories or labels based on the content.
This method is applied in many applications, including email filtering (spam vs. non-spam), document
classification (business, sports, tech), and sentiment analysis (positive, negative, neutral).
Techniques of categorization involve supervised learning models including Naive Bayes, Support
Vector Machines (SVM), and Logistic Regression. Such models require labeled training data to learn
how to classify new, unseen data.
The model can predict the category of a new document based on the patterns learned from the
training set once trained. In R, one can do text categorization by creating a Document-Term Matrix
(DTM) and using classification models such as Naive Bayes.
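A minimal sketch of DTM-based categorization with a Naive Bayes model from the e1071 package (tm and e1071 assumed installed; the four labelled documents are made up for illustration):

library(tm)
library(e1071)

docs <- c("win money now", "cheap pills buy now",
          "meeting at noon", "project update attached")
labels <- factor(c("spam", "spam", "ham", "ham"))

# Build a Document-Term Matrix and recode counts as presence/absence factors
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
to_factor <- function(x) factor(x > 0, levels = c(FALSE, TRUE),
                                labels = c("No", "Yes"))
train <- as.data.frame(lapply(as.data.frame(as.matrix(dtm)), to_factor))

# Train and apply the classifier (real work would use held-out test data)
model <- naiveBayes(train, labels)
predict(model, train)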

Sentiment Analysis
Sentiment analysis is the process of determining the emotional tone or sentiment behind a piece of
text. The aim is to classify text as expressing a positive, negative, or neutral sentiment.
This technique is widely used for analyzing customer feedback, product reviews, social media posts,
and other forms of text to measure public opinion or sentiment about a particular topic.
There are two main approaches in sentiment analysis:
• Lexicon-based Methods: These use pre-defined dictionaries of words with positive, negative, or
neutral sentiments.
Example: "happy" = positive, "angry" = negative.
The text is scanned for these words and the overall sentiment is calculated.
• Machine Learning-based Approaches: These involve training a model on labelled text data, where the sentiment is known in advance, and then applying that model to classify new text.
Techniques like Naive Bayes, Support Vector Machines, and deep learning can be applied here.
R provides libraries like syuzhet, tidytext, and sentimentr to perform sentiment analysis.
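A minimal sketch of the lexicon-based approach with syuzhet (assumed to be installed; the two reviews are made up for illustration):

library(syuzhet)

reviews <- c("I love this product, it works great!",
             "Horrible experience, totally disappointed.")

# Bing lexicon: one polarity score per text (positive > 0, negative < 0)
get_sentiment(reviews, method = "bing")

# NRC lexicon: counts for eight emotions plus positive/negative
get_nrc_sentiment(reviews)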
