
PREDICTIVE ANALYTICS

Descriptive Statistics – A set of numbers that describes the data of a study. It helps us decipher meaning from the data.

Data Types and Scales


Data are plain facts, usually raw numbers.

Data at a macro level can be classified as structured or unstructured. Structured data means that the data is described in a matrix form with labelled rows and columns. Any data that is not originally in matrix form with rows and columns is unstructured data (e.g., e-mails, images, videos). There is an increasing trend in the generation of unstructured data due to social media platforms such as Facebook and YouTube, and analysis of unstructured data is important for effective management. The Internet of Things is another popular source of unstructured data.

Data is classified into different categories based on the data structure and the scale of measurement of the variables.

Based on the type of data collected, the data is grouped into the following three classes:

 Cross-Sectional Data: Data collected on many variables of interest at the same time or duration of time is called cross-sectional data.

 Time Series Data: Data collected for a single variable over several time intervals (weekly, monthly, etc.) is called time series data.

 Panel Data: Data collected on several variables (multiple dimensions) over several time
intervals is called panel data (also known as longitudinal data).

Structured data can be either numeric or alpha numeric and may follow different scales of
measurement (level of measurement). The following are the measurement scales:

 Nominal Scale: Refers to variables whose values are names (labels); such variables are also known as categorical variables.

 Ordinal Scale: Refers to variables in which the value of the data is captured from an ordered set and is recorded in order of magnitude.
 Interval Scale: Refers to the variables in which the value is chosen from an interval set.

 Ratio Scale: Refers to the variables for which the ratios can be computed and are meaningful.

Measures of Central Tendency and Variation


Measures of central tendency describe the tendency of the observations to bunch around a central value, while measures of variation describe the variability in the data.

Measures of central tendency are measures used for describing the data with a single value. Mean, median, and mode are the three measures of central tendency and are frequently used to compare different data sets.

Measures of central tendency help users to summarize and comprehend the data

Mean: It involves adding up all the scores and dividing by the number of scores. The most important property of the mean is that “the summation of deviations of observations from the mean is always zero.”

Median: The value that divides the data into two equal parts.

Note – The mean is affected by outliers while the median is not. The median is appropriate for interval, ratio, and ordinal level data.

Mode: Most frequently occurring value in the dataset.

Percentile, Decile and Quartile are frequently used to identify the position of the observation in
the data set. Percentile is a measure indicating the value below which a given percentage of
observations in a group of observations fall. Decile corresponds to special values of percentile
that divide the data into 10 equal parts. Quartile divides the data into 4 equal parts.

Position of the xth percentile: Px = x(n + 1)/100, where n is the number of observations.

One of the primary objectives of analytics is to understand the variability in the data. Measures of dispersion quantify this variability. Predictive analytics techniques such as regression attempt to explain variation in the outcome variable (Y) using predictor variables (X). Variability in the data is measured using the following measures:

 Range = Maximum value – Minimum value

 Inter-Quartile Distance (IQD) = Q3-Q1

 Variance = Σ(xi − x̄)² / (n − 1) (sample variance)
 Standard Deviation: The square root of the variance; it represents the deviation of all the data points around the mean.
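A minimal R sketch of these measures on a small made-up vector x (the data and variable name are illustrative, not from the notes):

# Illustrative data
x <- c(12, 15, 11, 19, 24, 15, 18, 30, 16, 14)

mean(x)                                  # mean
median(x)                                # median
quantile(x, 0.90)                        # 90th percentile
max(x) - min(x)                          # range
quantile(x, 0.75) - quantile(x, 0.25)    # inter-quartile distance (IQD) = Q3 - Q1, or IQR(x)
var(x)                                   # sample variance: sum((x - mean(x))^2) / (n - 1)
sd(x)                                    # standard deviation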

Measures of Shape
The shape of the distribution helps in identifying the right descriptive statistic for the given data.

If the data is symmetrical, then MEAN or MEDIAN is the best measure of Central Tendency.

If the data is skewed, then Median is the best measure.

Skewness and Kurtosis are the measures of shape.

Skewness is a measure of symmetry or lack of symmetry. A dataset is symmetrical when the proportion of data at equal distances from the mean (or median) is equal. This implies that the distribution (or proportion) of the data on either side of the mean is the same.

Kurtosis is aimed at the shape of the tails, that is, whether the tails of the data distribution are heavy or light.

 Kurtosis value < 3 is Platykurtic Distribution

 Kurtosis value > 3 is Leptokurtic Distribution


 Kurtosis value = 3 is Mesokurtic Distribution

Probability theory

Random Variables
In probability and statistics, a random variable, random quantity, aleatory variable, or
stochastic variable is a variable whose possible values are outcomes of a random phenomenon.

A random variable is defined as a function that maps the outcomes of unpredictable processes to
numerical quantities (labels), typically real numbers. In this sense, it is a procedure for assigning
a numerical quantity to each physical outcome. Contrary to its name, this procedure itself is
neither random nor variable. Rather, the underlying process providing the input to this procedure
yields random (possibly non-numerical) output that the procedure maps to a real-numbered
value.

1. Discrete Random Variables


2. Continuous Random Variables

Probability: PMF PDF & CDF

1. Probability mass function (PMF) gives you the probability that a discrete random
variable is exactly equal to some real value.

 Expected value
 Variance and Standard deviation of D.R.V
2. Cumulative distribution function (CDF) will give you the probability that a random
variable is less than or equal to a certain real number.
3. Probability density function (PDF) of a random variable X, when integrated over a set of real numbers A, gives the probability that X lies in A.
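A short R sketch illustrating PMF, CDF and PDF, using the binomial (discrete) and normal (continuous) distributions; the parameter values are illustrative:

# PMF: P(X = 3) for a discrete random variable X ~ Binomial(n = 10, p = 0.4)
dbinom(3, size = 10, prob = 0.4)

# CDF: P(X <= 3) for the same binomial random variable
pbinom(3, size = 10, prob = 0.4)

# PDF and CDF for a continuous random variable X ~ Normal(0, 1)
dnorm(1)                                  # density at x = 1 (not a probability by itself)
pnorm(1)                                  # CDF: P(X <= 1)

# Integrating the PDF over an interval gives a probability:
integrate(dnorm, lower = -1, upper = 1)   # P(-1 <= X <= 1), about 0.683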

Normal Distribution

In probability theory, the normal (or Gaussian or Gauss or Laplace–Gauss) distribution is a very
common continuous probability distribution. Normal distributions are important in statistics and
are often used in the natural and social sciences to represent real-valued random variables whose
distributions are not known.

The normal distribution is useful because of the central limit theorem. In its most general form,
under some conditions (which include finite variance), it states that averages of samples of
observations of random variables independently drawn from independent distributions converge
in distribution to the normal, that is, become normally distributed when the number of
observations is sufficiently large.

The normal distribution is sometimes informally called the bell curve. However, many other
distributions are bell-shaped (such as the Cauchy, Student's t, and logistic distributions).

Chi-Square Distribution
In probability theory and statistics, the chi-squared distribution (also chi-square or χ2-
distribution) with k degrees of freedom is the distribution of a sum of the squares of k
independent standard normal random variables.

The chi-square distribution is a special case of the gamma distribution and is one of the most
widely used probability distributions in inferential statistics, notably in hypothesis testing or in
construction of confidence intervals.

Student's t Distribution
In probability and statistics, Student's t-distribution (or simply the t-distribution) is any member
of a family of continuous probability distributions that arises when estimating the mean of a
normally distributed population in situations where the sample size is small and population
standard deviation is unknown. It was developed by William Sealy Gosset under the pseudonym
Student.

The Student's t-distribution is a special case of the generalized hyperbolic distribution


F-Distribution
In probability theory and statistics, the F-distribution, also known as Snedecor's F distribution or
the Fisher–Snedecor distribution (after Ronald Fisher and George W. Snedecor) is a continuous
probability distribution that arises frequently as the null distribution of a test statistic, most
notably in the analysis of variance (ANOVA), e.g., F-test.

Sampling
Sampling is a process of selecting a subset of observations from a population to make inferences about various population parameters such as mean, proportion, standard deviation, etc. The sampling process itself has several steps, and each step is important to ensure that an ideal sample is used for estimating population parameters and for making inferences about the population.

Incorrect sample may lead to incorrect inference about the population

Population Parameter and Sample Statistic


Population Parameter: Measures such as mean and standard deviation calculated using the
entire population are called population parameters.

Sample Statistic: When population parameters are estimated from a sample, they are called sample statistics (or simply statistics).

Random Sampling

Sampling Distribution
In statistics, a sampling distribution or finite-sample distribution is the probability distribution of
a given random-sample-based statistic.

The sampling distribution of a statistic is the distribution of that statistic, considered as a random
variable, when derived from a random sample of size n. It may be considered as the distribution
of the statistic for all possible samples from the same population of a given sample size. The
sampling distribution depends on the underlying distribution of the population, the statistic being
considered, the sampling procedure employed, and the sample size used.

There is often considerable interest in whether the sampling distribution can be approximated by an asymptotic distribution. This corresponds to the limiting case in which the number of random samples of finite size, taken from an infinite population and used to produce the distribution, tends to infinity, or in which one equally-infinite-size "sample" is taken from that same population.

Central Limit Theorem


In probability theory, the central limit theorem (CLT) establishes that, in some situations, when
independent random variables are added, their properly normalized sum tends toward a normal
distribution (informally a "bell curve") even if the original variables themselves are not normally
distributed.

The theorem is a key concept in probability theory because it implies that probabilistic and
statistical methods that work for normal distributions can be applicable to many problems
involving other types of distributions.
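A small R simulation sketching the CLT: sample means of a non-normal (exponential) population look approximately normal for a reasonably large sample size. The population, sample size and number of replications below are illustrative:

set.seed(123)
# Population is exponential (skewed, not normal) with mean 1
sample_means <- replicate(5000, mean(rexp(n = 40, rate = 1)))

hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean (n = 40)")
mean(sample_means)   # close to the population mean 1
sd(sample_means)     # close to sigma / sqrt(n) = 1 / sqrt(40)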

Hypothesis Testing
"Beware of the problem of testing too many hypotheses; the more you torture the data, the more
likely they are to confess, but confession obtained under duress may not be admissible in the
court of scientific opinion"
- Stephen M Stigler

A hypothesis is a claim or belief; hypothesis testing is a statistical process of either rejecting or retaining a claim or belief or association related to a business context, product, service, process, etc. Hypothesis testing consists of two complementary statements called the null hypothesis and the alternative hypothesis, and only one of them is true.

Hypothesis testing is an integral part of many predictive analytics techniques such as multiple
linear regression and logistic regression. It plays an important role in providing evidence of an
association relationship between an outcome variable and predictor variables.

HYPOTHESIS TESTING
Data analysis in general can be classified as exploratory or confirmatory data analysis. In exploratory data analysis, the idea is to look for new or previously unknown hypotheses or to suggest hypotheses. In confirmatory data analysis, the objective is to test the validity of a hypothesis using techniques such as hypothesis testing and regression.

Confirmatory data analysis looks for evidence in support of hypotheses using techniques such as
hypothesis testing
Null and Alternative Hypothesis
The null hypothesis refers to the statement that there is no relationship or no difference between different groups with respect to the value of a population parameter. It states that the null condition exists, nothing new is happening, the old standard is correct, and the system is in control. It is denoted by H0.

The alternative hypothesis is the complement of the null hypothesis. It states that something new is happening, the new theory is true, and the system is out of control. It is denoted by H1.

Hypothesis test checks the validity of the null hypothesis based on the evidence from the sample.

At the beginning of the test, we assume that the null hypothesis is true. Since the researcher may
believe in alternative hypothesis, she/he may like to reject the null hypothesis. However, in many
cases (such as goodness of fit tests), we would like to retain or fail to reject the null hypothesis.

One-Tailed and Two-Tailed Tests


In statistical significance testing, a one-tailed test and a two-tailed test are alternative ways of
computing the statistical significance of a parameter inferred from a data set, in terms of a test
statistic.

A one-tailed test is appropriate if the estimated value may depart from the reference value in
only one direction, for example, whether a machine produces more than one-percent defective
products. In one-tailed problems, the researcher is trying to prove that something is higher,
lower, more, less, older, younger, greater, and so on.

A two-tailed test is appropriate if the estimated value may be more than or less than the reference value, for example, whether a test taker may score above or below the historical average. The null hypothesis always uses the = sign and the two-tailed alternative uses the ≠ sign.

Alternative names are one-sided and two-sided tests; the terminology "tail" is used because the
extreme portions of distributions, where observations lead to rejection of the null hypothesis, are
small and often "tail off" toward zero as in the normal distribution or "bell curve".

The purpose of hypothesis testing is not to question the computed value of the sample statistic
but to make a judgment about the difference between that sample statistic and a hypothesized
population parameter.

The significance value, usually denoted by alpha (α), is the criterion used for deciding whether to reject or retain the null hypothesis.
Type-I and Type-II Error
In hypothesis test, we end up with the following two decisions:

 Reject null hypothesis.

 Retain null hypothesis.

Type-I error is defined as the conditional probability of rejecting a null hypothesis when it is
true. The significance value α is the value of Type I error.

Type I error alpha (α) is the probability of rejecting the null hypothesis given that H0 is true.

Type-II error is defined as the conditional probability of retaining a null hypothesis when the
alternative hypothesis is true. It is denoted by the symbol β.

Type II error beta (β) is the probability of retaining null hypothesis given that H0 is false.

The value (1 – β) is known as the power of the hypothesis test. The power of a test is the
probability that a false null hypothesis will be detected by the test.
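As an illustration, base R's power.t.test() relates sample size, effect size, α and power (1 − β); the numbers below are purely illustrative:

# Power of a two-sided one-sample t-test to detect a mean shift of 0.5 sd at alpha = 0.05
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")

# Sample size needed to reach 80% power for the same effect size
power.t.test(power = 0.80, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample")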

TEST STATISTIC
The test statistic is the standardized difference between the estimated value of the parameter being tested (calculated from the sample) and the hypothesised value, used to establish the evidence regarding the null hypothesis. Also, remember that the p-value is the conditional probability of observing a test statistic value at least as extreme as the one observed, given that the null hypothesis is true.

The test statistic is the standardized value used for calculating the p-value of the test.

The cases in which the test statistic is Z are:

1. Population standard deviation is known, and the population is normal and


2. Population standard deviation is known, and the sample size is at least 30.

z = (x̄ − μ) / (σ / √n)

The cases in which the test statistic is t are:


When the population is normal, and population standard deviation is unknown but the sample
standard deviation ‘S’ is known.

t = (x̄ − μ) / (S / √n)

Critical value method

A critical value divides the sampling distribution into two parts, a rejection region, and a
nonrejection region. If the test statistic falls into the rejection region, we reject the null
hypothesis; otherwise, we fail to reject it.
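A hedged R sketch of the critical value method for a one-sample test of a mean (σ unknown, so the t statistic applies); the data and hypothesised mean are made up for illustration:

x <- c(48, 52, 55, 47, 50, 53, 49, 51, 54, 46)   # sample
mu0 <- 50                                        # hypothesised mean
n <- length(x)

t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))    # t = (x_bar - mu) / (S / sqrt(n))
t_crit <- qt(0.975, df = n - 1)                  # two-tailed critical value at alpha = 0.05
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)       # two-tailed p-value

t_stat; t_crit; p_val
# Reject H0 if |t_stat| > t_crit (equivalently, if p_val < 0.05)

# The same test with the built-in function:
t.test(x, mu = mu0, alternative = "two.sided")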

There are three common types of hypothesis tests:

1. Tests of hypothesis about population means


2. Tests of hypothesis about population proportions and
3. Tests of hypothesis about population variances.

One-Sampled Test for Proportion: z Statistic


The One-Sample Proportion Test is used to assess whether a population proportion (P1) is
significantly different from a hypothesized value (P0). This is called the hypothesis of inequality.
The hypotheses may be stated in terms of the proportions, their difference, their ratio, or their
odds ratio, but all four hypotheses result in the same test statistics.

According to the central limit theorem for proportions, the sampling distribution of the sample proportion p̂ (p-hat) for a large sample follows an approximately normal distribution with mean p (the population proportion) and standard deviation √(p(1 − p)/n).
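A brief R sketch of the one-sample proportion z-test computed from the normal approximation described above; the counts are illustrative, and R's prop.test() gives the chi-square-based equivalent:

x <- 62          # number of "successes" observed
n <- 100         # sample size
p0 <- 0.5        # hypothesised population proportion

p_hat <- x / n
z <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # z statistic for a proportion
p_val <- 2 * pnorm(-abs(z))                   # two-tailed p-value
z; p_val

# Built-in equivalent (without continuity correction):
prop.test(x, n, p = p0, correct = FALSE)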
One-Tailed and Two-Tailed Test of Means: t Statistic
In statistical significance testing, a one-tailed test and a two-tailed test are alternative ways of
computing the statistical significance of a parameter inferred from a data set, in terms of a test
statistic.

A one-tailed test is appropriate if the estimated value may depart from the reference value in only
one direction, for example, whether a machine produces more than one-percent defective
products. A two-tailed test is appropriate if the estimated value may be more than or less than the
reference value, for example, whether a test taker may score above or below the historical
average.

Two-Sample Hypothesis Test


In many cases, we would like to compare parameters of two different populations to check for
any difference in parameter values such as mean.

Scenario 1
Difference in Two Population Means when the Population Standard
Deviations are known: Two-Sample-Z-Test
In this case, the following assumptions are made-

a) The sample sizes, say n1 and n2, of the two samples drawn from the two populations are large (at least 30), and the corresponding population standard deviations σ1 and σ2 are known.
b) The samples are drawn from two normally distributed populations.

If these two assumptions are met, the Z-Statistic is applicable.

Scenario 2
Difference in Two Population Means when the Population Standard
Deviations are Unknown and Believed to be Equal: Two-Sample-t-Test
Pooled Variance: Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)

Scenario 3
Difference in Two Population Means when the Population Standard
Deviations are Unknown and Not Equal: Two-Sample-t-Test with Unequal
Variance
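A hedged R sketch of Scenarios 2 and 3 using two small made-up samples; t.test() with var.equal = TRUE pools the variances, while the default (var.equal = FALSE) is the unequal-variance Welch test:

group1 <- c(23, 27, 31, 25, 28, 30, 26, 29)
group2 <- c(21, 24, 22, 26, 23, 25, 20, 24)

# Scenario 2: population sds unknown but believed equal -> pooled two-sample t-test
t.test(group1, group2, var.equal = TRUE)

# Pooled variance computed directly from the formula above
n1 <- length(group1); n2 <- length(group2)
sp2 <- ((n1 - 1) * var(group1) + (n2 - 1) * var(group2)) / (n1 + n2 - 2)
sp2

# Scenario 3: population sds unknown and not assumed equal -> Welch two-sample t-test
t.test(group1, group2, var.equal = FALSE)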

Paired Sample t-Test


The objective of this test is to check whether the difference in the parameter values is statistically
significant before and after the intervention or between two different types of interventions. In a
paired t-test, the data related to the parameter is captured twice from the same subject, once
before the intervention and once after intervention.
The test statistic is t = (d̄ − μd) / (Sd / √n), where d̄ is the mean of the paired differences, μd is the hypothesised mean difference, and Sd is the standard deviation of the differences.
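A short R sketch of a paired t-test on made-up before/after measurements for the same subjects:

before <- c(82, 75, 90, 68, 77, 85, 73, 80)
after  <- c(78, 71, 88, 66, 75, 80, 70, 79)

d <- before - after
t_stat <- (mean(d) - 0) / (sd(d) / sqrt(length(d)))   # hypothesised mean difference = 0
t_stat

# Built-in equivalent:
t.test(before, after, paired = TRUE)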

"Analysis of Variance is not a mathematical theorem, but rather a


convenient method of arranging the arithmetic"
- Ronald Fisher
In many situations we may have to conduct a hypothesis test to compare mean values simultaneously for more than two groups (samples) created using a factor (or factors). When we have to compare the impact of a factor on the mean across more than two groups (created by different levels of the factor) simultaneously, hypothesis tests such as two-sample t-tests are not an ideal approach since they can result in incorrect Type I and Type II error rates.

We use the Analysis of Variance (ANOVA) to understand the differences in population means
between more than two populations

ONE-WAY ANOVA
Analysis of Variance (ANOVA) is a hypothesis testing procedure used for comparing means from
several groups simultaneously.

ANOVA plays an important role in multiple linear regression model diagnostics. The overall
significance of the model is tested using ANOVA.

In a one-way ANOVA, we test whether the mean values of an outcome variable for different levels of a factor are different. Using multiple two-sample t-tests to simultaneously test group means will result in incorrect estimation of the Type I error; ANOVA overcomes this problem.

Since only one factor is used to divide the population into groups, it is called one-way ANOVA.

There are a few conditions which need to be satisfied for one-way ANOVA:

I. We study the impact of a single treatment (also known as a factor) at different levels (thus forming different groups) on a continuous response variable, which is the outcome variable.
II. In each group, the response variable in the population follows a normal distribution, and the sample subjects are chosen using random sampling.
III. The population variances of the different groups are assumed to be equal; that is, the variability in the response variable within the different groups is the same.
If the mean values of the groups are different, then the variation within the groups will be much smaller than the variation between the groups.

Mean of the observations in an individual group j: x̄j = (1/nj) Σi xij

Overall mean: x̄ = (1/n) Σj Σi xij, where n is the total number of observations

Sum of squares of total variation: SST = Σj Σi (xij − x̄)², with df = n − 1

Mean square total: MST = SST / (n − 1)

Sum of squares of variation between groups: SSB = Σj nj (x̄j − x̄)², with df = k − 1

Mean square variation between groups: MSB = SSB / (k − 1)

Sum of squares of variation within groups: SSW = SST − SSB, with df = n − k, and mean square within groups MSW = SSW / (n − k). The test statistic is F = MSB / MSW.

Hence, in this hypothesis test we are looking at a right-tailed F-test. A few rules help in interpreting this F-test:

- If the null hypothesis is true, meaning there is no difference in the mean values of all the groups, then MSB and MSW will be close to each other; there is no difference between the within-group and the between-group variation.

- If the means of the groups are different (the alternative hypothesis, which is what we want to prove), MSB will be larger than MSW; that is, the between-group variation will be much larger than the within-group variation.

Putting these two points together, the ratio MSB/MSW of the between-group variation to the within-group variation will be close to 1 if there is no difference in means, and it will be larger than 1 if the means are different.
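A minimal one-way ANOVA sketch in R using a made-up factor with three levels; summary(aov()) reports the between/within sums of squares, MSB/MSW and the F statistic discussed above:

set.seed(1)
group <- factor(rep(c("A", "B", "C"), each = 10))
y <- c(rnorm(10, mean = 50), rnorm(10, mean = 55), rnorm(10, mean = 60))

fit <- aov(y ~ group)
summary(fit)    # F = MSB / MSW, with df (k - 1) and (n - k)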

TWO-WAY ANOVA
In statistics, the two-way analysis of variance (ANOVA) is an extension of the one-way
ANOVA that examines the influence of two different categorical independent variables on one
continuous dependent variable. The two-way ANOVA not only aims at assessing the main effect
of each independent variable but also if there is any interaction between them.
In a Two-Way ANOVA, we check the impact of more than one factor simultaneously on several
groups.
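A sketch of a two-way ANOVA with interaction in R, on a made-up balanced design with two factors:

set.seed(2)
A <- factor(rep(c("low", "high"), each = 20))
B <- factor(rep(rep(c("old", "new"), each = 10), times = 2))
y <- rnorm(40, mean = 50) + 3 * (A == "high") + 2 * (B == "new")

fit2 <- aov(y ~ A * B)   # A * B = main effects of A and B plus their interaction A:B
summary(fit2)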

R and RStudio
R and its libraries implement a wide variety of statistical and graphical techniques, including
linear and nonlinear modeling, classical statistical tests, time-series analysis, classification,
clustering, and others. R is easily extensible through functions and extensions, and the R
community is noted for its active contributions in terms of packages. Many of R's standard
functions are written in R itself, which makes it easy for users to follow the algorithmic choices
made.

RStudio is a free and open-source integrated development environment (IDE) for R, a programming language for statistical computing and graphics.

Setting working directory

setwd(dir)

Library/Package – a collection of predefined functions or modules.

Installing a package

Packages → Install (in the RStudio menu), or install.packages("package name") from the console

To list the built-in example datasets

data()

To know what a dataset is about, use the following command-

?datasetname (for example, ?mtcars)

Saving a data frame as a CSV file in your working directory

write.csv(object, "file name.csv")

write.csv(my_data, "mtcars_data.csv")

To get the first 6 rows of the data

head(my_data)

To see a different number of rows, use the following command-

head(my_data, n) where n is the number of rows

To get the last 6 rows of the data

tail(my_data)

tail(my_data, n) for the last n rows

Want to know about a particular function

Go to console → write the following code

?functionname (for example, ?mean)

Structure of the data

str(my_data)

Descriptive stats of the data

summary(my_data)

Python
Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library. Python interpreters are available for many operating systems.

Python is an interpreted, high-level programming language for general-purpose programming that emphasizes code readability, notably using significant whitespace.

Module 1
Introduction to Analytics
Analytics is a body of knowledge consisting of statistical, mathematical, and operations research
techniques; artificial intelligence techniques such as machine learning and deep learning
algorithms; data collection and storage; data management processes such as data extraction,
transformation and loading (ETL); and computing and big data technologies such as Hadoop,
Spark, and Hive that create value by developing actionable items from data. The primary macro-
level objectives of analytics are problem solving and decision-making.
"Analytics help organizations to create value by solving problems effectively and assisting in
decision making"

HIPPO Algorithm of decision-making → Highest Paid Person’s Opinion


Business Analytics is a multidisciplinary field that uses expertise such as statistical learning,
machine learning, artificial intelligence, computer science, information technology, and
management strategies to generate value from the data.
It has three main components: Business Context, Data Science, and Technology.
Business Context is important since the success of an analytics project depends on the ability of the organization to ask the right questions.
Uses of Analytics in Business
 Removing inefficiencies within an organization
 Problem Solving
 Decision making
 Competitive strategy

One of the reasons for the increase in the use of analytics is the theory of bounded rationality proposed by Herbert Simon (1972).
According to Herbert Simon, the increasing complexity of business problems, the existence of
several alternative solutions, and the limited time available for decision-making demand a highly
structured decision-making process using past data for the management of the organizations.

Analytics and Decision-making


Components of Analytics
Descriptive Analytics- In descriptive analytics, we try to find hidden patterns using descriptive statistics and data visualization. It answers the question: what has happened in the past?
Predictive Analytics- In predictive analytics, we predict future events such as customer churn, employee attrition, revenue forecasts and so on. It answers the question: what can happen in the future?
Prescriptive Analytics- In prescriptive analytics, we arrive at the optimal decision for a given problem. It suggests what actions to take.

The purpose of analytics is to understand the patterns that exist in data and connect the dots to understand what they mean to the business.
Framework – Data-driven decision making
1. Problem or opportunity identification
2. Collection of relevant data
3. Data preparation or Pre-processing
4. Model development
5.

Missing data can be handled using the concept of data imputation.

Data imputation is the process of filling in missing data. Techniques such as regression and K-Nearest Neighbours are used for data imputation.

Another important task in data preparation is feature engineering.


Feature engineering is deriving new features or new variables from the existing features. Interaction variables and ratios are two standard approaches used for generating new features. Interaction variables are products of two variables.
The data is typically split into training data (70–80%) and validation data (20–30%), as sketched below.
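A hedged R sketch of these preparation steps (mean imputation, a ratio and an interaction feature, and a 70/30 train–validation split) on the built-in mtcars data; the chosen columns and split ratio are illustrative:

data(mtcars)
df <- mtcars

# Simple data imputation: fill a (simulated) missing value with the column mean
df$wt[3] <- NA
df$wt[is.na(df$wt)] <- mean(df$wt, na.rm = TRUE)

# Feature engineering: a ratio variable and an interaction variable
df$power_to_weight <- df$hp / df$wt     # ratio of two existing features
df$hp_x_wt <- df$hp * df$wt             # interaction (product) of two features

# 70/30 split into training and validation data
set.seed(42)
train_idx <- sample(seq_len(nrow(df)), size = round(0.7 * nrow(df)))
train <- df[train_idx, ]
valid <- df[-train_idx, ]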

Classification of algorithms used in Machine Learning


1. Supervised Learning Algorithm
2. Unsupervised Learning Algorithm
3. Reinforcement Learning Algorithm
4. Evolutionary Learning Algorithm

Module-2
Regression is one of the most important techniques in predictive analytics since many prediction
problems are modelled using regression. It is one of the supervised learning algorithms, that is, a
regression model requires the knowledge of both the dependent and the independent variables in
the training data set.

Simple Linear Regression (SLR) is a statistical model in which there is only one independent
variable and the functional relationship between the dependent variable and the regression
coefficient is linear.
Regression is a tool for finding the existence of an association relationship between a dependent variable (Y) and one or more independent variables in the study. The relationship can be either linear or nonlinear. Regression is a statistical relationship. Linear regression means that the relationship is linear with respect to the regression parameters.

 Regression is the study of existence of a relationship between variables.


 Correlation is the study of strength of relationship between two variables.

Regression can be classified by the number of independent variables (simple vs. multiple) and by the form of the relationship (linear vs. non-linear):

 Linear: Simple Linear Regression, Multiple Linear Regression
 Non-Linear: Non-linear Simple Regression, Non-linear Multiple Regression

Regression is:
 A method for establishing an association relationship between a response variable and
few independent variables.
 A method for generating new theories and hypotheses.
Broadly classified into two categories.
o Simple Regression: One independent variable, further classified as linear and non-linear.
o Multiple Regression: More than one independent variable, further classified as linear
and non-linear.
Regression Model Development Process
Regression Model: Assumptions

The following summarizes the different assumptions and their impacts on the Regression model:

 Error follows normal distribution: This condition is necessary for the reliability of statistical tests (t and F).
 Homoscedasticity (constant variance): Necessary for statistical tests (F and t).
 Multi-collinearity: Inflates the standard error of estimates of the regression coefficients (beta coefficients); may reject significant variables.
 Auto-correlation: Underestimates the standard error of estimates of the regression coefficients; may accept insignificant variables.
Scatter plot is a visualization technique that is used while building Regression models.
Scatter plot (also called scatter diagram):
 Is a graph used to display and compare two or more variables.
 Does not require the user to specify the dependent and independent variables.

How do we analyze a Scatter Plot?

The following can be considered while interpreting a Scatter Plot:

Graph Data
In any data graph, look for the following:

 Overall pattern
 Striking deviations from that pattern

Overall Pattern
The overall pattern of a Scatter Plot is described by the following parameters:

 Form (Linear pattern?)


 Direction (Positive, Negative or Flat)
 Strength of the relationship (Correlation)

Watch for Outliers


An outlier is an individual value(s) that falls outside the overall pattern of the relationship.
Estimation of Regression Parameters
In linear regression, the estimation of regression parameters is carried out using a technique
called Ordinary Least Squares or OLS.
Under the standard assumptions, the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
Advantages of the least squares estimator (LSE):
 Unbiased estimates
 Minimum variance
 Consistency

Confidence Interval is:


 The interval estimate of expected value of Y given X, E(Y|X).
 Used for interpolation of the data within the X range.
Prediction Interval:
 Gives the interval estimate of Y for a given value of X.
 Is wider than the confidence interval.

 Confidence Interval is used for the interpolation of data within the x range.
 Prediction Interval (PI) is wider than the Confidence Interval (CI) because PI must take
account of the tendency of y to fluctuate from its mean value; CI simply needs to account
for the uncertainty in estimating the mean value.
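An R sketch contrasting the two intervals for a simple linear regression on the built-in mtcars data (predicting mpg from wt); note how the prediction interval is wider:

fit <- lm(mpg ~ wt, data = mtcars)
new_x <- data.frame(wt = 3.0)

predict(fit, newdata = new_x, interval = "confidence")   # CI for E(Y | X = 3.0)
predict(fit, newdata = new_x, interval = "prediction")   # PI for an individual Y at X = 3.0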

Regression Model Validation


We validate the model using the following approaches.

The first is the coefficient of determination, which checks the goodness of fit of the regression. The coefficient of determination is also called R-square; it is a very frequently used, and sometimes misused, metric in regression model building.
The second is analysis of variance (ANOVA), where we check the overall fitness of the model developed. The third is the t-test, where we check whether the independent variable and the dependent variable are statistically related. Here we set the null hypothesis as β1 = 0 and the alternative hypothesis as β1 ≠ 0.
Finally, we do a residual analysis. Residual analysis is a plot between the residuals (errors) and the predicted values of the outcome variable. We use residual analysis to check the normality of the errors and homoscedasticity; also, if there is any pattern in the residual plot, it may imply that we have used an incorrect functional form. For example, we may have used a linear relationship between Y and X instead of a log-linear or log-log relationship.
Standard error of estimate: the standard deviation of the errors; it estimates the standard deviation of the outcome variable about the regression line.
 The R-Square value is used to measure the goodness of fit of a model.
 The Standard Error test provides an estimate of standard deviation of regression errors.
 The ANOVA / F test checks the overall model significance.
 The T-test validates the relationship between dependent and individual independent
variable.
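A brief R sketch of these checks for a simple linear regression on mtcars: summary() reports R-square, the standard error of estimate ("residual standard error"), the overall F test and the t tests on the coefficients:

fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)      # R-squared, residual standard error, F statistic, t tests on coefficients
anova(fit)        # ANOVA table for the overall model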
Residual Plot - Introduction
Residual plot is a plot of error (or standardized error) against one of the following variables:
 The dependent variable Y.
 The independent variable X.
 The standardized independent or dependent variable.
Here are some reasons why we should use Residual Analysis.
 Analysis of residuals reveals whether the assumption of normally distributed errors holds.
 Residual plots are used to check if there is a heteroscedasticity problem (non-constant variance of the error term).
 Residual analysis could also indicate if there are any missing variables.
 Residual plot can also reveal if the actual relationship is non-linear.

Regression Assumption Checks: Introduction


Residual analysis is performed for the following reasons:
 To check the normality of error terms.
 To check for heteroscedasticity.
 To test the non-linear relationship.

Check the normality of error terms


 Probability plot is a graphical technique for checking whether a data set follows a given
distribution.
 The data is plotted against a theoretical distribution in such a way that the points should
form a straight line.
 In Regression, we create a probability plot of error against normal distribution.

Check for heteroscedasticity


 A graph of the residuals versus the dependent variable Y or the independent variable X will reveal whether the variance of the errors is constant.
 If the width of the scatter plot of the residuals either increases or decreases as X (or Y)
increases, then the assumption of constant variance is not met.

Check for non-linearity


 If the residual plot exhibits a curve when plotted, then the actual relationship is non-
linear.

Residual Analysis reveals whether the following regression assumptions hold:


 Normality of errors
 Homoscedasticity
 Linearity
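A small R sketch of the residual analysis described above, again using mpg ~ wt on mtcars:

fit <- lm(mpg ~ wt, data = mtcars)
res <- residuals(fit)

# Normality of errors: normal probability (Q-Q) plot
qqnorm(res); qqline(res)

# Heteroscedasticity / non-linearity: residuals versus fitted values
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)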

Multiple Linear Regression (MLR)


Multiple Linear Regression (MLR) is a statistical technique for finding existence of an
association relationship between a dependent variable (aka response variable or outcome
variable) and several independent variables (aka explanatory variables or predictor variable).

MLR - Modelling Steps:


1. Start with a belief or hypothesis.
2. Estimate unknown model parameters.
3. Assume that random error term follows normal distribution.
4. Check for normality, heteroscedasticity and multi-collinearity.
5. Evaluate and use model for prediction, estimation and other purposes.

Points to Remember
 Ordinary Least Squares method is used to estimate multiple regression coefficients.
 Multi-collinearity can affect the statistical significance of the variable included in model
building.
 Partial correlation coefficient measures the relationship between two variables (say Y and
X1) when the influence of all other variables (say X2, X3, ..., Xk) connected with these
two variables (Y and X1) are removed.
 Part correlation coefficient measures the relationship between two variables (say Y and
X1) when the influence of all other variables (say X2, X3, ..., Xk) connected with these
two variables (Y and X1) are removed from one of the variables (X1).

Adjusted R-Square
Partial F-Test
Why do we need the Partial F-Test?
We may have one model with a large number of variables and another model with fewer variables. The partial F-test helps us understand whether the additional variables in the larger model add value in explaining the variation in the response variable.
Points to Remember
 R-square and Adjusted R-square are used to test the overall model fitness.
 F test is used to test the overall model statistical significance.
 Partial F test is used to test portions of the model.
 T test is used to test the statistical significance of individual explanatory variables.
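A sketch of a partial F-test in R, comparing a smaller mtcars model with a larger one via anova(); the choice of predictors is illustrative:

small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Partial F-test: do the additional variables (hp, qsec) add explanatory value?
anova(small, large)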

Dummy Variable
Whenever we have a qualitative variable, we convert it using dummy (indicator) variables. Dummy variables take values 0 or 1; they are binary variables. Therefore, when we have a categorical (qualitative) variable with n categories, we use n-1 dummy variables.
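In R, declaring a categorical variable as a factor makes lm() create the n-1 dummy variables automatically; a short sketch on mtcars (treating the number of cylinders as a category) follows:

mtcars$cyl_f <- factor(mtcars$cyl)        # 3 categories: 4, 6, 8
fit <- lm(mpg ~ wt + cyl_f, data = mtcars)
summary(fit)                              # shows 3 - 1 = 2 dummy coefficients (cyl_f6, cyl_f8)

# The underlying 0/1 dummy columns can be inspected with:
head(model.matrix(~ cyl_f, data = mtcars))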
Derived Variables
Whenever we have a dataset, we have a large number of variables. One of the decisions we have to take is how we are going to use those variables: we can use a variable as it is, or derive new variables from the existing variables. Two frequent ways of deriving a new variable are to take ratios or to take interaction variables.
For ratios: if we have a variable X1 and another variable X2, we can derive a new variable X3 = X1/X2. Similarly, we can create a new variable that is the product of X1 and X2. These are the two frequently used approaches for deriving new variables.

When we incorporate such product terms in the regression model, we call them interaction variables.

Multi-collinearity
Variance Inflation Factor
One of the measures used to identify the existence of multi-collinearity is the variance inflation factor (VIF). As the name suggests, it captures the factor by which the variance of an estimated regression coefficient is inflated due to correlation among the explanatory variables. Typically, statisticians and analytics experts use a threshold of 4 or 10; a variance inflation factor of more than 4 (or 10) is treated as dangerous for the model.

Points to remember
 n-1 dummy variables are created for a categorical variable with n categories.
 Two frequent ways of deriving new variables from the existing explanatory variables are
ratios and interactions.
 Multi-collinearity is nothing but the high correlation between explanatory variables.
 Multi-collinearity can lead to unstable regression coefficients.
Variables Selection Methods

Points to remember
 In the forward selection method, the entry variable at each step is the one with the smallest p-value.
 In the backward elimination method, all variables are entered into the equation and then sequentially removed, starting with the most insignificant variable.
 In step-wise regression, the entry variable at each step is the one with the smallest p-value; at each step, variables already in the equation are re-examined and removed if they become insignificant.
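Base R's step() performs backward, forward or both-direction selection, but note it uses AIC rather than p-values as the entry/removal criterion; a sketch on mtcars with an illustrative predictor set:

full <- lm(mpg ~ wt + hp + qsec + drat + disp, data = mtcars)
null <- lm(mpg ~ 1, data = mtcars)

# Backward elimination starting from the full model
step(full, direction = "backward")

# Forward selection starting from the empty model
step(null, scope = formula(full), direction = "forward")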
