Predictive Analytics Notes
Data at a macro level can be classified as structured or unstructured. Structured data
means that the data is described in matrix form with labelled rows and columns. Any data that
is not originally in matrix form with rows and columns is unstructured data (for example, e-mails,
images, videos, etc.). There is an increasing trend in the generation of unstructured data due to
social media platforms such as Facebook and YouTube, and analysis of unstructured data is
important for effective management. The Internet of Things is another popular source of unstructured
data.
Data is classified into different categories based on data structure and the scale of measurement of
the variables.
Based on the type of data collected, the data is grouped into the following three classes:
Cross-Sectional Data: Data collected on many variables of interest at the same time or
over the same duration of time is called cross-sectional data.
Time Series Data: Data collected for a single variable over several time intervals (weekly,
monthly, etc.) is called time series data.
Panel Data: Data collected on several variables (multiple dimensions) over several time
intervals is called panel data (also known as longitudinal data).
Structured data can be either numeric or alphanumeric and may follow different scales of
measurement (levels of measurement). The following are the measurement scales:
Nominal Scale: Refers to variables whose values are names or labels; such variables are also known as categorical variables.
Ordinal Scale: Refers to variables whose values are taken from an ordered set, so the values can be ranked in order of magnitude.
Interval Scale: Refers to variables measured on a scale in which differences between values are meaningful but there is no true zero point (for example, temperature in Celsius).
Ratio Scale: Refers to variables for which ratios can be computed and are meaningful (there is a true zero point).
Measures of central tendency are measures used for describing the data with a single
value. Mean, median, and mode are the three measures of central tendency and are frequently
used to compare different data sets.
Measures of central tendency help users to summarize and comprehend the data
Mean: It involves adding up all the scores and dividing by the number of scores. The most
important property of the mean is that "the summation of deviations of observations from the mean is
always zero."
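As a quick illustration of this property, here is a minimal R sketch using hypothetical values (the variable name scores is illustrative):
scores <- c(12, 15, 18, 22, 33)   # hypothetical data
m <- mean(scores)                 # arithmetic mean
sum(scores - m)                   # deviations from the mean sum to (numerically) zero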
Median: The value that divides the data into two equal parts.
Note – The mean is affected by outliers while the median is not. The median is appropriate for
interval, ratio, and ordinal level data.
Percentile, Decile and Quartile are frequently used to identify the position of the observation in
the data set. Percentile is a measure indicating the value below which a given percentage of
observations in a group of observations fall. Decile corresponds to special values of percentile
that divide the data into 10 equal parts. Quartile divides the data into 4 equal parts.
Position of the x-th percentile: Px = x(n + 1) / 100
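A minimal R sketch (with hypothetical observations named marks) for locating percentiles and quartiles; type = 6 is the quantile method that matches the (n + 1) position formula above:
marks <- c(45, 52, 58, 61, 67, 70, 74, 80, 86, 93)    # hypothetical observations
quantile(marks, probs = 0.90, type = 6)               # 90th percentile
quantile(marks, probs = c(0.25, 0.50, 0.75), type = 6) # quartiles Q1, Q2 (median), Q3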
One of the primary objectives of analytics is to understand the variability in the data. Measures of
dispersion quantify this variability in the data. Predictive analytics techniques such as regression attempt
to explain variation in the outcome variable (Y) using predictor variables (X). Variability in the
data is measured using the following measures:
Variance: s² = Σ(xᵢ − x̄)² / (n − 1), where x̄ is the sample mean and n is the number of observations.
Standard Deviation: The square root of the variance; it represents the typical deviation of the data points from the mean.
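A short R sketch (hypothetical values in x) for the measures of dispersion above:
x <- c(4, 8, 15, 16, 23, 42)   # hypothetical data
var(x)                         # sample variance, divides by n - 1
sd(x)                          # standard deviation = sqrt(var(x))
range(x)                       # minimum and maximum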
Measures of Shape
The shape of the distribution helps in identifying the right descriptive statistic for the given data.
If the data is symmetrical, then the mean or the median is the best measure of central tendency.
Kurtosis describes the shape of the tails of the distribution, that is, whether the tails of the data distribution are heavy or
light.
Probability theory
Random Variables
In probability and statistics, a random variable, random quantity, aleatory variable, or
stochastic variable is a variable whose possible values are outcomes of a random phenomenon.
A random variable is defined as a function that maps the outcomes of unpredictable processes to
numerical quantities (labels), typically real numbers. In this sense, it is a procedure for assigning
a numerical quantity to each physical outcome. Contrary to its name, this procedure itself is
neither random nor variable. Rather, the underlying process providing the input to this procedure
yields random (possibly non-numerical) output that the procedure maps to a real-numbered
value.
1. Probability mass function (PMF) gives you the probability that a discrete random
variable is exactly equal to some real value.
Expected value of a discrete random variable: E(X) = Σ x · P(X = x)
Variance of a discrete random variable: Var(X) = Σ (x − E(X))² · P(X = x); the standard deviation is √Var(X).
2. Cumulative distribution function (CDF) will give you the probability that a random
variable is less than or equal to a certain real number.
3. Probability density function (PDF) of a random variable X, when integrated over a set
of real numbers A, will give the probability that X lies in A.
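A small R sketch illustrating these quantities for a hypothetical discrete random variable (a fair six-sided die):
x <- 1:6                     # possible outcomes
p <- rep(1/6, 6)             # PMF: P(X = x)
ex <- sum(x * p)             # expected value E(X) = 3.5
vx <- sum((x - ex)^2 * p)    # variance of the discrete random variable
sqrt(vx)                     # standard deviation
cumsum(p)                    # CDF evaluated at 1, 2, ..., 6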
Normal Distribution
In probability theory, the normal (or Gaussian or Gauss or Laplace–Gauss) distribution is a very
common continuous probability distribution. Normal distributions are important in statistics and
are often used in the natural and social sciences to represent real-valued random variables whose
distributions are not known.
The normal distribution is useful because of the central limit theorem. In its most general form,
under some conditions (which include finite variance), it states that averages of samples of
observations of random variables independently drawn from independent distributions converge
in distribution to the normal, that is, become normally distributed when the number of
observations is sufficiently large.
The normal distribution is sometimes informally called the bell curve. However, many other
distributions are bell-shaped (such as the Cauchy, Student's t, and logistic distributions).
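A minimal R sketch for working with the normal distribution; the mean of 60 and standard deviation of 10 are assumed illustrative values:
pnorm(70, mean = 60, sd = 10)        # P(X <= 70)
qnorm(0.95, mean = 60, sd = 10)      # value below which 95% of observations fall
hist(rnorm(1000, mean = 60, sd = 10)) # quick look at the bell curve via simulation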
Chi-Square Distribution
In probability theory and statistics, the chi-squared distribution (also chi-square or χ2-
distribution) with k degrees of freedom is the distribution of a sum of the squares of k
independent standard normal random variables.
The chi-square distribution is a special case of the gamma distribution and is one of the most
widely used probability distributions in inferential statistics, notably in hypothesis testing or in
construction of confidence intervals.
Student's t Distribution
In probability and statistics, Student's t-distribution (or simply the t-distribution) is any member
of a family of continuous probability distributions that arises when estimating the mean of a
normally distributed population in situations where the sample size is small and population
standard deviation is unknown. It was developed by William Sealy Gosset under the pseudonym
Student.
Sampling
Sampling is the process of selecting a subset of observations from a population to make inferences
about various population parameters such as the mean, proportion, standard deviation, etc. The sampling
process itself has several steps, and each step is important to ensure that the ideal sample is used
for estimating population parameters and for making inferences about the population.
Sample Statistic: When population parameters are estimated from a sample, the estimates are called sample
statistics (or simply statistics).
Random Sampling
Sampling Distribution
In statistics, a sampling distribution or finite-sample distribution is the probability distribution of
a given random-sample-based statistic.
The sampling distribution of a statistic is the distribution of that statistic, considered as a random
variable, when derived from a random sample of size n. It may be considered as the distribution
of the statistic for all possible samples from the same population of a given sample size. The
sampling distribution depends on the underlying distribution of the population, the statistic being
considered, the sampling procedure employed, and the sample size used.
There is often considerable interest in either whether the sampling distribution can be
approximated by an asymptotic distribution, which corresponds to the limiting case as the
number of random samples of finite size, taken from an infinite population and used to produce
the distribution, tends to infinity, or when just one equally-infinite-size "sample" is taken of that
same population.
The central limit theorem is a key concept in probability theory because it implies that probabilistic and
statistical methods that work for normal distributions can be applied to many problems
involving other types of distributions.
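A quick simulation sketch in R of the sampling distribution of the mean; the exponential population and sample size of 30 are assumed for illustration:
set.seed(1)
# draw 1000 samples of size n = 30 from a skewed (exponential) population
sample_means <- replicate(1000, mean(rexp(30, rate = 1)))
hist(sample_means)   # approximately normal, as the central limit theorem suggests
mean(sample_means)   # close to the population mean (1 here)
sd(sample_means)     # close to sigma / sqrt(n)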
Hypothesis Testing
"Beware of the problem of testing too many hypotheses; the more you torture the data, the more
likely they are to confess, but confession obtained under duress may not be admissible in the
court of scientific opinion"
- Stephen M Stigler
Hypothesis testing is an integral part of many predictive analytics techniques such as multiple
linear regression and logistic regression. It plays an important role in providing evidence of an
association relationship between an outcome variable and predictor variables.
HYPOTHESIS TESTING
Data analysis in general can be classified as exploratory or confirmatory data analysis. In
exploratory data analysis, the idea is to look for new or previously unknown relationships and to
suggest hypotheses. In confirmatory data analysis, the objective is to test the validity of a
hypothesis using techniques such as hypothesis testing and regression.
Confirmatory data analysis looks for evidence in support of hypotheses using techniques such as
hypothesis testing
Null and Alternative Hypothesis
The null hypothesis refers to the statement that there is no relationship or no difference between
different groups with respect to the value of a population parameter. It states that the null condition
exists, nothing new is happening, the old standard is correct, and the system is in control. It is denoted
by H0.
The alternative hypothesis is the complement of the null hypothesis. It states that something new is
happening, the new theory is true, and the system is out of control. It is denoted by H1.
Hypothesis test checks the validity of the null hypothesis based on the evidence from the sample.
At the beginning of the test, we assume that the null hypothesis is true. Since the researcher may
believe in alternative hypothesis, she/he may like to reject the null hypothesis. However, in many
cases (such as goodness of fit tests), we would like to retain or fail to reject the null hypothesis.
A one-tailed test is appropriate if the estimated value may depart from the reference value in
only one direction, for example, whether a machine produces more than one-percent defective
products. In one-tailed problems, the researcher is trying to prove that something is higher,
lower, more, less, older, younger, greater, and so on.
A two-tailed test is appropriate if the estimated value may be more than or less than the reference
value, for example, whether a test taker may score above or below the historical average. In a
two-tailed test, the null hypothesis uses the = sign and the alternative hypothesis uses the ≠ sign.
Alternative names are one-sided and two-sided tests; the terminology "tail" is used because the
extreme portions of distributions, where observations lead to rejection of the null hypothesis, are
small and often "tail off" toward zero as in the normal distribution or "bell curve".
The purpose of hypothesis testing is not to question the computed value of the sample statistic
but to make a judgment about the difference between that sample statistic and a hypothesized
population parameter.
The significance value, usually denoted by alpha (α), is the criterion used for deciding whether to
reject or fail to reject the null hypothesis.
Type-I and Type-II Error
In a hypothesis test, we can end up with the following two types of error:
Type-I error is defined as the conditional probability of rejecting a null hypothesis when it is
true. The significance value α is the value of Type I error.
Type I error alpha (α) is the probability of rejecting the null hypothesis given that H0 is true.
Type-II error is defined as the conditional probability of retaining a null hypothesis when the
alternative hypothesis is true. It is denoted by the symbol β.
Type II error beta (β) is the probability of retaining null hypothesis given that H0 is false.
The value (1 – β) is known as the power of the hypothesis test. The power of a test is the
probability that a false null hypothesis will be detected by the test.
TEST STATISTIC
The test statistic is the standardized difference between the value of the parameter being tested,
as estimated from the sample, and its hypothesized value; it is used to establish the evidence
regarding the null hypothesis. Also, remember that the p-value is the conditional probability of
observing a statistic value at least as extreme as the one calculated, given that the null hypothesis is true.
The test statistic is the standardized value used for calculating the p-value under the null
hypothesis.
z = (x̄ − μ) / (σ / √n)
t = (x̄ − μ) / (S / √n)
A critical value divides the sampling distribution into two parts, a rejection region, and a
nonrejection region. If the test statistic falls into the rejection region, we reject the null
hypothesis; otherwise, we fail to reject it.
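A minimal one-sample t-test sketch in R; the data and the hypothesized mean μ0 = 50 are assumed for illustration:
x <- c(48, 52, 55, 49, 61, 47, 53, 58, 50, 46)  # hypothetical sample
t.test(x, mu = 50)                              # two-tailed test by default
t.test(x, mu = 50, alternative = "greater")     # one-tailed (right tail) version
# reject H0 if the p-value is below the chosen significance level alpha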
According to the central limit theorem for proportions, the sampling distribution of the sample proportion p̂
(p-hat) for a large sample follows an approximate normal distribution with mean p, the
population proportion, and standard deviation √(p(1 − p)/n).
One-Tailed and Two-Tailed Test of Means: t Statistic
In statistical significance testing, a one-tailed test and a two-tailed test are alternative ways of
computing the statistical significance of a parameter inferred from a data set, in terms of a test
statistic.
Scenario 1
Difference in Two Population Means when the Population Standard
Deviations are known: Two-Sample-Z-Test
In this case, the following assumptions are made:
a) The sample sizes n1 and n2 of the two samples drawn from the two populations are large
(at least 30), and the corresponding population standard deviations σ1 and σ2 are known.
b) The samples are drawn from two normally distributed populations.
Scenario 2
Difference in Two Population Means when the Population Standard
Deviations are Unknown and Believed to be Equal: Two-Sample-t-Test
Pooled Variance: Sp² = [(n1 − 1)S1² + (n2 − 1)S2²] / (n1 + n2 − 2)
Scenario 3
Difference in Two Population Means when the Population Standard
Deviations are Unknown and Not Equal: Two-Sample-t-Test with Unequal
Variance
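A minimal R sketch for the two t-test scenarios above; the samples g1 and g2 are hypothetical:
g1 <- c(12.1, 13.4, 11.8, 14.2, 12.9, 13.7)  # hypothetical sample from population 1
g2 <- c(11.2, 12.5, 10.9, 11.8, 12.2, 11.5)  # hypothetical sample from population 2
t.test(g1, g2, var.equal = TRUE)    # Scenario 2: pooled-variance two-sample t-test
t.test(g1, g2, var.equal = FALSE)   # Scenario 3: Welch t-test for unequal variances (the default)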
We use Analysis of Variance (ANOVA) to understand the differences in population means
among more than two populations.
ONE-WAY ANOVA
Analysis of Variance (ANOVA) is a hypothesis testing procedure used for comparing means from
several groups simultaneously.
ANOVA plays an important role in multiple linear regression model diagnostics. The overall
significance of the model is tested using ANOVA.
In a one-way ANOVA, we test whether the mean values of an outcome variable for different
levels of a factor are different. Using multiple two-sample t-tests to simultaneously test group
means will result in an incorrect estimation of the Type I error, and ANOVA overcomes this problem.
When only one factor is used to divide the population into groups, it is called one-way ANOVA.
There are a few conditions that need to be satisfied for one-way ANOVA:
I. We study the impact of a single treatment (also known as a factor) at different levels, thus
forming different groups, on a continuous response variable, which is the outcome variable.
II. In each group, the population response variable follows a normal distribution, and the
sample subjects are chosen using random sampling.
III. The population variances of the different groups are assumed to be the same; that is, the
variability in the response variable within the different groups is the same.
If the mean values of the groups are different, then the variation within the groups will be much
smaller than the variation between the groups.
Overall (grand) mean = sum of all observations / total number of observations (n)
Total degrees of freedom: df = n − 1
Between-groups degrees of freedom: df = k − 1, where k is the number of groups; within-groups degrees of freedom: df = n − k
Mean square variation between groups: MSB = SSB / (k − 1); mean square variation within groups: MSW = SSW / (n − k), where SSB and SSW are the between-group and within-group sums of squares. The test statistic is F = MSB / MSW.
Hence, in this hypothesis test, we are looking at a right-tailed test. To interpret the F-test,
a few rules need to be kept in mind:
- If the null hypothesis is true, that is, there is no difference in the mean values of the
groups, then MSB and MSW will be approximately equal (there is no difference between the within-group
and between-group variations).
- If the means of the groups are different, MSB will be larger than MSW. In other words, if the
means of the different groups are different, which is the alternative hypothesis and what
we want to prove, then the between-group variation (MSB) will be much larger than the within-group variation (MSW).
Summing up both these points, the ratio MSB/MSW of the between-group variation to the
within-group variation will be close to 1 if there is no difference in means,
and it will be much larger than 1 if the means are different.
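A minimal one-way ANOVA sketch in R; the data frame aov_df, the outcome y, and the factor group are all hypothetical:
# hypothetical data: outcome y measured under three treatment levels
aov_df <- data.frame(
  y = c(23, 25, 21, 30, 32, 29, 40, 38, 41),
  group = factor(rep(c("A", "B", "C"), each = 3))
)
fit <- aov(y ~ group, data = aov_df)  # one-way ANOVA
summary(fit)                          # reports F = MSB / MSW and its p-value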
TWO-WAY ANOVA
In statistics, the two-way analysis of variance (ANOVA) is an extension of the one-way
ANOVA that examines the influence of two different categorical independent variables on one
continuous dependent variable. The two-way ANOVA not only aims at assessing the main effect
of each independent variable but also if there is any interaction between them.
In a two-way ANOVA, we check the impact of two factors simultaneously on a continuous outcome
variable measured across several groups.
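A minimal two-way ANOVA sketch in R; the data frame aov2_df and the factors factorA and factorB are hypothetical, and the formula includes their interaction:
aov2_df <- data.frame(
  y = c(10, 12, 14, 13, 18, 20, 22, 21, 11, 13, 19, 23),
  factorA = factor(rep(c("low", "high"), each = 6)),
  factorB = factor(rep(c("x", "y", "z"), times = 4))
)
fit2 <- aov(y ~ factorA * factorB, data = aov2_df)  # main effects plus interaction
summary(fit2)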
R and R-Studio
R and its libraries implement a wide variety of statistical and graphical techniques, including
linear and nonlinear modeling, classical statistical tests, time-series analysis, classification,
clustering, and others. R is easily extensible through functions and extensions, and the R
community is noted for its active contributions in terms of packages. Many of R's standard
functions are written in R itself, which makes it easy for users to follow the algorithmic choices
made.
setwd(dir)                              # set the working directory to the path 'dir'
Installing a package: Packages → Install (in RStudio), or install.packages("package_name") at the console
data()                                  # list the data sets available in the loaded packages
?dataset_name                           # open the help page for a data set
write.csv(my_data, "mtcars_data.csv")   # write a data frame to a CSV file
head(my_data)                           # show the first six rows of the data
tail(my_data)                           # show the last six rows of the data
head(my_data, n)                        # show the first n rows of the data
?function_name                          # open the help page for a function
str(my_data)                            # structure of the object (variable types and dimensions)
summary(my_data)                        # summary statistics for each variable
Python
Python features a dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional, and procedural, and
has a large and comprehensive standard library. Python interpreters are available for many
operating systems.
Module 1
Introduction to Analytics
Analytics is a body of knowledge consisting of statistical, mathematical, and operations research
techniques; artificial intelligence techniques such as machine learning and deep learning
algorithms; data collection and storage; data management processes such as data extraction,
transformation and loading (ETL); and computing and big data technologies such as Hadoop,
Spark, and Hive that create value by developing actionable items from data. The primary macro-
level objectives of analytics are problem solving and decision-making.
"Analytics help organizations to create value by solving problems effectively and assisting in
decision making"
One of the reasons for the increase in the use of analytics is the theory of bounded rationality proposed by
Herbert Simon (1972).
According to Herbert Simon, the increasing complexity of business problems, the existence of
several alternative solutions, and the limited time available for decision-making demand a highly
structured decision-making process using past data for the management of the organizations.
The purpose of analytics is to understand the patterns that exist in the data and connect the dots to
understand what they mean for the business.
Framework – Data-driven decision making
1. Problem or opportunity identification
2. Collection of relevant data
3. Data preparation or Pre-processing
4. Model development
5. Model deployment and communication of results
Module-2
Regression is one of the most important techniques in predictive analytics since many prediction
problems are modelled using regression. It is one of the supervised learning algorithms, that is, a
regression model requires the knowledge of both the dependent and the independent variables in
the training data set.
Simple Linear Regression (SLR) is a statistical model in which there is only one independent
variable and the functional relationship between the dependent variable and the regression
coefficient is linear.
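A minimal simple linear regression sketch in R; the data frame reg_df and the columns x and y are hypothetical:
reg_df <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                     y = c(2.1, 4.3, 5.9, 8.2, 9.8, 12.1, 14.2, 15.9))
model <- lm(y ~ x, data = reg_df)  # fit Y = b0 + b1*X + error by ordinary least squares
summary(model)                     # coefficients, R-square, t-tests, F-test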
Regression is a tool for finding the existence of an association relationship between a
dependent variable (Y) and one or more independent variables in the study. The
relationship can be either linear or nonlinear; regression captures a statistical relationship. Linear
regression means that the relationship is linear with respect to the regression parameters.
[Classification diagram: Regression → Simple Regression (Linear / Non-linear) and Multiple Regression (Linear / Non-linear)]
Regression is:
A method for establishing an association relationship between a response variable and a
few independent variables.
A method for generating new theories and hypotheses.
Broadly classified into two categories.
o Simple Regression: One independent variable, further classified as linear and non-linear.
o Multiple Regression: More than one independent variable, further classified as linear
and non-linear.
Regression Model Development Process
Regression Model: Assumptions
The following table provides a snapshot of the different assumptions and their impacts on
the Regression model.
Assumption: The error term follows a normal distribution. Impact: This condition is necessary for the reliability of the statistical tests (t and F).
Graph Data
In any data graph, look for the following:
Overall pattern
Striking deviations from that pattern
Overall Pattern
The overall pattern of a Scatter Plot is described by the following parameters:
Confidence Interval is used for the interpolation of data within the x range.
Prediction Interval (PI) is wider than the Confidence Interval (CI) because PI must take
account of the tendency of y to fluctuate from its mean value; CI simply needs to account
for the uncertainty in estimating the mean value.
The first is the coefficient of determination, used to check the goodness of fit of the regression.
The coefficient of determination is also called R-square, and it is a very frequently used, and sometimes
misused, metric in regression model building.
The second is the analysis of variance, where we check the overall fitness of the model developed. The
third is the t-test, where we check whether the independent and dependent variables are
statistically related; here we set the null hypothesis as β1 = 0 and the alternative
hypothesis as β1 ≠ 0.
Finally, we do a residual analysis. Residual analysis is a plot between the residuals (errors) and the
predicted values of the outcome variable. We use residual analysis to check the normality of the errors
and homoscedasticity; if there is any pattern in the residual plot, it may imply that we
have used an incorrect functional form. For example, we may have used a linear relationship
between Y and X instead of a log-linear or log-log relationship.
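A minimal residual-analysis sketch in R, assuming model is a fitted lm() object such as the one in the earlier simple regression sketch:
res <- residuals(model)     # residuals (errors)
fit_vals <- fitted(model)   # predicted values of the outcome variable
plot(fit_vals, res)         # look for patterns or funnel shapes (heteroscedasticity)
abline(h = 0, lty = 2)
qqnorm(res); qqline(res)    # check the normality assumption of the errors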
Standard error of estimate: the standard deviation of the errors (residuals), not the standard deviation of the outcome variable.
The R-square value is used to measure the goodness of fit of a model.
The standard error of estimate provides an estimate of the standard deviation of the regression errors.
The ANOVA / F-test checks the overall model significance.
The t-test validates the relationship between the dependent variable and an individual independent
variable.
Residual Plot - Introduction
Residual plot is a plot of error (or standardized error) against one of the following variables:
The dependent variable Y.
The independent variable X.
The standardized independent or dependent variable.
Here are some reasons why we should use residual analysis:
Analysis of residuals reveals whether the assumption of normally distributed errors holds.
Residual plots are used to check whether there is a heteroscedasticity problem (non-constant
variance of the error term).
Residual analysis can also indicate if there are any missing variables.
A residual plot can also reveal if the actual relationship is non-linear.
Points to Remember
Ordinary Least Squares method is used to estimate multiple regression coefficients.
Multi-collinearity can affect the statistical significance of the variable included in model
building.
Partial correlation coefficient measures the relationship between two variables (say Y and
X1) when the influence of all other variables (say X2, X3, ..., Xk) connected with these
two variables (Y and X1) are removed.
Part correlation coefficient measures the relationship between two variables (say Y and
X1) when the influence of all other variables (say X2, X3, ..., Xk) connected with these
two variables (Y and X1) are removed from one of the variables (X1).
Adjusted R-Square
Adjusted R-square penalizes R-square for the number of explanatory variables in the model; unlike R-square, it increases only when an added variable improves the model more than would be expected by chance.
Partial F-Test
Why do we need Partial F-Test?
We may have a model with a large number of variables and another model with fewer
variables. The partial F-test helps us understand whether the additional variables in the larger model
add value in explaining the variation in the response variable.
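A minimal partial F-test sketch in R using anova() on nested models; the data frame dat and the predictors x1, x2, x3 are simulated for illustration:
set.seed(42)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
dat$y <- 2 + 1.5 * dat$x1 + 0.5 * dat$x2 + rnorm(50)
reduced <- lm(y ~ x1, data = dat)             # smaller model
full <- lm(y ~ x1 + x2 + x3, data = dat)      # larger model with the additional variables
anova(reduced, full)                          # partial F-test: do x2 and x3 add explanatory value?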
Points to Remember
R-square and Adjusted R-square are used to test the overall model fitness.
F test is used to test the overall model statistical significance.
Partial F test is used to test portions of the model.
T test is used to test the statistical significance of individual explanatory variables.
Dummy Variable
Whenever we have a qualitative variable, we have to convert it using dummy variables (also called
indicator variables). Dummy variables take the value 0 or 1; they are binary variables. Therefore, when
we have a categorical (qualitative) variable with n categories, we use n − 1 dummy variables.
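A minimal dummy-coding sketch in R; lm() creates dummy variables automatically for factors, and model.matrix() shows the n − 1 dummy columns. The city variable with three categories is hypothetical:
city <- factor(c("Delhi", "Mumbai", "Chennai", "Mumbai", "Delhi"))  # hypothetical categorical variable
model.matrix(~ city)  # intercept plus n - 1 = 2 dummy (0/1) columns; the first level is the baseline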
Derived Variables
Whenever we have a dataset, we have a large number of variables. One of the decisions we
have to take is how we are going to use those variables. We can use the variables as they are, or we can
derive new variables from the existing ones. Two frequent ways of deriving new variables are
to take ratios or to create interaction variables.
For ratios, suppose we have a variable X1 and another variable X2; we can
derive a new variable X3 = X1/X2. Similarly, we can create a new variable that is the
product of X1 and X2. These are the two frequently used approaches for deriving new variables.
When the product of two variables is incorporated in the regression model, it is called an
interaction variable.
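A minimal sketch of ratio and interaction variables in an R regression; the data frame d and the predictors x1 and x2 are simulated for illustration:
set.seed(7)
d <- data.frame(x1 = runif(40, 1, 10), x2 = runif(40, 1, 10))
d$y <- 3 + 2 * d$x1 + 0.5 * d$x1 * d$x2 + rnorm(40)
d$ratio <- d$x1 / d$x2             # derived variable: ratio of two predictors
lm(y ~ x1 + x2 + x1:x2, data = d)  # x1:x2 adds the interaction (product) term
lm(y ~ x1 + ratio, data = d)       # regression using the derived ratio variable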
Multi-collinearity
Variance Inflation Factor
One of the measures used to identify the existence of multi-collinearity is called the variance
inflation factor (VIF). As the name suggests, it captures the amount by which the variance of an estimated
regression parameter is inflated due to correlation among the explanatory variables. Typically,
statisticians and analytics experts use a threshold of 4 or 10; a variance inflation factor of more than
4 (conservative) or 10 is treated as dangerous to the model.
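A minimal VIF check sketch in R; the data are simulated, the VIF is computed manually as 1 / (1 − Rj²), and the car package's vif() function (assumed to be installed) is shown as an alternative:
set.seed(3)
d <- data.frame(x1 = rnorm(60))
d$x2 <- d$x1 + rnorm(60, sd = 0.3)          # x2 highly correlated with x1 (multi-collinearity)
d$y <- 1 + d$x1 + d$x2 + rnorm(60)
m <- lm(y ~ x1 + x2, data = d)
r2 <- summary(lm(x1 ~ x2, data = d))$r.squared  # regress x1 on the other predictors
1 / (1 - r2)                                # VIF for x1; values above 4 (or 10) signal trouble
# car::vif(m)                               # equivalent, if the car package is installed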
Points to remember
n-1 dummy variables are created for a categorical variable with n categories.
Two frequent ways of deriving new variables from the existing explanatory variables are
ratios and interactions.
Multi-collinearity is nothing but the high correlation between explanatory variables.
Multi-collinearity can lead to unstable regression coefficients.
Variables Selection Methods
Points to remember
In the forward selection method, the entry variable is the one with the smallest p-value.
In the backward elimination method, all variables are entered into the equation and then
sequentially removed, starting with the most insignificant variable.
In step-wise regression, the entry variable is the one with the smallest p-value; at each step,
the independent variable not in the equation that has the smallest probability of F is entered, and
variables already in the equation may be removed if they become insignificant.
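A minimal variable-selection sketch using R's step() function; the data are simulated, and note that step() selects variables using AIC rather than p-values, which is a common practical substitute for the procedures above:
set.seed(11)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80), x3 = rnorm(80))
d$y <- 5 + 2 * d$x1 + rnorm(80)   # only x1 truly matters
full <- lm(y ~ x1 + x2 + x3, data = d)
step(full, direction = "backward")  # backward elimination
step(lm(y ~ 1, data = d), scope = ~ x1 + x2 + x3, direction = "forward")  # forward selection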