DexLab Analytics Business Analytics - Data Science - Study Material
DISCLAIMER
No part of this book may be reproduced or distributed in any form or by any electronic or
mechanical means including information storage and retrieval systems, without permission in
writing from the management.
Copyright © 2017-2018 DexLab Solutions Corp All Rights Reserved
Contents
1. Introduction to Analytics
2. Concept of Probability
3. Sampling Theory
4. Parametric Tests
1. Introduction to Analytics
Every decade or so, the business world invents another term for how it
extracts managerial and decision-making value from computerized data. In the
1970s the favoured term was “decision support systems,” accurately reflecting the
importance of a decision-centred approach to data analysis. In the early 80s,
“executive information systems” was the preferred nomenclature, which addressed
the use of these systems by senior managers.
Later in that decade, emphasis shifted to the more technical-sounding “online
analytical processing,” or OLAP. The 90s saw the rise of “business intelligence” as a
descriptor. In the middle of 2000’s first decade, “analytics” began to come into
favour, at least for the more statistical and mathematical forms of data analysis.
Each of these terms has its virtues and its ambiguities. No supreme being
has provided us with a clear, concise definition of what anything should be called, so
we mortals will continue to wrestle with appropriate terminology. It appears,
however, that another shift is taking place in the label for how we take advantage of
data to make better decisions and manage organizations. The new label is “business
analytics”.
In one sense, “business analytics” is simply the combination of business
intelligence and analytics. Such a combination reflects the increased importance of
quantitative analysis of data for understanding, prediction, and optimization.
Business intelligence is a vague term that primarily denoted reporting-related
activity—certainly useful, but perhaps somewhat commoditized. “Analytics” in some
organizations could be a somewhat academic activity that lacked clear business
objectives, but that has changed as companies increasingly compete on their
analytical capabilities. After a quick review of the shortcomings of these two terms
by themselves, I’ll provide a definition of the merged term “business analytics.”
Net Result:
Instead of the costly alternative of increasing production everywhere in the
world—which would have resulted in excess inventory and expensive shipping to
redistribute products later—the company had an accurate, just-in-time, precisely
targeted delivery approach that met customers’ needs at a dramatically lower cost.
Example: How do grocery cashiers know to hand you coupons you might
actually use?
Each Tuesday, you head to the grocery store and fill up your cart. The
cashier scans your items, then hands you a coupon for 50 cents off your favourite
brand of whole-grain cereal, which you didn't get today but were planning to buy
next week.
With hundreds of thousands of grocery items on the shelves, how do stores
know what you're most likely to buy? Computers using predictive analytics are
able to crunch terabytes and terabytes of a consumer's historical purchases to
figure out that your favourite whole-grain cereal was the one item missing from
your shopping basket that week. Further, the computer matches your past cereal
history to ongoing promotions in the store, and bingo - you receive a coupon for
the item you are most likely to buy.
Example: Why were the Oakland A's so successful in the early 2000s,
despite a low payroll?
During the early 2000s, the New York Yankees were the most acclaimed
team in Major League Baseball. But on the other side of the continent, the
Oakland A's were racking up success after success, with much less fanfare - and
much less money.
While the Yankees paid their star players tens of millions, the A’s managed to
be successful with a low payroll. How did they do it? When signing players, they
didn't just look at basic productivity values such as RBIs, home runs, and earned-
run averages. Instead, they analysed hundreds of detailed statistics from every
player and every game, attempting to predict future performance and production.
Some statistics were even obtained from videos of games using video recognition
techniques. This allowed the team to sign great players who may have been
lesser-known, but who were equally productive on the field. The A's started a
trend, and predictive analytics began to penetrate the world of sports with a
splash, with copycats using similar techniques. Perhaps predictive analytics will
someday help bring Major League salaries into line.
3. Types of Analytics
Descriptive: Analytics, which use data aggregation and data mining
techniques to provide insight into the past and answer: “What has happened?”
Descriptive analysis or statistics does exactly what the name implies: it
“describes”, or summarizes, raw data and makes it something that is interpretable by
humans. These are analytics that describe the past. The past refers to any point of
time that an event has occurred, whether it is one minute ago, or one year ago.
Descriptive analytics are useful because they allow us to learn from past behaviors,
and understand how they might influence future outcomes.
Predictive: Analytics, which use statistical models and forecasting techniques to
understand the future and answer: “What could happen?” No predictive model can tell us
what will happen with complete certainty. Companies use these statistics to forecast what might happen in the
future. This is because the foundation of predictive analytics is based on
probabilities.
Use Predictive analysis any time you need to know something about the
future, or fill in the information that you do not have.
Use prescriptive statistics anytime you need to provide users with advice on
what action to take.
Confirmatory Analysis
Inferential Statistics - Deductive Approach
Heavy reliance on probability models
Must accept untestable assumptions
Look for definite answers to specific questions
Emphasis on numerical calculations
Hypotheses determined at outset
Hypothesis tests and formal confidence interval estimation
Exploratory Analysis
Descriptive Statistics - Inductive Approach
Look for flexible ways to examine data without preconceptions
Attempt to evaluate validity of assumptions
Heavy reliance on graphical displays
Let data suggest questions
Focus on indications and approximate error magnitudes
5. Scales of Measurement
Nominal
Let’s start with the easiest one to understand. Nominal scales are used for
labelling variables, without any quantitative value. “Nominal” scales could simply
be called “labels.” Here are some examples, below. Notice that all of these scales are
mutually exclusive (no overlap) and none of them have any numerical significance.
A good way to remember all of this is that “nominal” sounds a lot like “name” and
nominal scales are kind of like “names” or labels.
Ordinal
With ordinal scales, it is the order of the values that is important and
significant, but the differences between the values are not really known.
Take a look at the example below. In each case, we know that a #4 is better
than a #3 or #2, but we don’t know–and cannot quantify–how much better it is. For
example, is the difference between “OK” and “Unhappy” the same as the difference
between “Very Happy” and “Happy?” We can’t say.
Ordinal scales are typically measures of non-numeric concepts like
satisfaction, happiness, discomfort, etc.
“Ordinal” is easy to remember because it sounds like “order” and that’s the key to
remember with “ordinal scales”–it is the order that matters, but that’s all you really
get from these.
Advanced note: The best way to determine central tendency on a set of ordinal
data is to use the mode or median; the mean cannot be defined from an ordinal set.
Interval
Interval scales are numeric scales in which we know not only the order, but
also the exact differences between the values. The classic example of an interval
scale is Celsius temperature because the difference between each value is the same.
For example, the difference between 60 and 50 degrees is a measurable 10 degrees,
as is the difference between 80 and 70 degrees. Time is another good example of an
interval scale in which the increments are known, consistent, and measurable.
Interval scales are nice because the realm of statistical analysis on these data
sets opens up. For example, central tendency can be measured by mode, median, or
mean; standard deviation can also be calculated.
Like the others, you can remember the key points of an “interval scale” pretty
easily. “Interval” itself means “space in between,” which is the important thing to
remember–interval scales not only tell us about order, but also about the value
between each item.
Here’s the problem with interval scales: they don’t have a “true zero.” For example, there is no such thing as “no temperature.” Without a true zero, it is impossible to compute ratios. With interval data, we can add and subtract, but cannot multiply or divide. Confused? Ok, consider this: 10 degrees + 10 degrees = 20 degrees. No problem there. 20 degrees is not twice as hot as 10 degrees, however, because there is no such thing as “no temperature” when it comes to the Celsius scale. I hope that makes sense. Bottom line, interval scales are great, but we cannot calculate ratios, which brings us to our last measurement scale…
Figure 5: Interval Scale of Measurement
Ratio
Ratio scales are the ultimate nirvana when it comes to measurement scales because they tell us about the order, they tell us the exact value between units, AND they also have an absolute zero–which allows for a wide range of both descriptive and inferential statistics to be applied. At the risk of repeating myself, everything above about interval data applies to ratio scales + ratio scales have a clear definition of zero. Good examples of ratio variables include height and weight.
Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These variables can be meaningfully added, subtracted, multiplied, divided (ratios). Central tendency can be measured by mode, median, or mean; measures of dispersion, such as standard deviation and coefficient of variation, can also be calculated from ratio scales.
Figure 6: Ratio Scale of Measurement
SUMMARY
In summary, nominal variables are used to “name,” or label a series of
values. Ordinal scales provide good information about the order of choices, such as
in a customer satisfaction survey. Interval scales give us the order of values + the
ability to quantify the difference between each one. Finally, Ratio scales give us the
ultimate–order, interval values, plus the ability to calculate ratios since a “true
zero” can be defined.
6. Attribute
Attributes are qualitative characters that cannot be numerically expressed.
Individuals possessing an attribute can be grouped into several disjoint classes.
Attributes may be of two types: ordinal and nominal.
Qualitative data are nonnumeric.
{Poor, Fair, Good, Better, Best}, colors (ignoring any physical causes), and
types of material {straw, sticks, bricks} are examples of qualitative data.
Qualitative data are often termed categorical data. Some books use the
terms individual and variable to reference the objects and characteristics
described by a set of data. They also stress the importance of exact definitions of
these variables, including what units they are recorded in. The reason the data
were collected is also important.
7. Variable
The term variable means a character of an item or an individual that can be
expressed in numeric terms. It is also called a quantitative character, and such
characters can be measured or counted.
Quantitative data are numeric.
Quantitative data are further classified as either discrete or continuous.
Discrete data are numeric data that have a finite number of possible values.
● When data represent counts, they are discrete. An example might be how
many students were absent on a given day. Counts are usually considered
exact and integer. Consider, however, if three tardies make an absence, then
aren't two tardies equal to 0.67 absences?
Mean (Arithmetic)
The mean (or average) is the most popular and well known measure of
central tendency. It can be used with both discrete and continuous data, although
its use is most often with continuous data (see our Types of Variable guide for data
types). The mean is equal to the sum of all the values in the data set divided by the
number of values in the data set.
So, if we have n values in a data set and they have values x1, x2, ..., xn, the
sample mean, usually denoted by x̄ (pronounced “x bar”), is:
x̄ = (x1 + x2 + ... + xn) / n
If we are working with a continuous data set and the values are given in
intervals, we calculate the mid-point of the interval then multiply it with the
frequency for calculating arithmetic mean.
Items | Mid-point (m) | Frequency (f) | fm
0-10 | 5 | 2 | 10
10-20 | 15 | 5 | 75
20-30 | 25 | 1 | 25
30-40 | 35 | 3 | 105
Total | | N = 11 | ∑fm = 215
Mean = ∑fm / N = 215 / 11 ≈ 19.55
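The same mid-point working can be reproduced in a few lines of Python (a small illustrative sketch using the table above):

# Arithmetic mean from a grouped frequency table (mid-point method)
midpoints   = [5, 15, 25, 35]       # class mid-points m
frequencies = [2, 5, 1, 3]          # class frequencies f

N  = sum(frequencies)                                      # 11
fm = sum(f * m for f, m in zip(frequencies, midpoints))    # 215
print(N, fm, round(fm / N, 2))                             # 11 215 19.55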
Mean (Geometric)
The arithmetic mean is relevant any time several quantities add together to
produce a total. The arithmetic mean answers the question, "if all the quantities
had the same value, what would that value have to be in order to achieve the same
total?"
In the same way, the geometric mean is relevant any time several quantities
multiply together to produce a product. The geometric mean answers the question,
"if all the quantities had the same value, what would that value have to be in order
to achieve the same product?"
Let us calculate the geometric mean of a discrete data set:
Given xi = 4, 9
Here, n = 2, so GM = (4 × 9)^(1/2) = √36 = 6
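As a quick check, Python's standard library gives the same answer (a sketch; statistics.geometric_mean needs Python 3.8 or newer):

import statistics

values = [4, 9]
print(statistics.geometric_mean(values))   # (4 * 9) ** (1/2) = 6.0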
Mean (Harmonic)
Harmonic mean is another measure of central tendency and also based on
mathematics footing like arithmetic mean and geometric mean. Like arithmetic
mean and geometric mean, harmonic mean is also useful for quantitative data.
Harmonic mean is defined in following terms:
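For n positive values x1, x2, ..., xn, the standard definition is:
HM = n / (1/x1 + 1/x2 + ... + 1/xn)
For example, the harmonic mean of 2 and 6 is 2 / (1/2 + 1/6) = 3.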
Staff 1 2 3 4 5 6 7 8 9 10
Salary 15k 18k 16k 14k 15k 15k 12k 17k 90k 95k
The mean salary for these ten staff is $30.7k. However, inspecting the raw
data suggests that this mean value might not be the best way to accurately reflect
the typical salary of a worker, as most workers have salaries in the $12k to 18k
range. The mean is being skewed by the two large salaries. Therefore, in this
situation, we would like to have a better measure of central tendency. As we will
find out later, taking the median would be a better measure of central tendency in
this situation.
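A short Python sketch of the same comparison, using the ten salaries above (in $k):

import statistics

salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

print(statistics.mean(salaries))    # 30.7  - pulled up by the two large salaries
print(statistics.median(salaries))  # 15.5  - closer to a typical salary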
Median
The median is the middle score for a set of data that has been arranged in
order of magnitude. The median is less affected by outliers and skewed data. In
order to calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92
We first need to rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89 92
Our median mark is the middle mark - in this case, 56 (highlighted in bold).
It is the middle mark because there are 5 scores before it and 5 scores after it. This
works fine when you have an odd number of scores, but what happens when you
have an even number of scores? What if you had only 10 scores? Well, you simply
have to take the middle two scores and average the result. So, if we look at the
example below:
65 55 89 56 35 14 56 55 87 45
14 35 45 55 55 56 56 65 87 89
Only now we have to take the 5th and 6th score in our data set and average them to
get a median of 55.5.
Example: Find the median of the following grouped frequency distribution.
Class interval | Frequency
0-99 | 26
100-199 | 32
200-299 | 65
300-399 | 75
400-499 | 60
500-599 | 42
Let us convert the class intervals given to class boundaries and construct the less
than type cumulative frequency distribution.
Class interval | Class boundaries | Frequency | Cumulative frequency (less than)
0-99 | 0-99.5 | 26 | 26
100-199 | 99.5-199.5 | 32 | 58
200-299 | 199.5-299.5 | 65 | 123
300-399 | 299.5-399.5 | 75 | 198
400-499 | 399.5-499.5 | 60 | 258
500-599 | 499.5-599.5 | 42 | 300
Here, N = 300, so N/2 = 300/2 = 150.
The cumulative frequency just greater than or equal to 150 is 198, so Fk = 198 and xk = 399.5;
the median class is the class for which the upper class boundary is xk = 399.5.
In other words, 299.5-399.5 is the median class, i.e. the class containing the median
value.
Using the formula for the median we have,
Median = xl + ((N/2 − F) / f) × h = 299.5 + ((150 − 123) / 75) × 100 = 335.5
where xl = 299.5 (lower class boundary of the median class), F = 123 (cumulative frequency below the median class), f = 75 (frequency of the median class) and h = 100 (width of the median class).
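The grouped-median formula can also be sketched in Python, using the class boundaries and frequencies from the table above:

# Grouped median: Median = xl + ((N/2 - F) / f) * h
boundaries  = [(0, 99.5), (99.5, 199.5), (199.5, 299.5),
               (299.5, 399.5), (399.5, 499.5), (499.5, 599.5)]
frequencies = [26, 32, 65, 75, 60, 42]

N = sum(frequencies)          # 300
half = N / 2                  # 150

cum = 0
for (lower, upper), f in zip(boundaries, frequencies):
    below = cum               # cumulative frequency below this class
    cum += f
    if cum >= half:           # first class whose cumulative frequency reaches N/2
        median = lower + (half - below) / f * (upper - lower)
        break

print(median)                 # 299.5 + (150 - 123) / 75 * 100 = 335.5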
Mode
The mode is the most frequent score in our data set. On a chart, it
is represented by the highest bar in a bar chart or histogram. You can, therefore,
sometimes consider the mode as being the most popular option. An example of a
mode is presented below:
Normally, the mode is used for categorical data where we wish to know which is the
most common category as illustrated below:
Considering an example: The number of points scored in a series of football games is listed below. Which score occurred most often?
7, 13, 18, 24, 9, 3, 18
Solution: Ordering the scores from least to greatest, we get: 3, 7, 9, 13, 18, 18, 24
Answer: The score which occurs most often is 18.
Figure 12: Highlighting the MODE
This was for ungrouped or discrete data; when we look at grouped data, the calculations are modified.
9. Measures of Dispersion
The values of a variable are generally not equal. In some cases the values are
very close to one another; again, in some cases they are markedly different from one
another. In order to get a proper idea about the overall nature of a given set of
values, it is necessary to know, besides average, the extent to which the given
values differ among themselves or equivalently how they are scattered about the
average. This feature of frequency distribution which represents the variability of
the given values or reflects how scattered the values are, is called its dispersion.
The Range
The Range is the difference between the lowest and highest values.
Standard Deviation
The Standard Deviation is a measure of how spread out numbers are.
Its symbol is σ (the Greek letter sigma)
The formula is easy: it is the square root of the Variance. So now you ask,
"What is the Variance?"
Variance
The Variance is defined as the average of the squared differences from the Mean.
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Mean = (600 + 470 + 170 + 430 + 300) / 5 = 1970 / 5 = 394,
so the mean (average) height is 394 mm. Let's plot this on the chart:
To calculate the Variance, take each difference from the mean, square it, and then average the
result:
Variance: σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5 = 108520 / 5 = 21704
Standard Deviation: σ = √21704 ≈ 147 mm
And the good thing about the Standard Deviation is that it is useful. Now we can
show which heights are within one Standard Deviation (147mm) of the Mean:
So, using the Standard Deviation we have a "standard" way of knowing what is
normal, and what is extra large or extra small.
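The same calculation can be checked in Python; statistics.pvariance and statistics.pstdev use the population formulas, which is what the example above does:

import statistics

heights = [600, 470, 170, 430, 300]            # mm

print(statistics.mean(heights))                # 394
print(statistics.pvariance(heights))           # 21704
print(round(statistics.pstdev(heights)))       # 147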
3 6
6 3
6 3
7 2
8 1
11 2
15 6
16 7
10. Quartiles
Quartiles are the values that divide a list of numbers into quarters.
● First put the list of numbers in order
● Then cut the list into four equal parts
● The Quartiles are at the "cuts"
Like this
Example: 24, 25, 26, 27, 30, 32, 40, 44, 50, 52, 55, 57
Cut the list into quarters: the cuts fall between 26 and 27, between 32 and 40, and between 50 and 52, so
Quartile 1 (Q1) = (26 + 27) / 2 = 26.5, Quartile 2 (Q2) = (32 + 40) / 2 = 36, and Quartile 3 (Q3) = (50 + 52) / 2 = 51.
Example for Skewness: Consider the college men’s heights data used below (class marks 61, 64, 67, 70 and 73 inches with frequencies 5, 18, 42, 27 and 8). A histogram shows that the data are skewed left, not symmetric.
But how highly skewed are they, compared to other data sets? To answer
this question, you have to compute the skewness.
Begin with the sample size and sample mean. (The sample size was given,
but it never hurts to check.)
n = 5+18+42+27+8 = 100
x̄ = (61×5 + 64×18 + 67×42 + 70×27 + 73×8) ÷ 100
x̄ = (305 + 1152 + 2814 + 1890 + 584) ÷ 100
x̄ = 6745 ÷ 100 = 67.45
Now, with the mean in hand, you can compute the skewness. (Of course in real life you’d probably use Excel or a statistics package, but it’s good to know where the numbers come from.)
Example for Kurtosis: Let’s continue with the example of the college men’s
heights, and compute the kurtosis of the data set. n = 100, x̄ = 67.45 inches, and the
variance m2 = 8.5275 in² were computed earlier. (A short calculation sketch follows the rule of thumb below.)
Rule of thumb
● Kurtosis < 3, Platykurtic
● Kurtosis = 3, Mesokurtic
● Kurtosis > 3, Leptokurtic
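To see where those kurtosis (and skewness) numbers come from, here is a sketch of the moment calculations for the heights data (class marks 61–73 inches with frequencies 5, 18, 42, 27, 8, as used above):

# Population moments for grouped data: m_k = sum(f * (x - mean)**k) / n
marks = [61, 64, 67, 70, 73]
freqs = [5, 18, 42, 27, 8]

n = sum(freqs)                                                 # 100
mean = sum(f * x for f, x in zip(freqs, marks)) / n            # 67.45

def moment(k):
    return sum(f * (x - mean) ** k for f, x in zip(freqs, marks)) / n

m2, m3, m4 = moment(2), moment(3), moment(4)                   # m2 = 8.5275
skewness = m3 / m2 ** 1.5          # about -0.11 (slightly skewed left)
kurtosis = m4 / m2 ** 2            # about 2.74 (< 3, so platykurtic)

print(round(m2, 4), round(skewness, 3), round(kurtosis, 3))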
14. Boxplot
Boxplots are used to better understand how values are spaced out in different
sets of data. When reviewing a boxplot, an outlier is defined as a data point that is
located outside the fences (“whiskers”) of the boxplot (e.g: outside 1.5 times the
interquartile range above the upper quartile and below the lower quartile).
This simplest possible box plot displays the full range of variation (from min to max), the likely range of variation (the IQR), and a typical value (the median). Not uncommonly, real datasets will display surprisingly high maximums or surprisingly low minimums called outliers. John Tukey has provided a precise definition for outliers:
Outliers are either 1.5×IQR or more above the third quartile or 1.5×IQR or more below the first quartile. The values Q1 – 1.5×IQR and Q3 + 1.5×IQR are the "fences" that mark off the "reasonable" values from the outlier values. Outliers lie outside the fences.
Figure 16: Boxplot
Example: Find the outliers, if any, for the following data set:
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7,
14.7, 14.7, 14.9, 15.1, 15.9, 16.4
To find out if there are any outliers, I first have to find the IQR. There are
fifteen data points, so the median will be at position (15 + 1) ÷ 2 = 8.
Then Q2 = 14.6. There are seven data points on either side of the median, so
Q1 is the fourth value in the list and Q3 is the twelfth: Q1 = 14.4 and Q3 = 14.9.
Then IQR = 14.9 – 14.4 = 0.5.
Outliers will be any points below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or
above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65.
Then the outliers are at 10.2, 15.9, and 16.4.
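The whole fence calculation can be sketched in Python, using the same quartile convention as above (quartiles taken as the medians of the lower and upper halves):

import statistics

data = sorted([10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
               14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4])

n  = len(data)                                # 15
q1 = statistics.median(data[:n // 2])         # 14.4
q3 = statistics.median(data[(n + 1) // 2:])   # 14.9
iqr = q3 - q1                                 # 0.5

lower_fence = q1 - 1.5 * iqr                  # 13.65
upper_fence = q3 + 1.5 * iqr                  # 15.65

print([x for x in data if x < lower_fence or x > upper_fence])   # [10.2, 15.9, 16.4]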
1. Concept of Probability
Many events can't be predicted with total certainty. The best we can say is
how likely they are to happen, using the idea of probability.
Tossing a Coin
When a coin is tossed, there are two possible outcomes: heads (H) or tails (T). We say that the probability of the coin landing H is ½, and the probability of the coin landing T is ½.
Throwing Dice
When a single die is thrown, there are six possible outcomes: 1, 2, 3, 4, 5, 6. The probability of any one of them is 1/6.
PROBABILITY
In general:
Probability of an event happening = (Number of ways it can happen) / (Total number of outcomes)
Example: There are 5 marbles in a bag: 4 are blue, and 1 is red. What is
the probability that a blue marble gets picked?
Number of ways it can happen: 4 (there are 4 blues)
Total number of outcomes: 5 (there are 5 marbles in total)
So the probability = ⅘=0.8
Probability Line
We can show probability on a Probability Line:
Example: Toss a coin 100 times, how many Heads will come up?
Probability says that heads have a ½ chance, so we can expect 50 Heads.
But when we actually try it we might get 48 heads, or 55 heads ... or anything
really, but in most cases it will be a number near 50.
2. Words
Some words have special meaning in Probability:
Tossing a coin, throwing dice, seeing what pizza people choose are all
examples of experiments.
Example Events:
● Getting a Tail when tossing a coin is an event
● Rolling a "5" is an event.
An event can include one or more possible outcomes:
● Choosing a "King" from a deck of cards (any of the 4 Kings) is an event
● Rolling an "even number" (2, 4 or 6) is also an event
Figure 18: Sample Space
Hey, let's use those words, so you get used to them:
Example: Alex wants to see how many times a "double" comes up when throwing 2 dice.
Each time Alex throws 2 dice is an Experiment. It is an Experiment because the result is uncertain.
The Event Alex is looking for is a "double", where both dice have the same number. It is made up of these 6 Sample Points: {1,1} {2,2} {3,3} {4,4} {5,5} and {6,6}
The Sample Space is all possible outcomes (36 Sample Points): {1,1} {1,2} {1,3} {1,4} ... {6,3} {6,4} {6,5} {6,6}
These are Alex's Results:
Experiment | Is it a Double?
{3,4} | No
{5,1} | No
{2,2} | Yes
{6,3} | No
... | ...
After 100 Experiments, Alex has 19 "double" Events. Is that close to what you would expect?
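A double has probability 6/36 = 1/6, so in 100 throws we would expect roughly 17 doubles, and Alex's 19 is quite close. A quick simulation sketch:

import random

random.seed(1)        # any seed; only makes the run repeatable

throws = 100
doubles = sum(1 for _ in range(throws)
              if random.randint(1, 6) == random.randint(1, 6))

print(doubles)        # typically somewhere near 17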
3. Types of Events
Mutually Exclusive: can't happen at the same time.
Examples:
● Turning left and turning right are Mutually Exclusive (you can't do both at
the same time)
● Tossing a coin: Heads and Tails are Mutually Exclusive
● Cards: Kings and Aces are Mutually Exclusive
What is not Mutually Exclusive:
● Turning left and scratching your head can happen at the same time
● Kings and Hearts, because we can have a King of Hearts!
Like here:
Probability
Let's look at the probabilities of Mutually Exclusive events. But first, a definition:
Mutually Exclusive:
When two events (call them "A" and "B") are Mutually Exclusive it
is impossible for them to happen together:
P(A and B) = 0
"The probability of A and B together equals 0 (impossible)"
But the probability of A or B is the sum of the individual probabilities:
P(A or B) = P(A) + P(B)
"The probability of A or B equals the probability of A plus the probability of B"
Special Notation
Instead of "and" you will often see the symbol ∩ (which is the "Intersection"
symbol used in Venn Diagrams).
Instead of "or" you will often see the symbol ∪ (the "Union" symbol)
SUMMARY
Mutually Exclusive
● A and B together is impossible: P(A and B) = 0
● A or B is the sum of A and B: P(A or B) = P(A) + P(B)
Exhaustive Events
When a sample space is divided into some mutually exclusive
events such that their union forms the sample space itself, then such events are
called exhaustive events.
OR,
When two or more events from the sample space collectively form the entire sample space, they are known
as collectively exhaustive events.
OR,
When at least one of the events must necessarily occur from the list of events,
then they are also known as exhaustive events.
Example: Let the sample space be S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, and consider the events x = {1, 2, 3}, y = {4, 5, 6} and z = {7, 8, 9, 10}.
Solution:
Events x, y, z are mutually exclusive events because
x ∩ y ∩ z = ∅
Now check whether the events are exhaustive events or not?
For this, take the union of all events;
x u y u z = {1, 2, 3} u {4, 5, 6} u {7, 8, 9, 10} = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} = S
Event x, y & z are exhaustive events, because they form a complete
sample space itself.
Equally Likely
If all the outcomes of a sample space have the same chance of occurrence,
then it is known as equally likely outcomes. It is not necessary that the outcomes
are equally likely, but during an experiment we shall assume that the outcomes are
equally likely outcomes in many cases.
For example:
1. Tossing a Coin: When a single fair coin is tossed, head and tail are assumed to be equally likely; this assumption
should be mentioned.
2. Tossing More than One Coin: In case of tossing more than one coin, it is assumed that
on all the coins, head and tail are equally likely.
3. Throwing a Die: There are six possible outcomes when rolling a single
die. In this case all six outcomes are assumed to be equally likely
outcomes. Each has a probability of occurrence of 1/6.
4. Drawing Balls from a Bag: This is the last case in which the probability of
occurrence is assumed to be equally likely. For example, a ball is selected
randomly from a bag having balls of different colors. In this case it is
assumed that each ball in the bag has an equal chance of being selected.
4. Random Variables
A Random Variable is a set of possible values from a random experiment.
So:
● We have an experiment (such as tossing a coin)
● We give values to each event
● The set of values is a Random Variable
Example: x + 2 = 6
In this case we can find that x=4
But a Random Variable is different ...
A Random Variable has a whole set of values ...
... and it could take on any of those values, randomly.
Example: X = {0, 1, 2, 3}
X could be 0, 1, 2 or 3, randomly.
And they might each have a different probability.
Probability
We can show the probability of any one value using this style: P(X = value) = probability of that value.
Example: Toss three coins and let X be the number of heads. The possible outcomes, and the value of X for each, are:
Outcome | Number of Heads (X)
HHH | 3
HHT | 2
HTH | 2
HTT | 1
THH | 2
THT | 1
TTH | 1
TTT | 0
Looking at the table we see just 1 case of Three Heads, but 3 cases of Two Heads,
3 cases of One Head, and 1 case of Zero Heads. So:
● P(X = 3) = 1/8
● P(X = 2) = 3/8
● P(X = 1) = 3/8
● P(X = 0) = 1/8
Example: Two dice are tossed and X is the sum of the scores. The table below gives the value of X for every combination of the 1st and 2nd die:
2nd Die \ 1st Die | 1 | 2 | 3 | 4 | 5 | 6
1 | 2 | 3 | 4 | 5 | 6 | 7
2 | 3 | 4 | 5 | 6 | 7 | 8
3 | 4 | 5 | 6 | 7 | 8 | 9
4 | 5 | 6 | 7 | 8 | 9 | 10
5 | 6 | 7 | 8 | 9 | 10 | 11
6 | 7 | 8 | 9 | 10 | 11 | 12
Let's count how often each value occurs, and work out the probabilities:
● 2 occurs just once, so P(X = 2) = 1/36
● 3 occurs twice, so P(X = 3) = 2/36 = 1/18
● 4 occurs three times, so P(X = 4) = 3/36 = 1/12
● 5 occurs four times, so P(X = 5) = 4/36 = 1/9
● 6 occurs five times, so P(X = 6) = 5/36
● 7 occurs six times, so P(X = 7) = 6/36 = 1/6
● 8 occurs five times, so P(X = 8) = 5/36
● 9 occurs four times, so P(X = 9) = 4/36 = 1/9
● 10 occurs three times, so P(X = 10) = 3/36 = 1/12
● 11 occurs twice, so P(X = 11) = 2/36 = 1/18
● 12 occurs just once, so P(X = 12) = 1/36
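The same probabilities can be generated by enumerating all 36 equally likely outcomes; a small sketch:

from collections import Counter
from fractions import Fraction

counts = Counter(d1 + d2 for d1 in range(1, 7) for d2 in range(1, 7))

for total in range(2, 13):
    print(total, Fraction(counts[total], 36))   # 2 1/36, 3 1/18, ..., 7 1/6, ..., 12 1/36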
SUMMARY
A Random Variable is a set of possible values from a random experiment.
The set of possible values is called the Sample Space.
A Random Variable is given a capital letter, such as X or Z.
Random Variables can be either discrete or continuous:
● Discrete Data can only take certain values (such as 1, 2, 3, 4, 5)
● Continuous Data can take any value within a range (such as a person's
height)
Examples of discrete random variables:
number of students present
number of red marbles in a jar
number of heads when flipping three coins
students’ grade level
Examples of continuous random variables:
height of students in class
weight of students in class
time it takes to get to school
distance traveled between classes
For the two-dice example, the probability distribution of X can be written as:
x | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
P(X = x) | 1/36 | 2/36 | 3/36 | 4/36 | 5/36 | 6/36 | 5/36 | 4/36 | 3/36 | 2/36 | 1/36
The probability mass function of a discrete random variable X gives the probability of each value x:
P(x) = Pr(X = x).
Often, we denote the random variable of the probability mass function with a
subscript, so we may write
P_X(x) = Pr(X = x).
For a function P(x) to be a valid probability mass function, P(x) must be non-
negative for each possible value x. Moreover, the random variable must take on
some value in the set of possible values with probability one, so we require
that the values of P(x) must sum to one. In equations, the requirements are
P(x) ≥ 0 for all x, and ∑x P(x) = 1,
where the sum is implicitly over all possible values of X.
Example:
Experiment: Toss a fair coin 3 times
Sample Space: S = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT}
Random variable X is the number of heads obtained in the three tosses.
Thus X : S → R looks like this
X(HHH) = 3
X(HHT) = X(HTH) = X(THH)=2
X(HTT) = X(THT) = X(TTH)=1
X(TTT) = 0
Thus, Range(X) = {0,1,2,3} and
P(X = 0) =1/8, P(X = 1) =3/8 , P(X = 2) = 3/8, P(X = 3) =1/8
Hence the probability mass function is given by
P(0) =1/8
P(1) =3/8
P(2) =3/8
P(3) =1/8
Probability Distributions
An example will make clear the relationship between random variables and
probability distributions. Suppose you flip a coin two times. This simple statistical
experiment can have four possible outcomes: HH, HT, TH, and TT. Now, let the
variable X represent the number of Heads that result from this experiment. The
variable X can take on the values 0, 1, or 2. In this example, X is a random variable, because its value is determined
by the outcome of a statistical experiment.
A probability distribution is a table or an equation that links each
outcome of a statistical experiment with its probability of occurrence. Consider the
coin flip experiment described above. The table below, which associates each
outcome with its probability, is an example of a probability distribution.
Number of heads | Probability
0 | 0.25
1 | 0.50
2 | 0.25
The above table represents the probability distribution of the random variable X.
Consider the following statistical experiment. You flip a coin 2 times and
count the number of times the coin lands on heads. This is a binomial experiment
because:
The experiment consists of repeated trials. We flip a coin 2 times.
Each trial can result in just two possible outcomes - heads or tails.
The probability of success is constant - 0.5 on every trial.
The trials are independent; that is, getting heads on one trial does not affect
whether we get heads on other trials.
Notation
The following notation is helpful, when we talk about binomial probability.
x: The number of successes that result from the binomial experiment.
n: The number of trials in the binomial experiment.
P: The probability of success on an individual trial.
Q: The probability of failure on an individual trial. (This is equal to 1 - P.)
n!: The factorial of n (also known as n factorial).
b(x; n, P): Binomial probability - the probability that an n-trial binomial
experiment results in exactly x successes, when the probability of success on
an individual trial is P.
nCr: The number of combinations of n things, taken r at a time.
Binomial Distribution
A binomial random variable is the number of successes x in n repeated
trials of a binomial experiment. The probability distribution of a binomial random
variable is called a binomial distribution.
Suppose we flip a coin two times and count the number of heads (successes).
The binomial random variable is the number of heads, which can take on values of
0, 1, or 2. The binomial distribution is presented below.
Number of heads | Probability
0 | 0.25
1 | 0.50
2 | 0.25
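These probabilities follow from the binomial formula b(x; n, P) = nCx · P^x · (1 − P)^(n − x). A sketch for n = 2 coin flips with P = 0.5 (math.comb needs Python 3.8 or newer):

from math import comb

n, p = 2, 0.5

for x in range(n + 1):
    print(x, comb(n, x) * p ** x * (1 - p) ** (n - x))   # 0 0.25, 1 0.5, 2 0.25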
● If a new drug is introduced to cure a disease, it either cures the disease (it’s
successful) or it doesn’t cure the disease (it’s a failure).
● If you purchase a lottery ticket, you’re either going to win money, or you
aren’t.
Basically, anything you can think of that can only be a success or a failure can be
represented by a binomial distribution.
8. Poisson Distribution
A Poisson distribution is the probability distribution that results from a
Poisson experiment.
Attributes of a Poisson Experiment
A Poisson Experiment is a statistical experiment that has the following
properties:
▪ The experiment results in outcomes that can be classified as successes or
failures.
▪ The average number of successes (μ) that occurs in a specified region is
known.
▪ The probability that a success will occur is proportional to the size of the
region.
▪ The probability that a success will occur in an extremely small region is
virtually zero.
Note that the specified region could take many forms. For instance, it could be a
length, an area, a volume, a period of time, etc.
Notation
The following notation is helpful, when we talk about the Poisson distribution.
▪ e: A constant equal to approximately 2.71828. (Actually, e is the base of the
natural logarithm system.)
▪ μ: The mean number of successes that occur in a specified region.
▪ x: The actual number of successes that occur in a specified region.
▪ P(x; μ): The Poisson probability that exactly x successes occur in a Poisson
experiment, when the mean number of successes is μ.
Poisson Distribution
A Poisson random variable is the number of successes that result from a
Poisson experiment. The probability distribution of a Poisson random variable is
called a Poisson distribution.
Given the mean number of successes (μ) that occur in a specified region, we
can compute the Poisson probability based on the following formula:
P(x; μ) = (e^(−μ) · μ^x) / x!
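A sketch of the formula in Python, evaluated for an assumed mean of μ = 2 successes and x = 3 (both values chosen only for illustration):

from math import exp, factorial

def poisson_pmf(x, mu):
    # P(x; mu) = e**(-mu) * mu**x / x!
    return exp(-mu) * mu ** x / factorial(x)

print(round(poisson_pmf(3, 2), 4))    # 0.1804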
Any function f(x) satisfying both the conditions (i) f(x) ≥ 0 for all x and (ii) ∫ f(x) dx = 1 (over the whole range of X) may be accepted as a density
function.
A continuous random variable is a random variable that can take on any
value from a continuum, such as the set of all real numbers or an interval. We
cannot form a sum over such a set of numbers. (There are too many, since such a
continuum is uncountable.) Instead, we replace the sum used for discrete random
variables with an integral over the set of possible values.
For a continuous random variable X, we cannot form its probability
distribution function by assigning a probability that X is exactly equal to each
value. The probability distribution function we must use in the case is called
a probability density function, which essentially assigns the probability that X is
near each value. In probability theory, a probability density function (PDF),
or density of a continuous random variable, is a function that describes the
relative likelihood for this random variable to take on a given value. The probability
of the random variable falling within a particular range of values is given by
the integral of this variable’s density over that range—that is, it is given by the
area under the density function but above the horizontal axis and between the
lowest and greatest values of the range. The probability density function is
nonnegative everywhere, and its integral over the entire space is equal to one.
9. Normal Distribution
The normal distribution is the most widely known and used of all
distributions. Because the normal distribution approximates many natural
phenomena so well, it has developed into a standard of reference for many
probability problems.
The normal density function with mean µ and standard deviation σ is:
f(x; µ, σ) = (1 / √(2πσ²)) · e^(−(x − µ)² / (2σ²))
The notation N(µ, σ²) means normally distributed with mean µ and variance
σ². If we say X ∼ N(µ, σ²) we mean that X is distributed N(µ, σ²).
● About 2/3 of all cases fall within one standard deviation of the mean, that is
P(µ - σ ≤ X ≤ µ + σ) = .6826
● About 95% of cases lie within 2 standard deviations of the mean, that is
P(µ - 2σ ≤ X ≤ µ + 2σ) = .9544
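Both figures can be verified from the standard normal CDF, Φ(z) = (1 + erf(z/√2)) / 2; a quick sketch (the printed values match .6826 and .9544 above up to rounding):

from math import erf, sqrt

def std_normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

print(round(std_normal_cdf(1) - std_normal_cdf(-1), 4))   # 0.6827
print(round(std_normal_cdf(2) - std_normal_cdf(-2), 4))   # 0.9545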
Normally Distributed
1. The values of mean, median and mode should be close to each other.
2. The values of skewness and kurtosis should be close to zero.
3. The standard deviation should be low
4. Median should lie exactly in between the upper and lower quartile
1. Sampling Theory
In statistics, quality assurance, & survey methodology, sampling is
concerned with the selection of a subset of individuals from within a statistical
population to estimate characteristics of the whole population.
Each observation measures one or more properties (such as weight, location,
color) of observable bodies distinguished as independent objects or individuals.
2. Saves time.
3. More reliable.
5. Scientific in nature.
Population
A typical statistical investigation is interested in studying the various
characteristics relating to items or individuals belonging to a group. The group of
individuals under study is known as the population or the universe. A population
containing a finite number of objects is called a finite population; a population
containing an infinite or a very large number of objects is called an infinite population.
Sample
A finite representative subset of a population is called a sample. It is selected
from a population with the objective of investigating the characteristics of that population.
There are two types of Sampling-Probabilistic and Nonprobabilistic
Probability Sampling
It is a procedure of drawing a sample from a population. It enables us to draw
conclusions about the characteristics of the population after studying only those
objects included in the sample. The theory that provides guidelines for choosing a
sample from a population is called the sampling theory. The theory aims at obtaining
the optimum results in respect of the characteristics of the population, within the
available resources at our disposal in terms of time, manpower and money. Secondly,
the theory of sampling aims at providing us with the best possible
estimator of the population characteristics through proper construction of the sample.
The ‘n’ units of the sample are drawn from the population in such
a way that at each drawing, each of the ‘N’ members of the population
gets the same probability 1/N of being selected. Hence this method is
called simple random sampling with replacement (SRSWR). Clearly, the
same unit of the population may occur more than once in a sample. Thus,
there are N^n possible samples, regard being had to the order in which the n sample
units occur, and each such sample has the probability 1/N^n.
And if the units of a sample are drawn one by one from the population in
such a way that after every drawing the selected unit is not returned to the
population, then this is called simple random sampling without
replacement (SRSWOR)
The ‘n’ members of the sample are drawn one by one, but the
members once drawn are not returned to the population; at each
stage the remaining part of the population is given the same probability of
being included in the sample. This method is called SRSWOR.
Therefore, under SRSWOR, at any rth draw there remain
(N − r + 1) units and each unit has the probability 1/(N − r + 1) of being
drawn.
It may be noted that if one takes n individuals at a time
from the population, giving equal probability to each observation,
then the total number of possible samples is NCn, i.e., the number of combinations of n
members out of the N members of the population forms the total
number of possible samples in simple random sampling without
replacement.
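A minimal sketch of both schemes with Python's random module, for an assumed small population of N = 10 labelled units (the labels and sample size are only for illustration):

import random

random.seed(0)                     # repeatable illustration
population = list(range(1, 11))    # N = 10 labelled units
n = 4                              # sample size

print(random.choices(population, k=n))   # SRSWR: with replacement, units may repeat
print(random.sample(population, k=n))    # SRSWOR: without replacement, no repeats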
A. Lottery Method
A lottery is drawn by writing a number or the names of various units and
then putting them in a container.
They are completely mixed and then certain numbers are picked up from the
container.
Those picked are taken up for the sampling.
Convenience Sampling
Convenience sampling is probably the most common of all sampling techniques. With convenience sampling, the samples are selected because they are accessible to the researcher. Subjects are chosen simply because they are easy to recruit. This technique is considered easiest, cheapest and least time consuming.
Figure 22: Convenience Sampling
Judgmental Sampling
Judgmental sampling is more commonly known as purposive sampling. In this type of sampling, subjects are chosen to be part of the sample with a specific purpose in mind. With judgmental sampling, the researcher believes that some subjects are more fit for the research compared to other individuals. This is the reason why they are purposively chosen as subjects.
Figure 23: Judgemental Sampling
Properties of Estimators
To choose between estimating principles, we look into the properties satisfied
by them. These properties are classified into two groups- small sample properties
and large sample properties.
There is no hard and fast rule to distinguish between small and large samples; as a
working definition, a small sample has at most 30 observations while a large sample
has more than 30 observations.
● Efficiency
An estimator θ̂ is an efficient estimator of θ if the following two conditions are satisfied together:
I. θ̂ is unbiased, and
II. θ̂ has the smallest variance among all unbiased estimators of θ.
● Linearity
An estimator is said to have the property of linearity if it is possible to
express it as a linear combination of sample observations. Linearity is associated
with linear (i.e., additive) calculation rather than multiplicative or non-linear
calculation.
● Asymptotic Unbiasedness
θ̂ is an asymptotically unbiased estimator of θ if:
lim (n → ∞) E(θ̂) = θ
This means that the estimator θ̂, which is otherwise biased, becomes
unbiased as the sample size approaches infinity.
If an estimator is unbiased, it is also asymptotically unbiased, but the reverse
is not necessarily true.
● Consistency
Whether or not an estimator is consistent is understood by looking at the
behaviour of its bias and variance as the sample size approaches infinity.
If the increase in sample size reduces the bias (if there were one) and the variance of
the estimate, and this continues until both bias and variance become zero as n → ∞,
then the estimator is said to be consistent.
So, θ̂ is a consistent estimator of θ if
lim (n → ∞) E[θ̂ − θ] = 0
and lim (n → ∞) Var(θ̂) = 0
It has already been seen that the sample mean and the sample median are consistent
estimators for the population mean μ. It is to be remembered that the A.M. or sample mean is affected by the
existence of extreme values, whereas the sample median is free from any such
effects of outliers, because the sample median gives the middlemost value of the
distribution and the outliers lie at the extreme ends. Despite this fact, it is
seen that the sample mean is more preferred as an estimator for the population mean than
the sample median. This is because the sample mean as an estimator is seen to contain all the
information in the sample about the population parameter.
This property is known as sufficiency. An estimator based on sample
observations is a sufficient estimator for a parameter if it contains all the information
in the sample regarding the population parameter. In this sense, sufficiency is the
most important property for the choice of an estimator.
Table 3: Showing the comparison between sample Statistics and Population Parameters
Estimation in Statistics
In statistics, estimation refers to the process by which one makes inferences
about a population, based on information obtained from a sample.
▪ Point Estimate
A point estimate of a population parameter is a single value of a statistic. For
example, the sample mean x is a point estimate of the population mean μ. Similarly,
the sample proportion p is a point estimate of the population proportion P.
▪ Interval Estimate
An interval estimate is defined by two numbers, between which a population
parameter is said to lie. For example, a < x < b is an interval estimate of the
population mean μ. It indicates that the population mean is greater than a but less
than b.
Sampling Distribution
Let a sample of size ‘n’ be drawn from a finite population of size ‘N’. Then the total
number of possible samples is NCn = k. For each of these ‘k’ samples, we can compute some
statistic t(x1, x2, x3, ..., xn). The set of the values of the statistic so obtained, one for
each sample, constitutes the sampling distribution of the statistic. E.g., t1, t2, t3, ..., tk
determine the sampling distribution of the statistic t. In other words, the statistic t
may be regarded as a random variable which can take the values t1, t2, t3, ..., tk.
Sampling distributions are mostly continuous in nature. The most common
types of sampling distribution are:
● Gamma distribution
● Exponential distribution
● Chi square distribution
● t distribution
● F distribution
3. Testing of Hypothesis
The entire process of statistical inference is mainly inductive in nature, i.e., it
is based on deciding the characteristics of the population on the basis of a sample
study. Such a decision always involves an element of risk, the risk of taking
wrong decisions. It is here that the modern theory of probability plays a vital role,
and the statistical technique that helps us in arriving at the criterion for such
decisions is known as the testing of hypothesis.
A hypothesis is a statistical statement or a conjecture about the value of a
parameter. The basic hypothesis being tested is called the null hypothesis (H0). It is
sometimes regarded as representing the current state of knowledge and belief about the
value being tested. In a test the null hypothesis is tested against an alternative
hypothesis (H1).
There are two types of statistical hypothesis:
● When a hypothesis is completely specified, it is called a simple hypothesis.
● When all the parameters of a distribution are not known, the hypothesis is
known as a composite hypothesis.
The actual decision is based on the value of a suitable function of the data,
the test statistic. The set of all possible values of the test statistic which are consistent
with H0 is the acceptance region, and all those values of the test statistic which are
inconsistent with H0 form the critical region. One important condition which
must be kept in mind for efficient working of a test statistic is that its distribution
must be well specified.
The truth or fallacy of a statistical hypothesis is judged on the basis of the information
contained in a sample. The rejection or the acceptance of the hypothesis is
contingent on the consistency or the inconsistency of H0 with the sample
observations. Therefore, it should be clearly borne in mind that acceptance of a
statistical hypothesis is due to the insufficient evidence provided by the sample to
reject it, and it does not necessarily mean that it is true.
Judgement of Null Hypothesis (H0) | H0 is Valid/True | H0 is Invalid/False
Reject | Type I Error (False Positive) | Correct Inference (True Positive)
Fail to Reject (accept) | Correct Inference (True Negative) | Type II Error (False Negative)
Example:
Hypothesis: "The evidence produced before the court proves that this man is
guilty."
Null Hypothesis (H0): "This man is innocent."
A type I error occurs when convicting an innocent person. A type II error
occurs when letting a guilty person go free.
A positive correct outcome occurs when convicting a guilty person. A negative
correct outcome occurs when letting an innocent person go free.
4. Significance Level
A Type I error occurs when the researcher rejects a null hypothesis when it
is true. The probability of committing a Type I error is called the significance level,
and is often denoted by α.
Researchers are curious about the level of significance in their studies
and research; this means they attempt to find the chance that their statistical
test will go against their data and hypothesis, even if the hypothesis is actually
true.
The significance level α is the probability that the test statistic will fall in the
critical region when the null hypothesis is actually true.
Scientists usually turn these results around indicating that there is only a
5% chance that the results were due to statistical errors and chance. They will
refute the null hypothesis and support the alternative. Falsifiability ensures that
the hypothesis is never completely accepted, only that the null is rejected.
Popular levels of significance are 10% (0.1), 5% (0.05), 1% (0.01), 0.5% (0.005), and 0.1% (0.001). If a test of significance gives a p-value lower than the significance level α, the null hypothesis is rejected.
Table 6: Table value of tests at different level of significance
5. P-Value
The p-value is the exact probability of committing a Type I error: the level of
marginal significance within a statistical hypothesis test, representing the probability
of the occurrence of a given event. It is used as an alternative to rejection points to
provide the smallest level of significance at which the null hypothesis would be
rejected. The smaller the p-value, the stronger the evidence is in favor of the
alternative hypothesis.
P-values are calculated using p-value tables, or spreadsheet /
statistical software.
For Example
This concept may sound confusing and impractical, but consider a simple
example - suppose you work for a company that produces running shoes:
You need to plan production for the number of pairs of shoes your company
should make in each size for men and for women. You don't want to base your
production plans on the anecdotal evidence that men usually have bigger feet than
women, you need hard data to base your plans on. Therefore, you should look at a
statistical study that shows the correlation between gender and foot size.
If the report's p-value was only 2%, this would be a statistically significant
result. You could reasonably use the study's data to prepare your company's
production plans, because the 2% p-value indicates there is only a 2% chance that
the connection between foot size and gender was the result of chance/error. On the
other hand, if the p-value was 20%, it would not be reasonable to use the study as a
basis for your production plans, since there would be a 20% chance that the
relationship presented in the study could be due to random chance alone.
Statistical Significance
To determine if an observed outcome is statistically significant, we compare
the values of alpha and the p -value. There are two possibilities that emerge:
● The p-value is less than or equal to alpha (p ≤ α). In this case we reject the
null hypothesis. When this happens we say that the result is statistically
significant. In other words, we are reasonably sure that there is something
besides chance alone that gave us an observed sample.
● The p-value is greater than alpha (p > α). In this case we fail to reject the null
hypothesis. When this happens we say that the result is not statistically
significant. In other words, we are reasonably sure that our observed data
can be explained by chance alone.
The implication of the above is that the smaller the value of α is, the more
difficult it is to claim that a result is statistically significant. On the other hand, the
larger the value of alpha is, the easier it is to claim that a result is statistically
significant. Coupled with this, however, is the higher probability that what we
observed can be attributed to chance.
6. Confidence Interval
Statisticians use a confidence interval to express the degree of uncertainty
associated with a sample statistic. A confidence interval is an interval
estimate combined with a probability statement.
For example, suppose a statistician conducted a survey and computed an
interval estimate, based on survey data. The statistician might use a confidence
level to describe uncertainty associated with the interval estimate. He/she might
describe the interval estimate as a "95% confidence interval". This means that if we
used the same sampling method to select different samples and computed an
interval estimate for each sample, we would expect the true population parameter
to fall within the interval estimates 95% of the time. In the language of hypothesis
testing, the 100(1 − α)% confidence interval established in this way is known as the region of
acceptance (of the null hypothesis) and the region(s) outside the confidence interval
is (are) called the region(s) of rejection (of H0) or the critical region(s). As noted
previously, the confidence limits, the endpoints of the confidence interval, are also
called critical values.
A confidence interval is thus a term used in inferential statistics that measures the probability that a
population parameter will fall between two set values. The confidence interval can
take any number of probabilities, with the most common being 95% or 99%.
In other words, a confidence interval is the probability that a value will fall
between an upper and lower bound of a probability distribution. For example, given
a 99% confidence interval, stock XYZ's return will fall between -6.7% and +8.3%
over the next year. In layman's terms, we are 99% confident that the returns of
holding XYZ stock over the next year will fall between -6.7% and +8.3%.
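A sketch of a 95% confidence interval for a mean when the population standard deviation is treated as known; the values x̄ = 67.45, σ = 2.92 and n = 100 are assumed here only for illustration (loosely based on the heights example earlier):

from math import sqrt

x_bar, sigma, n = 67.45, 2.92, 100
z = 1.96                                   # critical value for 95% confidence

margin = z * sigma / sqrt(n)
print(round(x_bar - margin, 2), round(x_bar + margin, 2))   # roughly 66.88 68.02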
1. Parametric Tests
In the literal meaning of the terms, a parametric statistical test is one that
makes assumptions about the parameters (defining properties) of the population
distribution(s) from which one's data are drawn, while a non-parametric test is
one that makes no such assumptions.
Parametric Assumptions
● Interval or ratio scale of measurement (approximately interval)
● Random sampling from a defined population
● Samples are independent/dependent (varies by statistic)
● Characteristic is normally distributed in the population
● Population variances are equal (if two or more groups/variables in
the design)
2. Z Test
A Z-test is any statistical test for which the distribution of the test statistic
under the null hypothesis can be approximated by a normal distribution.
Suppose that in a particular geographic region, the mean and standard
deviation of scores on a reading test are 100 points, and 12 points, respectively. Our
interest is in the scores of 55 students in a particular school who received a mean
score of 96. We can ask whether this mean score is significantly lower than the
regional mean — that is, are the students in this school comparable to a simple
random sample of 55 students from the region as a whole, or are their scores
surprisingly low?
Assumptions
The parent population from which the sample is drawn should be normal
The sample observations are independent, i.e., the given sample is random
The population standard deviation σ is known.
Example: Blood glucose levels for obese patients have a mean of 100 with a
standard deviation of 15. A researcher thinks that a diet high in raw cornstarch
will have a positive or negative effect on blood glucose levels. A sample of 30
patients who have tried the raw cornstarch diet have a mean glucose level of 140.
Test the hypothesis that the raw cornstarch had an effect.
Step 5: Find the test statistic using this formula:
Z = (x̄ − μ0) / (σ / √n)
z = (140 − 100) / (15 / √30) = 14.60
Step 6: If Step 5 is less than -1.96 or greater than 1.96 (Step 3), reject the null
hypothesis. In this case, it is greater, so you can reject the null.
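The same steps in Python, using the numbers from the example; the two-sided p-value is obtained from the standard normal CDF via math.erf:

from math import erf, sqrt

mu0, sigma = 100, 15          # hypothesised mean and known population sd
x_bar, n   = 140, 30          # sample mean and sample size

z = (x_bar - mu0) / (sigma / sqrt(n))                     # about 14.6
p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))     # two-sided p-value

print(round(z, 2), p_value)   # p-value is essentially 0, so reject the null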
3. T Test
A t-test is any statistical hypothesis test in which the test statistic follows a
Student's t distribution if the null hypothesis is supported. Among the most
frequently used t-tests are:
● A one-sample location test of whether the mean of a normally distributed
population has a value specified in a null hypothesis.
● A two sample location test of the null hypothesis that the means of two
normally distributed populations are equal.
● A test of the null hypothesis that the difference between two responses
measured on the same statistical unit has a mean value of zero.
● A test of whether the slope of a regression line differs significantly from zero.
PROCEDURE
Set up the hypothesis:
A. Null Hypothesis: assumes that there are no significant differences between
the population mean and the sample mean.
B. Alternative Hypothesis: assumes that there is a significant difference
between the population mean and the sample mean.
i. Calculate the sample standard deviation using this formula:
S = √( Σ(X − X̄)² / (n − 1) )
Where,
S = standard deviation
X = an individual sample observation, X̄ = the sample mean
n = number of observations in the sample
ii. Calculate the value of the one sample t-test by using this
formula:
t = (X̄ − μ) / (S / √n)
Where,
t = one sample t-test value
μ = population mean
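A sketch of the full one sample t-test calculation on an assumed small sample (the five values and the hypothesised mean μ = 50 below are made up purely for illustration):

from math import sqrt
import statistics

sample = [52.1, 48.3, 55.0, 51.2, 49.8]   # assumed data
mu = 50                                   # hypothesised population mean

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)              # sample standard deviation (divisor n - 1)

t = (x_bar - mu) / (s / sqrt(n))
print(round(t, 3))    # compare with the t table at n - 1 = 4 degrees of freedom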
4. Hypothesis Testing
In hypothesis testing, statistical decisions are made to decide whether or not
the population mean and the sample mean are different. In hypothesis testing, we
will compare the calculated value with the table value. If the calculated value is
greater than the table value, then we will reject the null hypothesis, and accept the
alternative hypothesis.
Assumptions:
1. Dependent variables should be normally distributed.
2. Samples drawn from the population should be random.
3. Cases of the samples should be independent.
4. We should know the population mean.
Assumptions
Along with the single-sample t-test, this test is one of the most widely used tests. However, it can be used only if the background assumptions are satisfied.
● The populations from which the samples have been drawn should be normal; appropriate statistical methods exist for testing this assumption. One needs to note that the normality assumption has to be tested individually and separately for the two samples. It has, however, been shown that minor departures from normality do not affect this test, which is indeed an advantage.
● The standard deviations of the populations should be equal, i.e. σX² = σY² = σ², where σ² is unknown. This assumption can be tested by the F-test.
Conceptual Examples

Question: Does the presence of a certain kind of mycorrhizal fungi enhance the growth of a certain kind of plant?
Strategy: Begin with a "subject pool" of seeds of the type of plant in question. Randomly sort them into two groups, A and B. Plant and grow them under conditions that are identical in every respect except one: namely, that the seeds of group A (the experimental group) are grown in a soil that contains the fungus, while those of group B (the control group) are grown in a soil that does not contain the fungus. After some specified period of time, harvest the plants of both groups and take the relevant measure of their respective degrees of growth. If the presence of the fungus does enhance growth, the average measure should prove greater for group A than for group B.

Question: Do two strains of mice, A and B, differ with respect to their ability to learn to avoid an aversive stimulus?
Strategy: With this type of situation you are in effect starting out with two subject pools, one for strain A and one for strain B. Draw a random sample of size Na from pool A and another of size Nb from pool B. Run the members of each group through a standard aversive-conditioning procedure, measuring for each one how well and quickly the avoidance behavior is acquired. Any difference between the avoidance-learning abilities of the two strains should manifest itself as a difference between their respective group means.
Hypothesis Testing
NOTE: If one is performing hand calculations using the UNPOOLED method, the
choice of degrees of freedom can be made by choosing the smaller of n1−1 and n2−1.
t = d̄ / √(s² / n)
Where d̄ is the mean difference between the two samples, s² is the sample variance, n is the sample size and t is a paired sample t-test with n − 1 degrees of freedom.
An alternate formula for the paired sample t-test is:
t = Σd / √[ (n(Σd²) − (Σd)²) / (n − 1) ]
Assumptions
1. Only matched pairs can be used to perform the test.
2. Normal distributions are assumed.
3. The variance of the two samples is equal.
4. Cases must be independent of each other.
Example:
Trace metals in drinking water affect the flavour and an unusually high
concentration can pose a health hazard. Ten pairs of data were taken measuring
zinc concentration in bottom water and surface water.
Does the data suggest that the true average concentration in the bottom water
exceeds that of surface water?
Bottom water   Surface water
0.430          0.415
0.266          0.238
0.567          0.390
0.531          0.410
0.707          0.605
0.716          0.609
0.651          0.632
0.589          0.523
0.469          0.411
0.723          0.612
Thus, we conclude that the difference may come from a normal distribution.
d̄ = 0.0804, s_d = 0.0523, n = 10
t* = d̄ / (s_d / √n) = 0.0804 / (0.0523 / √10) = 4.86
Step 5. Check whether the test statistic falls in the rejection region and determine whether to reject H0.
t* = 4.86 > 1.833
Reject H0: the data suggest that the true average zinc concentration in the bottom water exceeds that of the surface water.
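A minimal Python sketch of this paired test, using the ten data pairs listed above (SciPy's ttest_rel reports a two-sided p-value, so it is halved here for the one-sided alternative):

import numpy as np
from scipy import stats

bottom  = np.array([0.430, 0.266, 0.567, 0.531, 0.707, 0.716, 0.651, 0.589, 0.469, 0.723])
surface = np.array([0.415, 0.238, 0.390, 0.410, 0.605, 0.609, 0.632, 0.523, 0.411, 0.612])

d = bottom - surface
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))   # about 4.86, as computed above

t_scipy, p_two_sided = stats.ttest_rel(bottom, surface)
print(round(t_manual, 2), round(t_scipy, 2), p_two_sided / 2)   # one-sided p-value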
Association between Variables 5
1. Association between Variables
Very frequently social scientists want to determine the strength of the
association of two or more variables. For example, one might want to know if
greater population size is associated with higher crime rates or whether there are
any differences between numbers employed by sex and race. For categorical data
such as sex, race, occupation, and place of birth, tables, called contingency tables,
that show the counts of persons who simultaneously fall within the various
categories of two or more variables are created. The Bureau of the Census reports
many tables in this form such as sex by age by race or sex by occupation by region.
For continuous data such as population, age, income, and housing, the strength of the association can be measured through correlation statistics.
The alternative hypothesis is that knowing the level of Variable A can help
you predict the level of Variable B.
Note: Support for the alternative hypothesis suggests that the variables are
related; but the relationship is not necessarily causal, in the sense that one
variable "causes" the other.
Interpret Results
If the sample findings are unlikely, given the null hypothesis, the researcher
rejects the null hypothesis. Typically, this involves comparing the P-value to
the significance level, and rejecting the null hypothesis when the P-value is less
than the significance level.
Problem
A public opinion poll surveyed a simple random sample of 1000 voters.
Respondents were classified by gender (male or female) and by voting preference
(Republican, Democrat, or Independent). Results are shown in the contingency
table below.
                 Voting Preferences
               Republican   Democrat   Independent   Row total
Column total       450         450         100          1000
DF = (r - 1) * (c - 1) = (2 - 1) * (3 - 1) = 2
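A minimal Python sketch of the corresponding chi-square test of independence. Only the column totals survive in the table above, so the male/female cell counts below are hypothetical, chosen merely to be consistent with column totals of 450, 450 and 100 and an overall total of 1000:

import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [200, 150, 50],   # male:   Republican, Democrat, Independent (hypothetical counts)
    [250, 300, 50],   # female: Republican, Democrat, Independent (hypothetical counts)
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(dof)               # 2, matching DF = (r - 1) * (c - 1) above
print(chi2, p_value)     # reject independence if p_value < the chosen significance level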
2. Scatterplots
A scatterplot is used to graphically represent the relationship between two
variables. Explore the relationship between scatterplots and correlations, the
different types of correlations, how to interpret scatterplots, and more.
SCATTERPLOT
Imagine that you are interested in studying patterns in individuals with
children under the age of 10. You collect data from 25 individuals who have at least
one child. After you've collected your data, you enter it into a table.
You try to draw conclusions about the data from the table; however, you find
yourself overwhelmed. You decide an easier way to analyze the data is by
comparing the variables two at a time. In order to see how the variables relate to
each other, you create scatterplots.
So what is a scatterplot? A scatterplot is a graph that is used to plot the
data points for two variables. Each scatterplot has a horizontal axis (x-axis) and a
vertical axis (y-axis). One variable is plotted on each axis. Scatterplots are made up
of marks; each mark represents one study participant's measures on the variables
that are on the x-axis and y-axis of the scatterplot.
Most scatterplots contain a line of best fit, which is a straight line drawn
through the center of the data points that best represents the trend of the data.
Scatterplots provide a visual representation of the correlation, or relationship
between the two variables.
One of the most commonly used formulas in stats is Pearson’s correlation
coefficient formula. In fact, if you’re taking a basic stats class, this is the one you’ll
probably use:
r = Σ(xi − x̄)(yi − ȳ) / [ √Σ(xi − x̄)² × √Σ(yi − ȳ)² ]   (sums taken over i = 1 to n)
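A minimal Python sketch that translates the formula directly and checks it against SciPy's built-in function, on hypothetical data:

import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0])    # hypothetical variable 1
y = np.array([1.5, 3.0, 4.5, 6.0, 8.5])    # hypothetical variable 2

# Sum of cross-deviations divided by the product of the root summed squared deviations
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))
r_manual = num / den

r_scipy, _ = stats.pearsonr(x, y)           # should agree with r_manual
print(r_manual, r_scipy)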
3. Types of Correlation
All correlations have two properties: strength and direction. The strength of
a correlation is determined by its numerical value. The direction of the correlation
is determined by whether the correlation is positive or negative.
● Positive correlation: Both variables move in the same direction. In other
words, as one variable increases, the other variable also increases. As one
variable decreases, the other variable also decreases.
o e.g., years of education and yearly salary are positively correlated.
● Negative correlation: The variables move in opposite directions. As one
variable increases, the other variable decreases. As one variable decreases,
the other variable increases.
o e.g., hours spent sleeping and hours spent awake are negatively correlated.
No Correlations
What does it mean to say that two variables have no correlation? It means that there is no apparent relationship between the two variables.
For example, there is no correlation between shoe size and salary. This means that high scores on shoe size are just as likely to occur with high scores on salary as they are with low scores on salary.
If your line of best fit is horizontal or vertical like the scatterplots on the top row, or
if you are unable to draw a line of best fit because there is no pattern in the data
points, then there is little or no correlation.
Strength
The strength of a correlation indicates how strong the relationship is between
the two variables. The strength is determined by the numerical value of the
correlation. A correlation of 1, whether it is +1 or -1, is a perfect correlation. In
perfect correlations, the data points lie directly on the line of fit. The further the
data are from the line of fit, the weaker the correlation. A correlation of 0 indicates
that there is no correlation. The following should be considered when determining
the strength of a correlation:
The closer a positive correlation lies to +1, the stronger it is.
e.g., a correlation of +.87 is stronger than a correlation of +.42.
The closer a negative correlation is to -1, the stronger it is.
e.g., a correlation of -.84 is stronger than a correlation of -.31.
Interpretations of Scatterplots
So what can we learn from scatterplots? Let's create scatterplots using some
of the variables in our table. Let's first compare age to Internet use. Now let's put
this on a scatterplot. Age is plotted on the y-axis of the scatterplot and Internet
usage is plotted on the x-axis.
We see that there is a negative correlation between age and Internet usage. That means that as age increases, the amount of time spent on the Internet declines, and vice versa. The direction of the scatterplot is a negative correlation! In the upper right corner of the scatterplot, we see r = -.87. Since r signifies the correlation, this means that our correlation is -.87.
Figure 28: Scatterplot between Age and Internet usage
Partial Correlation
A correlation between two variables in which the effects of other variables are
held constant is known as partial correlation.
The partial correlation for 1 and 2 with controlling variable 3 is given by:
r12.3 = (r12 − r13·r23) / [√(1 − r13²) √(1 − r23²)].
Examples:
1. Study of partial correlation between price and demand would involve studying
the relationship between price and demand excluding the effect of money supply,
exports, etc.
2. We might find that the ordinary correlation between blood pressure and blood cholesterol is a strong positive correlation. We could potentially find
a very small partial correlation between these two variables, after we have taken
into account the age of the subject. If this were the case, this might suggest that
both variables are related to age, and the observed correlation is only due to their
common relationship to age.
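A minimal Python sketch of the r12.3 formula, applied to hypothetical pairwise correlations in the spirit of the blood pressure / cholesterol / age example above:

import math

def partial_corr(r12, r13, r23):
    """Correlation between variables 1 and 2, controlling for variable 3."""
    return (r12 - r13 * r23) / (math.sqrt(1 - r13 ** 2) * math.sqrt(1 - r23 ** 2))

# Hypothetical values: blood pressure vs cholesterol (r12), and each of them vs age (r13, r23)
print(partial_corr(r12=0.80, r13=0.85, r23=0.88))   # roughly 0.21, much weaker than 0.80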
ANOVA 6
(Analysis of Variance)
1. Introduction
Analysis of Variance (ANOVA) is a hypothesis-testing technique used to test
the equality of two or more population (or treatment) means by examining the
variances of samples that are taken. ANOVA allows one to determine whether the
differences between the samples are simply due to random error (sampling error) or whether there are systematic treatment effects that cause the mean in one group to differ from the mean in another. ANOVA is also used for testing the overall significance of a regression.
2. One-Way ANOVA
A One-Way ANOVA (Analysis of Variance) is a statistical technique
by which we can test if three or more means are equal. It tests if the value
of a single variable differs significantly among three or more levels of a
factor.
We can say we have a framework for one-way ANOVA when we have a single
factor with three or more levels and multiple observations at each level.
In this kind of layout, we can calculate the mean of the observations within
each level of our factor.
The concepts of factor, levels and multiple observations at each level can be
best understood by an example.
Assumptions
For the results to be valid, some assumptions need to be checked to hold before the technique is applied. These are:
Assumptions of ANOVA:
(i) All populations involved follow a normal distribution.
(ii) All populations have the same variance (or standard deviation).
(iii) The samples are randomly selected and independent of one another.
Advantages
One of the principal advantages of this technique is that the number of observations need not be the same in each group.
Additionally, the layout of the design and the statistical analysis are simple.
their respective occupations. On the other hand, significance would imply that
stress affects different age groups differently.
Hypothesis Testing
Formally, the null hypothesis to be tested is of the form:
H0: All the age groups have equal stress on the average or μ1 = μ2 = μ3 ,
where μ1, μ2, μ3 are mean stress scores for the three age groups.
H1: The mean stress of at least one age group is significantly different.
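A minimal Python sketch of this one-way ANOVA, with hypothetical stress scores for the three age groups (the numbers are illustrative only):

from scipy import stats

group1 = [60, 55, 62, 58, 64]   # hypothetical stress scores, age < 40
group2 = [52, 50, 55, 51, 53]   # hypothetical stress scores, age 40-55
group3 = [48, 47, 50, 46, 49]   # hypothetical stress scores, age > 55

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)          # a small p-value means at least one group mean differs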
3. Two-Way ANOVA
A Two-Way ANOVA is useful when we desire to compare the effect of
multiple levels of two factors and we have multiple observations at each
level.
One-Way ANOVA compares three or more levels of one factor. But some
experiments involve two factors each with multiple levels in which case it is
appropriate to use Two-Way ANOVA.
Assumptions
The assumptions in both versions remain the same:
● normality,
● independence, and
● equality of variance.
Advantages
● An important advantage of this design is that it is more efficient than its one-way counterpart. There are two assignable sources of variation - age and gender in our example - and this helps to reduce error variation, thereby making this design more efficient.
● Unlike One-Way ANOVA, it enables us to test the effect of two factors at the
same time.
● One can also test for the independence of the factors provided there is more than one observation in each cell. The only restriction is that the number of observations in each cell has to be equal (there is no such restriction in the case of one-way ANOVA).
Suppose, continuing the occupational stress example, that the employees have been classified into three groups or levels:
● age less than 40,
● 40 to 55
● above 55
In addition employees have been labeled into gender classification (levels):
● male
● female
In this design, factor age has three levels and gender two. In all, there are 3 x
2 = 6 groups or cells. With this layout, we obtain scores on occupational stress from
employee(s) belonging to the six cells.
Hypothesis Testing
In the basic version there are two null hypotheses to be tested.
● H01: All the age groups have equal stress on the average
● H02: Both the gender groups have equal stress on the average.
In the second version, a third hypothesis is also tested:
● H03: The two factors are independent or that interaction effect is not present.
The computational aspect involves computing F-statistic for each hypothesis.
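A minimal Python sketch of the age-by-gender layout described above, using the statsmodels library on hypothetical scores (two observations per cell so that the interaction hypothesis H03 can also be tested):

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    "stress": [60, 62, 55, 57, 50, 52, 58, 59, 53, 54, 48, 49],        # hypothetical scores
    "age":    ["<40", "<40", "40-55", "40-55", ">55", ">55"] * 2,
    "gender": ["male"] * 6 + ["female"] * 6,
})

model = ols("stress ~ C(age) + C(gender) + C(age):C(gender)", data=data).fit()
print(sm.stats.anova_lm(model, typ=2))   # one F-test each for H01, H02 and H03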
Factor Analysis 7
In analytics, we always have a motive of explaining the causation of an event.
To describe the causal relationship, we use regression analysis. Linear regression is
an important tool for predictive analytics. There are some set of assumptions that
we need to check before applying linear regression to the data. Multicollinearity is one of the most important conditions that statisticians check for before going ahead with the analysis.
Multicollinearity is a state of very high intercorrelations or inter-
associations among the independent variables. It is therefore a type of disturbance
in the data, and if present in the data the statistical inferences made about the data
may not be reliable.
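Multicollinearity is often detected with the variance inflation factor (VIF); values far above roughly 5–10 are the usual warning sign. A minimal Python sketch on simulated data (the threshold and the data are illustrative, not from the original material):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 2 * x1 + rng.normal(scale=0.1, size=100)    # nearly a linear copy of x1
x3 = rng.normal(size=100)

X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))   # x1 and x2 show very large VIFs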
Remedial Measures
● Increasing sample size
● Transformation of variables
● Dropping variables
These measures are difficult to apply to real-life data. Most of the time, increasing the sample size is prohibitively expensive, and in predictive analysis dropping variables is not a suitable way of handling the data. So, we go for FACTOR ANALYSIS.
Curse of Dimensionality
For understanding factor analysis, it is extremely important to understand what the "curse of dimensionality" is. The fact that the number of samples required per variable increases exponentially with the number of variables, in order to maintain a given level of accuracy, is called the "Curse of Dimensionality". Grouping is a fundamental idea here: homogeneous items have similar characteristics. If there are too many variables without homogeneous characteristics, then grouping becomes difficult.
When we have a data set with heterogeneous variables, we call it an n-dimensional data set (n-dimensional meaning there are n variables). In such a data set every point becomes effectively unique, so we cannot group the points. When n is very large, the variance measures do not work properly; variance becomes an ineffective and inefficient parameter.
Let's take a very simple example: suppose a girl likes a guy. She will never go to him and ask him directly; instead, she uses a set of indirect questions to learn his feelings for her.
Example: Bonus and direct salary are similar from an employee's perspective.
While applying SVD nothing is added to or subtracted from the data; only the reference point changes. The information in the data does not change. A new coordinate system is created by changing the reference point.
Graphically:
The figure above shows the rotation technique used in PCA and the dropping of the low-variation axis in Factor Analysis.
• The first principal component accounts for as much of the variability in the
data as possible, and each succeeding component accounts for as much of the
remaining variability as possible.
• Each common factor may influence two or more measured variables, while each unique factor influences only one measured variable and does not explain correlations among measured variables.
4. Factor Loadings
To obtain a principal component, each of the weights of an eigenvector is multiplied by the square root of the principal component's associated eigenvalue. These newly generated weights are called factor loadings and represent the correlation of each item with the given principal component.
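A minimal Python sketch of this computation: the eigenvectors of the correlation matrix are scaled by the square roots of their eigenvalues to give the loadings. The data are simulated purely for illustration:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.3 * rng.normal(size=200)     # make two items correlated

R = np.corrcoef(X, rowvar=False)                   # correlation matrix of the items
eigenvalues, eigenvectors = np.linalg.eigh(R)      # eigh: for symmetric matrices
order = np.argsort(eigenvalues)[::-1]              # sort components by variance explained
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

loadings = eigenvectors * np.sqrt(eigenvalues)     # each column scaled by sqrt(eigenvalue)
print(loadings.round(2))                           # correlation of each item with each component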
7. Problem of Factor Loadings
Initially, the weights are distributed across all the variables, so it is not possible to identify the underlying factor of one or more variables. To remove this problem, we apply a rotation to the axes.
Orthogonal Rotations
VARIMAX (which simplifies the factors) resolves the problem of ties. When there is a tie between two factor correlation values, we can rotate the axes so that each variable becomes more strongly correlated with one factor. This approach maximizes the variance of the loadings. As it is an orthogonal system, even after Varimax the components remain perpendicular to each other.
Oblique Rotations
PROMAX: the problem with an oblique rotation is that it makes the factors correlated. By contrast, Varimax rotation is used in principal component analysis so that the axes are rotated to a position in which the sum of the variances of the loadings is the maximum possible.
Cluster Analysis 8
1. Cluster Analysis
Cluster analysis is a multivariate method which aims to classify a sample of
subjects (or objects) on the basis of a set of measured variables into a number of
different groups such that similar subjects are placed in the same group. An
example where this might be used is in the field of psychiatry, where the
characterisation of patients on the basis of clusters of symptoms can be useful in the
identification of an appropriate form of therapy. In marketing, it may be useful to
identify distinct groups of potential customers so that, for example, advertising can
be appropriately targeted.
Hierarchical methods
Agglomerative methods, in which subjects start in their own separate
cluster. The two ’closest’ (most similar) clusters are then combined and this is done
repeatedly until all subjects are in one cluster. At the end, the optimum number of
clusters is then chosen out of all cluster solutions.
Divisive methods, in which all subjects start in the same cluster and the
above strategy is applied in reverse until every subject is in a separate cluster.
Agglomerative methods are used more often than divisive methods, so this handout
will concentrate on the former rather than the latter.
Non-hierarchical methods
(often known as k-means clustering methods)
3. Similarity
Euclidean distance
In general, if you have p variables X1, X2, . . . , Xp measured on a sample of n
subjects, the observed data for subject i can be denoted by xi1, xi2, . . . , xip and the
observed data for subject j by xj1, xj2, . . . , xjp. The Euclidean distance between
these two subjects is given by
d_E(x, y) = √[ (x1 − y1)² + (x2 − y2)² + ... + (xp − yp)² ] = √[ Σᵢ₌₁ᵖ (xi − yi)² ]
Cosine similarity
It is a measure of similarity between two non-zero vectors of an inner product
space that measures the cosine of the angle between them. The cosine of 0° is 1, and
it is less than 1 for any other angle. It is thus a judgment of orientation and not
magnitude: two vectors with the same orientation have a cosine similarity of 1, two
vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a
similarity of -1, independent of their magnitude. Cosine similarity is particularly
used in positive space, where the outcome is neatly bounded in [0,1]. The name
derives from the term "direction cosine": in this case, note that unit vectors are
maximally "similar" if they're parallel and maximally "dissimilar" if they're
orthogonal (perpendicular). It should not escape the alert reader's attention that
this is analogous to cosine, which is unity (maximum value) when the segments
subtend a zero angle and zero (uncorrelated) when the segments are perpendicular.
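A minimal Python sketch of both measures for a single pair of hypothetical p-dimensional observations:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.0, 1.0, 5.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))                        # the d_E(x, y) formula above
cosine = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # cosine of the angle between x and y

print(euclidean, cosine)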
It is a good idea to try two or three of the above methods. If the methods agree reasonably well, then the results will be that much more believable.
If there is a large jump in the distance at which clusters are merged from one stage to another, then this suggests that at one stage clusters that are relatively close together were joined whereas, at the following stage, the clusters that were joined were relatively far apart. This implies that the optimum number of clusters may be the number present just before that large jump in distance. This is easier to understand by actually looking at a dendrogram.
Linear Regression 9
1. Linear Regression
In a cause and effect relationship, the independent variable is the cause,
and the dependent variable is the effect. Least squares linear regression is a
method for predicting the value of a dependent variable Y, based on the value of an
independent variable X.
Yi = (β0 + β1·Xi) + εi
Here Yi is the outcome that we want to predict and Xi is the i-th score on the predictor variable. The intercept β0 and the slope β1 are the parameters in the model and are known as regression coefficients.
Figure 37: Population Regression Line
There is a residual term εi which represents the difference between the score predicted by the line and the i-th actual score of the dependent variable. This term reflects the fact that our model will not fit the collected data perfectly. With regression we strive to find the line that best describes the data.
Ingredients:
Y is a continuous response variable (dependent variable).
X is an explanatory or predictor variable (independent variable).
Y is the variable we’re mainly interested in understanding, and we want to
make use of x for explanatory reasons.
Correlation: examines the relationship between two variables using a standardized unit. However, most applications use raw units as an input.
Regression: examines the relationship between one dependent variable and one or more independent variables. Calculations may use either raw unit values or standardized units as input.
For example, consider Yi = a + b·Xi, where a* = estimated intercept for 'a' and b* = estimated slope coefficient for 'b'. The estimated regression equation will be: Yi* = a* + b*·Xi.
It is important to validate that a* represents the intercept in the population. If 'a' differs from 'a*', then there must be sufficient evidence to show that the difference is just an observable difference and not a significant one, i.e., that the observed difference is only the result of sampling fluctuation.
In other words, the estimated intercept (a*) is the value of Y when X = 0. So, intuitively, what does the intercept capture? It captures those factors apart from X which can influence the variable Y; it captures the average behaviour of Y that is not captured by X. In the hypothesis-testing framework, then:
H0: a = a*
v/s
H1: a <> a*.
The claim would hold if H0 is accepted at the specified level of significance. If the average of Y is Ȳ and the average of X is X̄, then a* = Ȳ − b*·X̄. Similarly, we can also state this for b*.
The estimated slope (b*): when the total variation in X is Var(X), the total variation in Y with respect to X is Cov(X, Y); when the total variation in X is 1, the variation in Y with respect to X is Cov(X, Y)/Var(X) = b*. Therefore, the estimated parameters are:
a* = Ȳ − b*·X̄
b* = Cov(X, Y) / Var(X)
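A minimal Python sketch of these two estimates, checked against numpy's least-squares line fit on hypothetical data:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # hypothetical predictor
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # hypothetical response

b_star = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # b* = Cov(X, Y) / Var(X)
a_star = y.mean() - b_star * x.mean()                      # a* = Y-bar - b* * X-bar

slope, intercept = np.polyfit(x, y, 1)                     # should match b_star and a_star
print(a_star, b_star, intercept, slope)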
Coefficient of Determination
The coefficient of determination (R2) for a linear regression model with one
independent variable is:
r² = [ Σ(xi − x̄)(yi − ȳ) / (N·σx·σy) ]²
where N is the number of observations used to fit the model, Σ is the summation symbol, xi is the x value for observation i, x̄ is the mean x value, yi is the y value for observation i, ȳ is the mean y value, σx is the standard deviation of x, and σy is the standard deviation of y.
If you know the linear correlation (r) between two variables, then the
coefficient of determination (R2) is easily computed using the following formula:
R2 = r2.
Adjusted R2
The adjusted R-squared is a modified version of R-squared that has
been adjusted for the number of predictors in the model. The adjusted R-
squared increases only if the new term improves the model more than would be
expected by chance. It decreases when a predictor improves the model by less than
expected by chance.
Example:
A fund has a sample R-squared value close to 0.5 and it is most likely offering
higher risk-adjusted returns with the sample size of 50 for 5 predictors.
Given,
Sample size n = 50
Number of predictors k = 5
Sample R = 0.5, so R² = 0.25
To Find,
Adjusted R-square value
Solution:
Adjusted R² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]
= 1 − [(1 − 0.25)(50 − 1) / (50 − 5 − 1)]
= 1 − 0.8352
= 0.1648
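A minimal Python sketch of this calculation (treating the reported 0.5 as R, so that R² = 0.25, as in the solution above):

def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2(r2=0.25, n=50, k=5), 4))   # 0.1648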
stock price. MLR could be used to model the impact that each of these variables has
on stock's price.
Logistic Regression 10
1. Logistic Regression
The binary logistic model is used to predict a binary response based on one or
more predictor variables (features). That is, it is used in estimating the parameters
of a qualitative response model. The probabilities describing the possible outcomes
of a single trial are modeled, as a function of the explanatory (predictor) variables,
using a logistic function. Frequently (and hereafter in this chapter) "logistic regression" is used to refer specifically to the problem in which the dependent
variable is binary—that is, the number of available categories is two—while
problems with more than two categories are referred to as multinomial logistic
regression, or, if the multiple categories are ordered, as ordinal logistic regression.
Logistic regression measures the relationship between the categorical
dependent variable and one or more independent variables, which are usually (but
not necessarily) continuous, by estimating probabilities. Thus, it treats the same set
of problems as does probit regression using similar techniques; the first assumes
a logistic function and the second a standard normal distribution function.
The prediction is based on the use of one or several predictors (numerical and
categorical).
A linear regression is not appropriate for predicting the value of a binary
variable for two reasons:
● A linear regression will predict values outside the acceptable range (e.g. predicting probabilities outside the range 0 to 1).
● Since a dichotomous experiment can have only one of two possible values for each trial, the residuals will not be normally distributed about the predicted line.
A logistic regression, on the other hand, produces a logistic curve, which is limited to values between 0 and 1. Logistic regression is similar to a linear regression, but the curve is constructed using the natural logarithm of the "odds" of the target variable, rather than the probability. Moreover, the predictors do not have to be normally distributed or have equal variance in each group.
In the logistic regression the constant (b0) moves the curve left and right and
the slope (b1) defines the steepness of the curve. By simple transformation, the
logistic regression equation can be written in terms of an odds ratio.
Finally, taking the natural log of both sides, we can write the equation in
terms of log-odds (logit) which is a linear function of the predictors. The coefficient
(b1) is the amount the logit (log-odds) changes with a one unit change in x.
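A minimal Python sketch of a binary logistic regression fit with the statsmodels library, showing the estimated b0 and b1 on the log-odds (logit) scale and the same coefficients expressed as odds ratios. The data are simulated for illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(size=200)
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))         # simulate with true b0 = -0.5, b1 = 1.2
y = rng.binomial(1, p)

model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
print(model.params)                              # estimated b0, b1 on the logit scale
print(np.exp(model.params))                      # the same coefficients as odds ratios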
2. Odds Ratio
Odds = (# of times something happens) / (# of times it does NOT happen)
  Example (getting heads in 1 flip of a coin): odds = 1/1 = 1 (or 1:1)
  Example (getting a 1 in a single roll of a die): odds = 1/5 = 0.2 (or 1:5)
Probability = (# of times something happens) / (# of times it could happen)
  Example (getting heads in 1 flip of a coin): probability = 1/2 = .5 (or 50%)
  Example (getting a 1 in a single roll of a die): probability = 1/6 = .16 (or 16%)
Let's begin with probability. Probabilities range between 0 and 1. Let's say
that the probability of success is .8, thus
p = .8
Then the probability of failure is
q = 1 - p = .2
Odds are determined from probabilities and range between 0 and infinity. Odds are
defined as the ratio of the probability of success and the probability of failure. The
odds of success are
odds(success) = p/(1-p) or p/q = .8/.2 = 4,
that is, the odds of success are 4 to 1. The odds of failure would be
odds(failure) = q/p = .2/.8 = .25.
This looks a little strange but it is really saying that the odds of failure are 1
to 4. The odds of success and the odds of failure are just reciprocals of one another,
i.e., 1/4 = .25 and 1/.25 = 4. Next, we will add another variable to the equation so
that we can compute an odds ratio.
Example:
This example is adapted from Pedhazur (1997). Suppose that seven out of
10 males are admitted to an engineering school while three of 10 females are
admitted. The probabilities for admitting a male are,
p = 7/10 = .7 q = 1 - .7 = .3
If you are male, the probability of being admitted is 0.7 and the probability of not
being admitted is 0.3.
Here are the same probabilities for females,
p = 3/10 = .3 q = 1 - .3 = .7
If you are female it is just the opposite, the probability of being admitted is 0.3
and the probability of not being admitted is 0.7.
Now we can use the probabilities to compute the odds of admission for both males
and females,
odds(male) = .7/.3 = 2.33333
odds(female) = .3/.7 = .42857
Next, we compute the odds ratio for admission,
OR = 2.3333/.42857 = 5.44
Thus, for a male, the odds of being admitted are 5.44 times larger than the odds
for a female being admitted.
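A minimal Python sketch that reproduces this arithmetic:

p_male, p_female = 7 / 10, 3 / 10

odds_male = p_male / (1 - p_male)         # 2.3333
odds_female = p_female / (1 - p_female)   # 0.42857
odds_ratio = odds_male / odds_female      # about 5.44

print(round(odds_male, 4), round(odds_female, 5), round(odds_ratio, 2))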
We then work out the likelihood of observing the data we actually did observe
under each of these hypotheses. The result is usually a very small number, and to
make it easier to handle, the natural logarithm is used, producing a log likelihood
(LL). Probabilities are always less than one, so LL’s are always negative. Log
likelihood is the basis for tests of a logistic model.
The likelihood ratio test is based on –2LL. It is a test of the significance of the difference between the likelihood ratio (–2LL) for the researcher's model with predictors and the likelihood ratio (–2LL) for the baseline model containing only the constant; this difference is called the model chi-square.
The Wald statistic for an individual predictor j is:
Wj = Bj² / SE²(Bj)
2. Hosmer–Lemeshow Test
The Hosmer–Lemeshow test uses a test statistic that asymptotically follows a χ² distribution to assess whether or not the observed event rates match expected event rates in subgroups of the model population.
The Hosmer–Lemeshow test statistic is given by:
H = Σ (from g = 1 to G) (Og − Eg)² / [Ng·πg·(1 − πg)]
Here Og, Eg, Ng, and πg denote the observed events, expected events, observations and predicted risk for the g-th risk decile group, and G is the number of groups. The test statistic asymptotically follows a χ² distribution with G − 2 degrees of freedom. The number of risk groups may be adjusted depending on how many fitted risks are determined by the model. This helps to avoid singular decile groups.
A pseudo R-squared for the logistic model can be defined by analogy with OLS:
R²_logistic = [(−2LL_null) − (−2LL_k)] / (−2LL_null)
R²_OLS = (SS_total − SS_residual) / SS_total = SS_regression / SS_total
Where the null model is the logistic model with just the constant and the k model contains all the predictors in the model.
In SPSS, there are two modified versions of this basic idea, one developed by Cox & Snell and the other developed by Nagelkerke. The Cox and Snell R-square is computed as:
R²_CoxSnell = 1 − (L_null / L_k)^(2/n)
where L_null and L_k are the likelihoods of the constant-only model and the model with all k predictors, and n is the sample size. Because this R-squared value cannot reach 1.0, Nagelkerke modified it. The correction increases the Cox and Snell version to make 1.0 a possible value for R-squared.
Nagelkerke Pseudo-R²
R²_Nagelkerke = [1 − (L_null / L_k)^(2/n)] / [1 − (L_null)^(2/n)]
(Table: four unique observations with their actual outcomes and the model's predicted probabilities of success.)
In this table, we are working with unique observations. The model was
developed for Y = Success. So it should show high probability for the observation
where the real outcome has been Success and a low probability for the observation
where the real outcome has been No.
Related Measures
Let nc, nd and t be the number of concordant pairs, discordant pairs and the total number of pairs in a dataset of N observations. Then (t − nc − nd) is the number of tied pairs.
c = [nc + 0.5·(t − nc − nd)] / t
Somers' D = (nc − nd) / t
Goodman–Kruskal Gamma = (nc − nd) / (nc + nd)
Kendall's Tau-a = (nc − nd) / [0.5·N·(N − 1)]
In the ideal case, all the yes events would have very high probabilities and the no events very low probabilities, as shown in the left chart. But the reality is more like the right chart: we have some yes events with very low probability and some no events with very high probability.
Time Series Analysis 11
1. Definition of Time Series: An ordered sequence of values of a variable
at equally spaced time intervals.
Secular Trend
The trend is the long term pattern of a time series. A trend can be positive or
negative depending on whether the time series exhibits an increasing long term
pattern or a decreasing long term pattern. If a time series does not show an
increasing or decreasing pattern then the series is stationary in the mean. For
example, population increases over a period of time, price increases over a period of
years, production of goods of the country increases over a period of years. These are
the examples of upward trend. The sales of a commodity may decrease over a period
of time because of better products coming to the market. This is an example of
declining trend or downward trend.
Cyclical Variation
The second component of a time series is cyclical variation. A typical business
cycle consists of a period of prosperity followed by periods of recession, depression,
and then recovery with no fixed duration of the cycle. There are sizable fluctuations
unfolding over more than one year in time above and below the secular trend. In a
recession, for example, employment, production and many other business and
economic series are below the long-term trend lines. Conversely, in periods of
prosperity they are above their long-term trend lines.
Seasonal Variation
The third component of a time series is the seasonal component. Many sales,
production, and other series fluctuate with the seasons. The unit of time reported is
either quarterly or monthly.
spreading their fixed costs over the entire year rather than a few months. Chart 16–
2 shows the quarterly sales, in millions of dollars, of Hercher Sporting Goods, Inc.
They are a sporting goods company that specializes in selling baseball and
softball equipment to high schools, colleges, and youth leagues. They also have
several retail outlets in some of the larger shopping malls. There is a distinct
seasonal pattern to their business. Most of their sales are in the first and second
quarters of the year, when schools and organizations are purchasing equipment for
the upcoming season. During the early summer, they keep busy by selling
replacement equipment. They do some business during the holidays (fourth
quarter). The late summer (third quarter) is their slow season.
In its standard form, the exponentially weighted moving average (EMA) can be written recursively as St = α·Yt + (1 − α)·St−1, where:
● The coefficient α represents the degree of weighting decrease, a constant
smoothing factor between 0 and 1. A higher α discounts older observations
faster.
● Yt is the value at a time period t.
● St is the value of the EMA at any time period t.
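A minimal Python sketch of single exponential smoothing following this recursion, applied to a short hypothetical series:

def exponential_smoothing(y, alpha):
    """Return the exponentially smoothed series S_t for observations y."""
    s = [y[0]]                                    # initialise with the first observation
    for t in range(1, len(y)):
        s.append(alpha * y[t] + (1 - alpha) * s[-1])
    return s

series = [12, 15, 14, 18, 20, 19, 23, 25]         # hypothetical observations
print(exponential_smoothing(series, alpha=0.3))   # a higher alpha discounts the past faster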
You can use the following forecasting methods. For each of these methods,
you can specify linear, quadratic, or no trend.
The stepwise autoregressive method is used by default. This method
combines time trend regression with an autoregressive model and uses a stepwise
method to select the lags to use for the autoregressive process.
The exponential smoothing method produces a time trend forecast, but in
fitting the trend, the parameters are allowed to change gradually over time, and
earlier observations are given exponentially declining weights. Single, double, and
triple exponential smoothing are supported, depending on whether no trend, linear
trend, or quadratic trend is specified. Holt two-parameter linear exponential
smoothing is supported as a special case of the Holt-Winters method without
seasons.
The Winters method (also called Holt-Winters) combines a time trend with
multiplicative seasonal factors to account for regular seasonal fluctuations in a
series. Like the exponential smoothing method, the Winters method allows the
parameters to change gradually over time, with earlier observations given
exponentially declining weights. You can also specify the additive version of the
Winters method, which uses additive instead of multiplicative seasonal factors.
When seasonal factors are omitted, the Winters method reduces to the Holt two-
parameter version of double exponential smoothing.
5. Stochastic Processes
A random or stochastic process is a collection of random variables ordered in
time. An example of the continuous stochastic process is an electrocardiogram and
an example of the discrete stochastic process is GDP.
The dynamic phenomena that we observe in a time series can be grouped into
two classes:
The first are those that take stable values in time around a constant level,
without showing a long term increasing or decreasing trend. For example, yearly
rainfall in a region, average yearly temperatures or the proportion of births
corresponding to males. These processes are called stationary.
A second class of processes are the non-stationary processes, which are
those that can show trend, seasonality and other evolutionary effects over time. For
example, the yearly income of a country, company sales or energy demand are series that evolve over time with more or less stable trends.
If the time series is non-stationary, then each set of time series data will have its own characteristics, so we cannot generalize the behaviour of one set to the others.
Testing
The null hypothesis is that δ = 0 (that is, ρ = 1), which means non-stationarity in the time-series data.
However, the t value of the estimated coefficient of Xt−1 does not follow the t distribution even in large samples.
7. Dickey–Fuller Test
The Dickey–Fuller Test tests whether a unit root is present in
an autoregressive model. It is named after the statisticians David Dickey
and Wayne Fuller, who developed the test in 1979. Dickey and Fuller have shown
that, under the null hypothesis, the estimated t value of the coefficient of Xt−1
follows the τ (Tau) Statistic. This test is known as Dickey – Fuller (DF) Test. In
conducting DF test, we assumed that the error terms ut are uncorrelated. But in
case the ut are correlated, Dickey and Fuller have developed a test, known as the
Augmented Dickey–Fuller (ADF) test.
An autoregressive model of order p can be written as Xt = c + φ1·Xt−1 + φ2·Xt−2 + ... + φp·Xt−p + εt, where φ1, ....., φp are the parameters of the model, c is a constant, and εt is white noise.
Identification Stage
In this stage, the researcher visually examines the time plot of the series, the autocorrelation function and the partial autocorrelation function. Plotting the time path of the {yt} sequence provides useful information concerning outliers, missing values and structural breaks in the data. Non-stationary data have a pronounced trend or appear to meander without a constant long-run mean or variance.
These statistics do not have the standard normal t- and F-distributions. The critical values of these statistics are tabulated in the Dickey–Fuller test tables. The relevant time series is non-stationary under H0, and therefore the standard t-test is not applicable in this situation. One has to apply the Dickey–Fuller (DF) test in this context, provided the alternative hypothesis is
H0: ρ = 1;
H1: ρ = 0, i.e., stationary and non-autocorrelated error.
● Augmented Dickey Fuller tests:
The augmented Dickey-Fuller test is one that tests for a unit root in a time
series sample. The test is used in statistical research and econometrics, or the
application of mathematics, statistics, and computer science to economic data. The
primary differentiator between the two tests is that the ADF is utilized for a larger
and more complicated set of time series models. The augmented Dickey-Fuller
statistic used in the ADF test is a negative number, and the more negative it is, the
stronger the rejection of the hypothesis that there is a unit root. Of course, this is
only at some level of confidence. That is to say that if the ADF test statistic is
positive, one can automatically decide not to reject the null hypothesis of unit root.
In one example, with three lags, a value of -3.17 constituted rejection at the p-value
of .10.
For the testing of stationarity
H0: ρ = 1;
H1: |ρ| <1, i.e., stationary but autocorrelated error one has to apply augmented
Dickey Fuller (ADF) test.
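A minimal Python sketch of the ADF test using the statsmodels library on a simulated random walk, which contains a unit root by construction and so should not lead to rejection of H0:

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(3)
random_walk = np.cumsum(rng.normal(size=500))    # unit-root (non-stationary) series

adf_stat, p_value, used_lags, n_obs, critical_values, icbest = adfuller(random_walk)
print(adf_stat, p_value)       # statistic not very negative, large p-value: do not reject H0
print(critical_values)         # tau critical values at the 1%, 5% and 10% levels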
● Tests for structural breaks:
In performing unit root tests, special care must be taken to check whether a structural change has occurred. When there are structural breaks, the various Dickey–Fuller test statistics are biased towards non-rejection of a unit root. A structural break can be shown graphically:
The large simulated break is useful for illustrating the problem it poses for a Dickey–Fuller test. The straight line shown in the figure highlights the fact that the series appears to have a deterministic trend. In fact, the straight line is the best-fitting OLS equation:
Yt = a0 + a1·Yt−1 + εt
So, the misspecified equation will tend to mimic the trend line, biasing a1 towards unity. The bias in a1 means that the Dickey–Fuller test is biased towards accepting the null hypothesis of a unit root even though the series is stationary within each of the short intervals. Here we need to develop a formal procedure to test for unit roots in the presence of a structural change at time period t = τ.
Consider the null hypothesis of a one-time jump in the level of a unit root process against the alternative of a one-time change in the intercept of a trend-stationary process. Formally, the hypotheses are:
H0: Yt = a0 + Yt−1 + μ1·DP + εt
H1: Yt = a0 + a1·t + μ2·DL + εt
where DP is a pulse dummy and DL is a level dummy for the break at t = τ.
NOTE: Researchers should be aware that if AIC and SBC select the
same model, they can be confident of their results. However, it is a
matter of caution if these measures select two different models.
The SBC has superior large-sample properties. Let (p*, q*) be the true order of the data-generating process. Suppose we use AIC and SBC to estimate all ARMA models of order (p, q) where p ≥ p* and q ≥ q*. Both AIC and SBC will select models of orders greater than or equal to (p*, q*) as the sample size approaches infinity. The SBC is asymptotically consistent, while the AIC is biased towards selecting an over-parameterized model. In small samples the AIC can work better than the SBC.
To check for outliers and evidence of periods in which the data are not fit well
The standard practice here is to plot the residuals to look for outliers and for evidence of periods in which the model does not fit the data well. If all possible ARMA models show evidence of a poor fit during a reasonably long portion of the sample, it is wise to consider alternate methods such as:
1. Intervention Analysis
2. Transfer function Analysis
3. Multivariate Estimation Technique
Forecasting
The most important use of an ARMA model is to forecast future values of the {Yt} sequence, assuming that the true data-generating process and the current and past realisations of {εt} and {Yt} are known. The AR(1) model takes the form:
Yt+1 = a0 + a1·Yt + εt+1
Given the coefficients a0 and a1, we can forecast Yt+1 conditional on the information available at time period t as:
EtYt+1 = a0 + a1·Yt
where EtYt+j is a symbol representing the conditional expectation of Yt+j given the information available at time t. So,
EtYt+j = E(Yt+j | Yt, Yt−1, Yt−2, ....., εt, εt−1, ....)
Generalising the expression for forecasting further-ahead values of the series:
EtYt+j = a0·(1 + a1 + a1² + a1³ + .... + a1^(j−1)) + a1^j·Yt
The above equation is called the forecast function. So, if the series is stationary, i.e., |a1| < 1, EtYt+j converges to a0/(1 − a1) as j grows. For any stationary ARMA model, the conditional expectation forecast of Yt+j converges to the unconditional mean.
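A minimal Python sketch of this forecast function for hypothetical AR(1) coefficients, showing the convergence of the j-step-ahead forecast to the unconditional mean a0/(1 − a1):

a0, a1, y_t = 2.0, 0.8, 15.0      # hypothetical AR(1) coefficients and current value

def ar1_forecast(a0, a1, y_t, j):
    """j-step-ahead conditional forecast E_t Y_{t+j} for an AR(1) process."""
    return a0 * sum(a1 ** i for i in range(j)) + a1 ** j * y_t

for j in (1, 2, 5, 20, 100):
    print(j, ar1_forecast(a0, a1, y_t, j))

print("unconditional mean:", a0 / (1 - a1))   # forecasts converge to 10.0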
Forecasts will never be perfectly accurate, so every forecast has an error. Let us analyze the properties of the forecast errors:
● Et(et(j)) = 0
● Var(et(j)) = σ²·(1 + a1² + a1⁴ + .... + a1^(2(j−1))). The variance of the forecast error is increasing in j, so we can have more confidence in short-run forecasts than in long-run forecasts. As j → ∞, the forecast error variance converges to σ²/(1 − a1²).
The {εt} sequence is normally distributed, and hence we can place confidence intervals around the forecasts. The one-step-ahead forecast of Yt+1 is a0 + a1·Yt and the one-step-ahead forecast error variance is σ². As such, a 95% confidence interval for the one-step-ahead forecast can be constructed as:
a0 + a1·Yt ± 1.96·σ
For a generalised analysis with a higher-order model, say ARMA(2,1), we assume:
● All coefficients are known.
● All variables subscripted t, t−1, t−2, ... are known at period t.
● Et(εt+j) = 0 for j > 0.
The conditional expectation of Yt+1 is then:
EtYt+1 = a0 + a1·Yt + a2·Yt−1 + β1·εt
The one-step-ahead forecast error is the difference between Yt+1 and EtYt+1, i.e., between the actual and expected values. When the parameters have to be estimated rather than known, constructing confidence intervals becomes more difficult.
Forecast Evaluation
A common mistake is to think that the model with the best in-sample fit is the one that will forecast best. Let us consider the example of an ARMA(2,1) process. The one-step-ahead forecast error is:
et(1) = Yt+1 − a0 − a1·Yt − a2·Yt−1 − β1·εt = εt+1
Rule of Thumb: ARIMA model should never be trusted if the model is estimated
with fewer than 50 observations