Statistics Introduction: Dr. Sudeep Mallick
Why statistics?
Decision making is often based on
analysis of data.
Statistics helps you to make sense of the
data by using tools that summarize,
present and analyze the data
Decision maker can also ascertain the
confidence in the decisions.
Types of Decisions
Analyze: effect of the variable
Predict: relationship between two
variables
Predict: likely outcome
Observe: trend
Generalize: about population at large
Finance
  Statistical models of portfolio management (the Black-Scholes model uses the normal distribution)
  Financial modelling using probabilities
HR
  HR metrics visualization
  HR policy effectiveness analysis
  Comparison of attrition rates with industry averages
Marketing
  Marketing research
  CRM and analytics
Operations
  Six Sigma
  SPC
  Quality management
Examples
How many newspapers should the vendor stock
to maximize revenue?
Depends on the probability distribution of demand and
expected profit
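The newspaper question can be sketched numerically. The prices and the demand distribution below are hypothetical (none of them come from the slides), and unsold copies are assumed to be worthless:

```python
# Hypothetical inputs, chosen only to illustrate the idea.
COST, PRICE = 3.0, 5.0                      # assumed cost and selling price per copy
DEMAND_PROB = {10: 0.2, 20: 0.5, 30: 0.3}   # assumed probability of each demand level

def expected_profit(stock):
    """Expected profit for a given stock level, averaged over demand."""
    return sum(p * (PRICE * min(stock, demand) - COST * stock)
               for demand, p in DEMAND_PROB.items())

# Try each demand level as a candidate stock level and keep the best one.
best_stock = max(DEMAND_PROB, key=expected_profit)
```

With these assumed numbers, stocking for the middle demand level gives the highest expected profit; a different distribution or margin would change the answer.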
Descriptive Statistics
Collect
Organize
Summarize
Display
Analyze
Inferential Statistics
Predict and forecast
values of population
parameters
Test hypotheses about
values of population
parameters
Make decisions
Descriptive Statistics
Graphical statistics / Visualization
pictures
A picture is worth a thousand words
Visualization
Ordinal data
Interval data
Ratio data
Frequency Distribution
12  45  13  40  13  20  45  95  38  67
47  55  56  45  50  27  50  15  26  34
12  25  48  40  25  50  42  48  53  44
23  56  46  22
Frequency distribution
Delay in minutes    Frequency    Relative frequency
0 - 15                  12            0.286
15 - 30                  8            0.190
30 - 45                  6            0.143
45 - 60                 14            0.333
60 or more               2            0.048
Total                   42            1.000
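As a rough check of the table, relative frequency is each class frequency divided by the total. In this sketch the counts 12 and 14 and the total 42 are taken from the table, while 8, 6 and 2 are inferred from the relative frequencies 0.190, 0.143 and 0.048:

```python
# Class frequencies for the delay data (8, 6 and 2 are inferred, see lead-in).
counts = {
    "0 - 15": 12,
    "15 - 30": 8,
    "30 - 45": 6,
    "45 - 60": 14,
    "60 or more": 2,
}

total = sum(counts.values())
# Relative frequency = class frequency / total number of observations.
relative = {label: round(f / total, 3) for label, f in counts.items()}
```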
Graphical Representation of
Data
The next stage of analysis
after the data has been
tabulated is to present it
as a suitable graph. In this
section we will explore:
1. Bar charts
2. Pie charts
3. Histograms
4. Frequency polygons
5. Scatter plots
6. Time series plots
Histogram
A graph of the data in a frequency distribution is called a
histogram. The area of each bar represents the
frequency of occurrence (number of values) within each
class. If the bar widths are the same (constant), then
the height of each bar is directly proportional to the frequency,
so the frequencies themselves can be used as the bar heights
when constructing the histogram.
[Figure: histogram of the delay data; x-axis: Delay in Minutes (classes 0-15, 15-30, 30-45, 45-60, 60 or more)]
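The "area equals frequency" rule for histograms can be illustrated with a small sketch. The class edges and frequencies below are hypothetical; the function name is an illustration, not part of any library:

```python
def bar_heights(edges, freqs):
    """Return bar heights so that height * class width equals the frequency."""
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    return [f / w for f, w in zip(freqs, widths)]

# Hypothetical unequal classes 0-10, 10-20 and 20-40 with frequencies 5, 10, 10.
heights = bar_heights([0, 10, 20, 40], [5, 10, 10])

# With equal widths the heights are simply proportional to the frequencies.
equal_width_heights = bar_heights([0, 15, 30, 45], [12, 8, 6])
```

In the unequal case the last class has the same frequency as the second but is twice as wide, so its bar is drawn at half the height.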
Bar Chart
Party           Frequency
Conservative    400
Labour          510
Democrat        78
Green           55
Other           67

[Figure: bar chart of Frequency by Party]

Number of cars
Pink    5200    4100    6000    6900    6050    7000
Blue    2100    1050    2950    5000    6300    5200

[Figure: horizontal bar chart of Number of cars by Month (January, February, March) for Blue and Pink]
Pie Chart
Frequency Polygon
A frequency polygon is formed from a histogram by
joining the mid-points of the tops of the rectangles with
straight lines. The polygon is closed by joining the mid-points
of the first and last classes to the x-axis on either side, at a
distance of half the class interval beyond the first and last class.
Class (upper boundary included)    Class (upper boundary excluded)
0 to 10   (0 - 10)                 0 to less than 11   (0 - 11)
11 to 20  (11 - 20)                11 to less than 21  (11 - 21)
Both class structures are equivalent; neither is better than the other.
It is just a matter of style and taste which one to adopt.
Here "less than 11" means up to 10, 10.5, 10.9, 10.99 or 10.999,
depending on the precision of the data.
Class (UCB excluded): 0 to less than 11, written 0 - 11 or "0 up to 11"; 11 to less than 21, written 11 - 21 or "11 up to 21"
Class (UCB included): 0 - 10; 11 - 20
(UCB = upper class boundary)
The problem with the 0 - 10, 11 - 20 structure (upper
boundary included) is that, for decimal data, the
boundaries must be modified so that there are no gaps
between classes. For example, for a data point such as
10.33 the boundaries need a precision of 2 decimal
places, e.g.
10.51 - 20
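A minimal sketch of the "upper boundary excluded" structure, assuming the two classes 0 - 11 and 11 - 21 from the slides. The function name is hypothetical. Because each class runs up to, but does not include, its upper boundary, decimal values such as 10.33 need no boundary adjustment:

```python
from bisect import bisect_right

EDGES = [0, 11, 21]   # classes: 0 to less than 11, 11 to less than 21

def class_of(x):
    """Return the label of the class containing x, or None if x is out of range."""
    if not (EDGES[0] <= x < EDGES[-1]):
        return None
    i = bisect_right(EDGES, x) - 1        # index of the class whose lower edge is <= x
    return f"{EDGES[i]} to less than {EDGES[i + 1]}"
```

For instance, `class_of(10.33)` and `class_of(10.999)` both fall in the first class, while `class_of(11)` falls in the second.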
Ogive
Cumulative frequency distribution
Less than
More than
Cumulative Frequency
Ungrouped Data
More than (X)    c.f.        Less than (X)    c.f.
                 14
                 14-6=8                       0+6=6
                 8-0=8                        6+0=6
                 8-1=7                        6+1=7
                 7-4=3                        7+4=11
7                3-3=0                        11+3=14
Total = 14
Cumulative Frequency
Grouped Data
X        f    More than (X)    c.f.      Less than (X)    c.f.
1-10     6    0                14        1                0
11-20    0    10               14-6=8    11               0+6=6
21-30    1    20               8-0=8     21               6+0=6
31-40    4    30               8-1=7     31               6+1=7
41-50    3    40               7-4=3     41               7+4=11
Total = 14    50               3-3=0     51               11+3=14
Ogive Example
Cumulative Frequency
Distribution
Helps answer "less than" and "more than" questions
with ease
Helps create a cumulative probability distribution, which
answers cut-off probability questions
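The "less than" and "more than" cumulative frequencies of the grouped example can be sketched as running sums. The frequencies 6, 0, 1, 4, 3 are the ones implied by the arithmetic on the slide (14-6=8, 8-0=8, and so on):

```python
from itertools import accumulate

classes = ["1-10", "11-20", "21-30", "31-40", "41-50"]
freqs = [6, 0, 1, 4, 3]
total = sum(freqs)

# "Less than" c.f. at the boundaries 11, 21, 31, 41, 51: running total of frequencies.
less_than = list(accumulate(freqs))
# "More than" c.f. at the boundaries 10, 20, 30, 40, 50: what remains above each class.
more_than = [total - c for c in less_than]
```

Plotting `less_than` and `more_than` against their boundaries gives the two ogives.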
Exercise
Analysing class marks
Working with EXCEL/SPSS
Choosing appropriate class boundaries
Experimenting with class boundaries
Cross tabulation
Scatter Plot
Shows relationship between two variables
More
Pivot Tables of EXCEL
Visualization software such as Tableau
Visualization
Summary Statistics
Measure of central tendency
Measure of dispersion
Measure of shape
Summary Statistics
Arithmetic Mean
Median
Mode
Percentiles
Quartiles
Arithmetic mean
The mean of a data set is the average
of all the data values.
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
Population mean: $\mu = \frac{\sum x_i}{N}$
Mean example
Average delay in flight departure
Pros:
Cons:
Affected by extreme values
Good only for symmetrical distributions
Excel Function Method
Mean = Cell E12, Formula: =AVERAGE(B4:B16) = 56.4615
Mean
General formula
$\bar{X} = \frac{\sum f X}{\sum f}$
Weighted Average
Example - Calculation of CGPA
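A weighted average of this kind might be computed as follows. The credits and grade points are hypothetical values for illustration; the same computation gives the mean of a frequency distribution when the frequencies are used as weights:

```python
credits = [4, 3, 2]               # hypothetical course credits (the weights)
grade_points = [8.0, 9.0, 6.0]    # hypothetical grade points per course

# Weighted average = sum(weight * value) / sum(weights)
cgpa = sum(w * g for w, g in zip(credits, grade_points)) / sum(credits)

# Unweighted mean for comparison: it ignores how many credits each course carries.
simple_mean = sum(grade_points) / len(grade_points)
```

Here the 4-credit course pulls the weighted figure more strongly than the 2-credit one, which is why the two averages differ.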
Median
It is the middle item in a data set that is
arranged in ascending/descending order
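If there are n observations then the
Median = (n+1)/2 th observation.
Computation rule:
if n is odd then (n+1)/2 is an integer and the median is that observation
if n is even then the median is the average of the (n/2)th and (n/2 + 1)th observations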
Example
42 sorted observations: the median is the average of the
21st and 22nd observations
= (34 + 38)/2
= 36
Sorted observations, positions 8 to 42:
10  12  12  13  13  15  20
22  23  25  25  26  27  34
38  40  40  42  44  45  45
45  46  47  48  48  50  50
50  53  55  56  56  67  95
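The median rule can be sketched as below. The function name and both short data sets are hypothetical; the even-length list simply reuses 34 and 38, the central pair from the slide example:

```python
from statistics import median

def median_by_rule(sorted_values):
    """Median of an already sorted list, using the odd/even rule."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]                 # the (n+1)/2-th observation
    # n even: average the (n/2)-th and (n/2 + 1)-th observations
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

odd_case = median_by_rule([3, 5, 9])
even_case = median_by_rule([20, 34, 38, 50])
library_value = median([20, 34, 38, 50])          # standard-library cross-check
```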
Median
Not affected by extreme values
Does not use full data
Good measure of central tendency for
non-symmetrical data distribution
(skewed)
Mode
The mode is the most frequently occurring observation
The mode in the example is 0
The greatest frequency can occur at two or more
different values.
If the data have exactly two modes, the data are
bimodal.
If the data have more than two modes, the data are
multimodal.
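One way to find the mode or modes is `statistics.multimode` from the Python standard library, which returns every value tied for the highest frequency. The helper name and the data sets below are hypothetical:

```python
from statistics import multimode

def describe_modes(data):
    """Return the list of modes and a label based on how many there are."""
    modes = multimode(data)
    kind = {1: "unimodal", 2: "bimodal"}.get(len(modes), "multimodal")
    return modes, kind

one_mode = describe_modes([0, 0, 0, 45, 45, 50])
two_modes = describe_modes([1, 1, 2, 2, 3])
three_modes = describe_modes([1, 1, 2, 2, 3, 3])
```

Note that if every value occurs exactly once, this sketch reports all of them as modes, so the label is only meaningful when some values repeat.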
Excel Function Method
Mode = Cell E14, Formula: =MODE(A5:A17) = 52
Quartiles
Measures of Variability
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Range
The range of a data set is the difference between the
largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest data
values.
Example from airline delay data
Range = 95 - 0 = 95 minutes
Interquartile range
The interquartile range of a data set is the
difference between the third quartile and the first
quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data
values.
Excel Function Method
Q1 = Cell F14, Formula: =QUARTILE.INC(B4:B16,1)
Q3 = Cell F16, Formula: =QUARTILE.INC(B4:B16,3)
QR = Cell F17, Formula: =F16-F14
SIQR = Cell F18, Formula: =(F16-F14)/2
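The same quantities can be sketched in Python. The data set is hypothetical, and the `method="inclusive"` option of `statistics.quantiles` is used on the assumption that it interpolates the same way as the spreadsheet's QUARTILE.INC:

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8]     # hypothetical sample

# n=4 splits the data into quartiles; the three cut points are Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

iqr = q3 - q1        # interquartile range: spread of the middle 50% of the data
siqr = iqr / 2       # semi-interquartile range
```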
Variance
The variance is a measure of variability
that utilizes all the data.
It is based on the difference between the
value of each observation ($x_i$) and the
mean ($\bar{x}$ for a sample, $\mu$ for a population).
Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Variance
$\text{Variance} = \frac{\sum f (X - \bar{X})^2}{\sum f}$
$\text{Variance} = \frac{\sum X^2}{n} - (\bar{X})^2$
Variance
For frequency distribution use a slightly
different formula:
$\text{Variance} = \frac{\sum f X^2}{\sum f} - (\bar{X})^2$
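The variance formulas can be compared on a small example. The data set is hypothetical; the sketch computes the population form (divide by n), the sample form (divide by n - 1) and the "mean of squares minus square of the mean" shortcut, and cross-checks them against the standard library:

```python
from statistics import pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical observations
n = len(data)
mean = sum(data) / n

squared_deviations = sum((x - mean) ** 2 for x in data)
population_var = squared_deviations / n          # divide by N
sample_var = squared_deviations / (n - 1)        # divide by n - 1

# Shortcut form: mean of the squares minus the square of the mean.
shortcut_var = sum(x * x for x in data) / n - mean ** 2

library_population = pvariance(data)
library_sample = variance(data)
```

The shortcut agrees with the population form, not the sample form, because both divide by n.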
Sample Variance
Standard deviation
The standard deviation of a data set is the positive
square root of the variance.
It is measured in the same units as the data, making it
more easily comparable, than the variance, to the mean.
If the data set is a sample, the standard deviation is
denoted s.
If the data set is a population, the standard deviation is
denoted $\sigma$ (sigma).
$SD = \sqrt{\text{Var}}$
Use of EXCEL
Coefficient of Variation
The coefficient of variation indicates how large the
standard deviation is in relation to the mean.
If the data set is a sample, the coefficient of variation
is computed as follows:
Sample: $\left(\frac{s}{\bar{x}} \times 100\right)\%$
Population: $\left(\frac{\sigma}{\mu} \times 100\right)\%$
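A short sketch of the sample standard deviation and coefficient of variation, using a hypothetical data set:

```python
from statistics import mean, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]     # hypothetical sample

s = stdev(data)                     # sample SD: square root of the sample variance
cv_percent = s / mean(data) * 100   # SD expressed as a percentage of the mean
```

Because the coefficient of variation is unit-free, it can be used to compare the variability of data sets measured on different scales.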
Skewness
Skewness is a measure of the degree of
asymmetry of a distribution
Pearson's coefficient of skewness
PCS =
$\sqrt{\frac{6}{N}}$
Kurtosis
Kurtosis is a measure of whether the data are peaked or
flat relative to a normal distribution.
Mesokurtic (bell shaped) (ZERO)
Leptokurtic (peaked) (POSITIVE)
Platykurtic (flat) (NEGATIVE)
Fisher's Kurtosis
FS =
Excel Function Method
Fisher's kurtosis = Cell E10, Formula: =KURT(B4:B16) = -0.4253
Critical value = $2\sqrt{\frac{24}{N}}$
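The sign convention for kurtosis (zero, positive, negative) can be illustrated with a minimal sketch. It uses the plain population-moment form of excess kurtosis, which is an assumption made here for simplicity; the spreadsheet KURT function applies a small-sample adjustment, so its values will generally differ. The function names and data are hypothetical:

```python
def excess_kurtosis(data):
    """Population excess kurtosis: fourth moment / (second moment squared) - 3."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    return m4 / m2 ** 2 - 3

def shape(k, tol=1e-9):
    """Classify a kurtosis value using the sign convention from the slide."""
    if k > tol:
        return "leptokurtic"
    if k < -tol:
        return "platykurtic"
    return "mesokurtic"

k = excess_kurtosis([1, 2, 3, 4, 5])    # evenly spread, flat data
label = shape(k)
```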