
Statistics Introduction: Dr. Sudeep Mallick

Statistics can help analyze and make sense of data to support decision making. Descriptive statistics summarize and describe data through tables, charts, and summary calculations. Inferential statistics are used to predict unknown population parameters, test hypotheses, and generalize samples to populations. The document discusses using statistics to design an attractive cell phone plan for students, including collecting call data, analyzing descriptive statistics, testing hypotheses, and predicting behavior through techniques like regression and ANOVA. It also outlines applying statistics in business functions like marketing, finance, HR, and operations.

Uploaded by

VishalRathore

Statistics Introduction

Dr. Sudeep Mallick

Why statistics?
Decision making is often based on
analysis of data.
Statistics helps you to make sense of the
data by using tools that summarize,
present and analyze the data
Decision maker can also ascertain the
confidence in the decisions.

Types of Decisions
Analyze: effect of the variable
Predict: relationship between two
variables
Predict: likely outcome
Observe: trend
Generalize: about population at large

Cell Phone Scheme


How to make an attractive plan for IMI Cal students?
What data to collect?
Number of calls made
Duration of calls made
Time of call
Amount of data usage

How does the data look and feel? (descriptive statistics)

Can the IMI Cal student data be used for predicting the behavior of students at other colleges? (estimation and hypothesis testing)
What would be the confidence in the prediction? (probability)
What is the use of the existing database?
Can the previous year's data hold for this year too? (hypothesis testing)
Is the call rate for students of IMI Cal and IMI Delhi similar? (ANOVA)
Does the call rate distribution follow a normal distribution? (chi-square)
Can we use the age of the user group to predict the average call duration of students during peak hours? (regression and correlation)

Use of Statistics in Business Functions


HR
HR metrics visualization
HR policy effectiveness analysis
Comparison of attrition rates with industry averages

Finance
Statistical models of portfolio management (Black-Scholes model uses normal distribution)
Financial modelling using probabilities

Marketing
Marketing research
CRM and analytics

Operations
Six sigma
SPC
Quality Management

Examples
How many newspapers should the vendor stock
to maximize revenue?
Depends on the probability distribution of demand and
expected profit

Are two or more market segments significantly different?
Hypothesis testing

What proportion of people are happy with the Sixth-pay commission report?
Parameter estimation

Business Research Methods


The Statistics course lays the foundation for Business Research Methods or Research Methodology courses

Subdivision within Statistics

Descriptive Statistics
Collect
Organize
Summarize
Display
Analyze

Inferential Statistics
Predict and forecast
values of population
parameters
Test hypotheses about
values of population
parameters
Make decisions

Descriptive Statistics
Graphical statistics / Visualization: pictures
A picture is worth a thousand words

Summary statistics: numbers
Simplify information
Use a single number to describe characteristics of a data set

Visualization

Types of Data Variables

Variable - A variable is any measured characteristic or attribute that differs for different subjects e.g. height of a building, eye colour.
Qualitative (or categorical) - A descriptive variable measuring a particular characteristic (e.g. eye colour), or a variable that can be ranked (e.g. finished first, fourth etc.)
Quantitative - A numerical variable measured on one of two scales (interval/ratio)
Nominal - Assigning items to categories e.g. number of people with blue eyes. Frequency distributions are usually used to tabulate and analyse problems involving nominal data.
Ordinal - A set of data is said to be ordinal if the values belonging to it can be ranked
Interval - An interval scale is a scale of measurement where the distance between any two adjacent units of measurement (or intervals) is the same but the zero point is arbitrary
Ratio - Ratio data are continuous data where both differences and ratios are interpretable and which have a natural zero

Measurement Scale Examples


Measurement scale: recognising a measurement scale

Nominal data
1. Classification data e.g. male or female, red or black car.
2. Arbitrary labels e.g. m or f, r or b, 0 or 1.
3. No ordering e.g. it makes no sense to state that r > b.

Ordinal data
1. Ordered list e.g. student satisfaction scale of 1, 2, 3, 4, and 5.
2. Differences between values are not important e.g. political parties can be given labels: far left, left, mid, right, far right etc. and student satisfaction scale of 1, 2, 3, 4, and 5.

Interval data
1. Ordered, constant scale, with no natural zero e.g. temperature, dates, psychological scales, etc.
2. One unit on the scale represents the same magnitude across the whole range of the scale.
3. Differences make sense, but ratios do not e.g. temperature difference.

Ratio data
1. Ordered, constant scale, and a natural zero e.g.

Math or Statistics allowed for various scales
Nominal - frequency distribution (f.d.), mode
Ordinal - f.d., median, mode
Interval - f.d., mean, median, mode, SD, variance
Ratio - all as for interval scale, in addition to geometric mean, harmonic mean, coefficient of variation and many other statistical measures involving ratios

the number of pounds lost during a six-week diet: ratio
the proportion of weight lost during a six-week diet: ratio
the heart rate of the participant: ratio
the percent shift in heart rate over baseline during an emotionally demanding task: ratio
the percent of errors made on a classification task: ratio
the number of false alarm responses in a monitoring task: ratio
the types of grammatical errors made in a writing sample: nominal
one's ice cream preference: nominal
how quickly a person gives up on an impossible task that looks like it should be possible: ratio
a student's SAT score: interval
the religious group that one affiliates with: nominal
the percentile rank from an achievement test: ordinal
the type of categorization errors in a sorting task: nominal
the age at which one went on his or her first date: ratio
the number of children in your family: ratio
the score on an anxiety sensitivity scale: interval
whether one has a pet (yes/no): nominal
whether one has a pet (0 for none, 1 for non-zero): ordinal
number of pets: ratio
the rank of a person's salary within the company: ordinal
the square footage of each participant's house or apartment: ratio
the number of frustrated comments made during a project assignment: ratio

Frequency Distribution

Example - Frequency Distribution


The following are the departure delays in minutes of 42 flights selected at random from a particular airport.
10, 12, 45, 13, 40, 13, 20, 45, 95, 38, 67, 47, 55, 56, 45, 50, 27, 50, 15, 26, 34, 12, 25, 48, 40, 25, 50, 42, 48, 53, 44, 23, 56, 46, 22

Grouped Frequency Distribution


Used when there is a wider variety of data points
Usually create 5 to 12 classes in the grouped frequency distribution
Class width = LCB(k+1) - LCB(k) = UCB(k+1) - UCB(k)

$\text{Class width} = \dfrac{(\text{Largest value} + 1) - \text{Smallest value}}{\text{Number of classes}}$
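
As a rough illustration of the class-width rule and of tallying a grouped frequency distribution, the sketch below uses a small made-up set of delays (not the slide's data). Rounding the width up to a whole number, excluding the UCB from each class, and treating the last class as open-ended are assumptions made for the example; `class_width` and `grouped_frequency` are hypothetical helper names.

```python
import math

def class_width(values, num_classes):
    """((largest value + 1) - smallest value) / number of classes, rounded up (assumption)."""
    return math.ceil(((max(values) + 1) - min(values)) / num_classes)

def grouped_frequency(values, start, width, num_classes):
    """Count values per class [LCB, UCB); the last class also absorbs larger values."""
    counts = [0] * num_classes
    for v in values:
        k = min(int((v - start) // width), num_classes - 1)
        counts[k] += 1
    total = len(values)
    # Each row: (LCB, UCB, frequency, relative frequency)
    return [(start + k * width, start + (k + 1) * width, c, c / total)
            for k, c in enumerate(counts)]

# Hypothetical delays in minutes
delays = [0, 5, 12, 18, 22, 31, 44, 47, 52, 58, 61, 95]
suggested_width = class_width(delays, 7)
table = grouped_frequency(delays, start=0, width=15, num_classes=5)
for lcb, ucb, f, rel in table:
    print(f"{lcb} to less than {ucb}: f={f}, relative={rel:.3f}")
```

In practice the suggested width is usually rounded to a convenient value (here 15 rather than 14) before the classes are fixed.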

Frequency distribution
Delay in minutes | Frequency | Relative frequency
0 - 15 | 12 | 0.286
15 - 30 | 8 | 0.190
30 - 45 | 6 | 0.143
45 - 60 | 14 | 0.333
60 or more | 2 | 0.048
Total | 42 |

Graphical Representation of Data
The next stage of analysis, after the data has been tabulated, is to graph the data using a suitable method. In this section we will explore:
1. Bar charts
2. Pie charts
3. Histograms
4. Frequency polygons
5. Scatter plots
6. Time series plots

The type of graph you will use depends upon the type of variable you are dealing with within your data set e.g. category (or nominal), ordinal, or interval (or ratio) data, as follows:
Data type | Which graph to use?
Category or nominal | Bar chart, pie chart, cross tab tables (or contingency tables)
Ordinal | Bar chart, pie chart, scatter plots
Interval or ratio | Histogram, frequency polygon, cumulative frequency curve (or ogive), scatter plots, time series plots

Histogram
A graph of the data in a frequency distribution is called a
histogram. The area of each bar is a measure of the
frequency of occurrence (number of values) within each
category. If the bar widths are the same (constant) then
the height of the bar is directly related to the frequency
and this information can then be used to construct the
histogram.
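
To make the area rule concrete, here is a minimal sketch assuming hypothetical classes of unequal width. The bar height is taken as frequency divided by class width so that each bar's area equals its frequency; `bar_heights` is an illustrative name, not something from the slides.

```python
def bar_heights(classes):
    """classes: list of (lower, upper, frequency).
    Height = frequency / width, so that area (width * height) equals the frequency."""
    return [(lo, hi, f / (hi - lo)) for lo, hi, f in classes]

# Hypothetical classes with unequal widths
classes = [(0, 10, 5), (10, 20, 8), (20, 50, 6)]
heights = bar_heights(classes)
areas = [(hi - lo) * h for lo, hi, h in heights]
print(heights)
print(areas)
```

When all widths are equal, dividing by the width only rescales the bars, which is why the frequency itself can be used as the height.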

Frequency distribution - histogram
[Figure: Frequency Histogram. x-axis: Delay in Minutes (0-15, 15-30, 30-45, 45-60, 60 or more); y-axis: Frequency - absolute numbers (0 to 16)]

Relative frequency Histogram
[Figure: Relative frequency histogram. x-axis: Delay in Minutes (0-15, 15-30, 30-45, 45-60, 60 or more); y-axis: Relative frequency - fraction/percent (0 to 0.35)]

Bar Chart
Proposed voting behaviour

Party | Frequency
Conservative | 400
Labour | 510
Democrat | 78
Green | 55
Other | 67

[Figure: bar chart of proposed voting behaviour. x-axis: Party; y-axis: Frequency (0 to 600)]

Horizontal Bar Chart

Month | Pink | Blue
January | 5200 | 2100
February | 4100 | 1050
March | 6000 | 2950
April | 6900 | 5000
May | 6050 | 6300
June | 7000 | 5200

[Figure: horizontal bar chart "Half yearly car sales". y-axis: Month (January to June); x-axis: Number of cars (0 to 14000); series: Pink, Blue]

Pie Chart

Frequency Polygon
A frequency polygon is formed from a histogram by joining the mid-points of the tops of the rectangles with straight lines. The mid-points of the first and last classes are joined to the x-axis on either side, at a distance equal to half the class interval of the first and last class respectively.
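
The construction can be sketched as a list of (x, y) points. The classes below are hypothetical and `frequency_polygon` is an illustrative helper; the polygon is closed on the x-axis half a class interval outside the first and last class.

```python
def frequency_polygon(classes):
    """classes: contiguous (lower, upper, frequency) tuples in ascending order.
    Returns the polygon's vertices as (x, y) points."""
    first_lo, first_hi, _ = classes[0]
    last_lo, last_hi, _ = classes[-1]
    # Start on the x-axis, half a class interval before the first class
    points = [(first_lo - (first_hi - first_lo) / 2, 0)]
    # One vertex per class: mid-point of the class at the height of its bar
    points += [((lo + hi) / 2, f) for lo, hi, f in classes]
    # End on the x-axis, half a class interval after the last class
    points.append((last_hi + (last_hi - last_lo) / 2, 0))
    return points

# Hypothetical grouped data
polygon = frequency_polygon([(0, 10, 3), (10, 20, 7), (20, 30, 4)])
print(polygon)
```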

Note on Class Boundary Styles


Class (inclusive), written in either of two styles:
0 to 10 | 0 - 10
11 to 20 | 11 - 20

If the next data item is 10 it goes to the first class; if it is 11 it goes to the next class.
The above structure is EXACTLY THE SAME as the one below.

Class (UCB excluded), written in either of two styles:
0 to less than 11 | 0 - 11
11 to less than 21 | 11 - 21

Both class structures are equivalent; neither is better than the other.
It is just a matter of style and taste which one to adopt.
Here "less than 11" means 10, or 10.50, or 10.9, or 10.99, or 10.999, depending upon the nature of the data.

Note on Class Boundary Styles


Class (UCB excluded): 0 to less than 11; 11 to less than 21
Class (UCB excluded): 0 - 11; 11 - 21
Class (UCB included): 0 up to 11; 11 up to 21

The problem with overlapping boundaries such as 0 - 11, 11 - 21 is that it is not immediately clear whether the shared boundary is included in the upper or the lower class. A convention has to be followed.
Often the convention is that the UCB is not included in the class. That is, the classes mean 0 to less than 11, 11 to less than 21, etc.
This provides an advantage for decimal data.
Example: for a data point such as 10.97 we know that it lies in the class (0 - 11).
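
A small sketch of the "UCB excluded" convention, assuming the boundaries 0, 11, 21 from the example; `find_class` is a hypothetical helper. Because each class is the half-open interval LCB <= value < UCB, decimal values need no special handling.

```python
def find_class(value, boundaries):
    """boundaries: ascending class boundaries, e.g. [0, 11, 21].
    Returns the (LCB, UCB) pair with LCB <= value < UCB, or None if out of range."""
    for lcb, ucb in zip(boundaries, boundaries[1:]):
        if lcb <= value < ucb:
            return (lcb, ucb)
    return None

boundaries = [0, 11, 21]
print(find_class(10.97, boundaries))  # falls in the first class
print(find_class(11, boundaries))     # the shared boundary goes to the upper class
```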

Note on Class Boundary Styles


Class (inclusive)
0 - 10
11 - 20

The problem with this structure is that for decimal data we would need to modify the boundaries so that there are no gaps.
Example: for a data point such as 10.33 we would need to modify the boundaries so that they have a precision of 2 decimal places.

Class (inclusive)
0 - 10.50
10.51 - 20

Ogive
Cumulative frequency distribution
Less than
More than

Cumulative Frequency
Ungrouped Data
X | More than (X) | c.f. | Less than (X) | c.f.
 |  | 14 |  |
 |  | 14-6=8 |  | 0+6=6
 |  | 8-0=8 |  | 6+0=6
 |  | 8-1=7 |  | 6+1=7
 |  | 7-4=3 |  | 7+4=11
Total = 14 | 7 | 3-3=0 |  | 11+3=14

Cumulative Frequency
Grouped Data
X | More than (X) | c.f. | Less than (X) | c.f.
1-10 |  | 14 |  |
11-20 | 10 | 14-6=8 | 11 | 0+6=6
21-30 | 20 | 8-0=8 | 21 | 6+0=6
31-40 | 30 | 8-1=7 | 31 | 6+1=7
41-50 | 40 | 7-4=3 | 41 | 7+4=11
Total = 14 | 50 | 3-3=0 | 51 | 11+3=14
(Extra class needed here)

Ogive Example

Cumulative Frequency
Distribution
Helps answer less than, more than type questions
with ease
Helps create cumulative probability distribution which
answers cut-off probability questions
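
The running additions and subtractions used in the cumulative frequency tables can be sketched as follows. The frequencies 6, 0, 1, 4, 3 are the ones implied by the subtraction steps shown in those tables; `cumulative_frequencies` is an illustrative name.

```python
from itertools import accumulate

def cumulative_frequencies(freqs):
    """Return (less-than c.f., more-than c.f.) for a list of class frequencies."""
    less_than = list(accumulate(freqs))            # 0+6, 6+0, 6+1, ...
    total = less_than[-1]
    # Start from the total and subtract each class in turn: 14, 14-6, 8-0, ...
    more_than = [total] + [total - c for c in less_than]
    return less_than, more_than

less_than, more_than = cumulative_frequencies([6, 0, 1, 4, 3])
print(less_than)
print(more_than)
```

The "more than" list has one extra entry, which corresponds to the extra class needed at the end of the table.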

Exercise
Analysing class marks
Working with EXCEL/SPSS
Choosing appropriate class boundaries
Experimenting with class boundaries

Cross tabulation

A joint frequency distribution of two variables (e.g. nature of airline, delay in minutes)
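
As a minimal sketch of a joint frequency count, the records below are made up for illustration (the airline categories and delay classes are assumptions, not data from the slides).

```python
from collections import Counter

# Hypothetical records: (nature of airline, delay class in minutes)
records = [
    ("domestic", "0-15"), ("domestic", "15-30"), ("international", "0-15"),
    ("domestic", "0-15"), ("international", "30-45"),
]

crosstab = Counter(records)                              # joint frequencies
row_totals = Counter(airline for airline, _ in records)  # marginal totals per airline

for (airline, delay), count in sorted(crosstab.items()):
    print(airline, delay, count)
```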

Scatter Plot
Shows relationship between two variables

More
Pivot Tables of EXCEL
Visualization software such as Tableau

Visualization

Descriptive statistics Summary Statistics

Summary Statistics
Measure of central tendency
Measure of dispersion
Measure of shape

Summary Statistics

Measures of Central Tendency

Arithmetic Mean
Median
Mode
Percentiles
Quartiles

Arithmetic mean
The mean of a data set is the average
of all the data values.
Sample mean: $\bar{x} = \dfrac{\sum x_i}{n}$

Population mean: $\mu = \dfrac{\sum x_i}{N}$

Mean example
Average delay in flight departure = 1354/42 = 32.2381 minutes

Pros:
Makes use of full data

Cons:
Affected by extreme values
Good only for symmetrical distributions
Excel Function Method
Mean = Cell E12 Formula:=AVERAGE(B4:B16)=56.4615

Mean
General formula

$\bar{X} = \dfrac{\sum f X}{\sum f}$

For grouped data, X is the class mid-point
Class mid-point = LCB + (class-width/2)
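
A short sketch of the grouped-data mean, using hypothetical classes; `grouped_mean` is an illustrative helper that takes X as the class mid-point, LCB + (class-width/2).

```python
def grouped_mean(classes):
    """classes: list of (LCB, UCB, f). Mean = sum(f * X) / sum(f), X = class mid-point."""
    total_f = sum(f for _, _, f in classes)
    weighted_sum = sum(f * (lcb + (ucb - lcb) / 2) for lcb, ucb, f in classes)
    return weighted_sum / total_f

# Hypothetical grouped data: mid-points 5, 15, 25
mean = grouped_mean([(0, 10, 2), (10, 20, 5), (20, 30, 3)])
print(mean)  # (2*5 + 5*15 + 3*25) / 10
```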

Weighted Average
Example - Calculation of CGPA
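
One way to picture the CGPA calculation is as a credit-weighted mean of grade points. The credits and grade points below are invented for illustration, and the weighting scheme is an assumption since the slide does not spell it out.

```python
def weighted_average(pairs):
    """pairs: list of (weight, value). Returns sum(w * x) / sum(w)."""
    total_weight = sum(w for w, _ in pairs)
    return sum(w * x for w, x in pairs) / total_weight

# Hypothetical courses: (credits, grade point)
courses = [(3, 8.0), (4, 7.0), (2, 9.0)]
cgpa = weighted_average(courses)
print(round(cgpa, 2))
```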

Median
It is the middle item in a data set that is arranged in ascending/descending order
If there are n observations then the Median = ((n+1)/2)th observation.
Computation rule:
if n is odd then (n+1)/2 is an integer, so take that observation
if n is even then use the average of the (n/2)th and (n/2 + 1)th observations
Excel Function Method
Median = Cell E13 Formula:=MEDIAN(B4:B16)=53
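
The odd/even computation rule can be sketched directly; the sample values are hypothetical and positions are converted from the 1-based positions of the rule to Python's 0-based indexing.

```python
def median(values):
    """Middle item of the sorted data; average of the two middle items when n is even."""
    data = sorted(values)
    n = len(data)
    if n % 2 == 1:
        return data[(n + 1) // 2 - 1]             # the ((n+1)/2)th observation
    return (data[n // 2 - 1] + data[n // 2]) / 2  # (n/2)th and (n/2 + 1)th observations

print(median([7, 1, 5]))     # odd n
print(median([4, 1, 3, 2]))  # even n
```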

Example
Sorted 42 observations: the median is the average of the 21st and 22nd observations
= (34+38)/2
= 36

Sorted observations: ..., 10, 12, 12, 13, 13, 15, 20, 22, 23, 25, 25, 26, 27, 34, 38, 40, 40, 42, 44, 45, 45, 45, 46, 47, 48, 48, 50, 50, 50, 53, 55, 56, 56, 67, 95

Median for Grouped Data


Compute Cumulative frequency
Find median class holding the median element using
(n+1)/2 formula
Use formula:
(Levin)
(Davis Pecar)
L = LCB of median class
C = median class width
F = cumulative frequency before median class
f = frequency within median class
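
The two textbook formulas are not reproduced in the text here, so the sketch below uses one commonly taught interpolation form, Median = L + ((n/2 - F)/f) x C, as an assumption; it may differ slightly from either cited version. The median class is located with the (n+1)/2 position, and the classes are hypothetical.

```python
def grouped_median(classes):
    """classes: list of (LCB, UCB, f) in ascending order.
    Assumed form: L + ((n/2 - F) / f) * C."""
    n = sum(f for _, _, f in classes)
    position = (n + 1) / 2   # locates the class holding the median element
    F = 0                    # cumulative frequency before the current class
    for lcb, ucb, f in classes:
        if F + f >= position:
            return lcb + (n / 2 - F) / f * (ucb - lcb)
        F += f

# Hypothetical grouped data: cumulative frequencies 2, 7, 10
g_median = grouped_median([(0, 10, 2), (10, 20, 5), (20, 30, 3)])
print(g_median)
```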

Median
Not affected by extreme values
Does not use full data
Good measure of central tendency for
non-symmetrical data distribution
(skewed)

Mode
Mode is the most frequently occurring observation
The mode in the example is 0
The greatest frequency can occur at two or more
different values.
If the data have exactly two modes, the data are
bimodal.
If the data have more than two modes, the data are
multimodal.
Excel Function Method
Mode = Cell E14 Formula:=MODE(A5:A17)=52

Mode for Grouped Data

L = LCB of the modal class


f0 = frequency of the class below the modal class
f1 = frequency of the modal class
f2 = frequency of the class above the modal class
C = modal class width
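
The formula itself is missing from the text; a widely used form with these symbols is Mode = L + ((f1 - f0) / (2 f1 - f0 - f2)) x C, which the sketch below assumes. Treating a missing neighbouring class as frequency 0 is a further assumption, and the data are hypothetical.

```python
def grouped_mode(classes):
    """classes: list of (LCB, UCB, f) in ascending order.
    Assumed form: L + (f1 - f0) / (2*f1 - f0 - f2) * C.
    Raises ZeroDivisionError if the modal class and both neighbours have equal frequency."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))                      # modal class (first one if tied)
    f0 = freqs[i - 1] if i > 0 else 0                # class below the modal class
    f1 = freqs[i]
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0   # class above the modal class
    lcb, ucb, _ = classes[i]
    return lcb + (f1 - f0) / (2 * f1 - f0 - f2) * (ucb - lcb)

# Hypothetical grouped data
g_mode = grouped_mode([(0, 10, 2), (10, 20, 6), (20, 30, 4)])
print(round(g_mode, 3))
```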

Percentiles and Quartiles

Given any set of ordered numerical observations:
The nth percentile is the value such that n percent of the data are equal to or below it.
Quartiles divide the data into 4 parts (so there are 3 quartiles)

Position of the Pth percentile = (n+1)P/100, where n is the number of observations

EXCEL may give you slightly different values than manual calculation
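
A minimal sketch of the position rule, assuming linear interpolation between neighbouring observations when the position is not a whole number (the slide gives only the position, so the interpolation step is an assumption). The data are hypothetical.

```python
def percentile(values, p):
    """Value at 1-based position (n + 1) * p / 100 in the sorted data."""
    data = sorted(values)
    n = len(data)
    pos = (n + 1) * p / 100
    if pos <= 1:
        return data[0]
    if pos >= n:
        return data[-1]
    lower = int(pos)        # whole part of the position
    frac = pos - lower      # fractional part, used for interpolation
    return data[lower - 1] + frac * (data[lower] - data[lower - 1])

sample = [10, 20, 30, 40, 50, 60, 70]
print(percentile(sample, 25), percentile(sample, 50), percentile(sample, 75))
```

Other software may use a different position rule, which is one reason results can differ slightly from a manual calculation.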

Quartiles

Quartiles are special names for particular percentiles
Q1 = 25th percentile
Q2 = 50th percentile = median
Q3 = 75th percentile

Percentile and Quartile


Grouped Data
Percentile P value

L = LCB of percentile class


C = percentile class width
F = cumulative frequency before percentile class
f = frequency within percentile class
P = nth position in 100
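
The percentile formula for grouped data is not shown in the text. By analogy with the grouped median, the sketch below assumes Percentile = L + ((P x n/100 - F) / f) x C, where n is the total frequency; treat it as one plausible form rather than the slide's exact formula. The data are hypothetical.

```python
def grouped_percentile(classes, p):
    """classes: list of (LCB, UCB, f) in ascending order; p in (0, 100].
    Assumed form: L + ((p*n/100 - F) / f) * C."""
    n = sum(f for _, _, f in classes)
    target = p * n / 100
    F = 0                    # cumulative frequency before the current class
    for lcb, ucb, f in classes:
        if f > 0 and F + f >= target:
            return lcb + (target - F) / f * (ucb - lcb)
        F += f

grouped = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]
p50 = grouped_percentile(grouped, 50)
p90 = grouped_percentile(grouped, 90)
print(p50, round(p90, 3))
```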

Percentile and Quartile


Excel Function Method
25th Percentile = Cell E15 Formula:=PERCENTILE.INC(B4:B16,0.25)=48
First Quartile = Cell E16 Formula:=QUARTILE.INC(B4:B16,1)=48
Second Quartile = Cell E17 Formula:=QUARTILE.INC(B4:B16,2)=53
Third Quartile = Cell E18 Formula:=QUARTILE.INC(B4:B16,3)=60

Measures of Variability

Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation

Range
The range of a data set is the difference between the
largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest data
values.
Example from airline delay data
Range = 95 - 0 = 95 minutes

Excel Function Method


Range = Cell F13 Formula:=MAX(B4:B16)-MIN(B4:B16)=71

Interquartile range
The interquartile range of a data set is the
difference between the third quartile and the first
quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data
values.
Excel Function Method
Q1 = Cell F14 Formula:=QUARTILE.INC(B4:B16,1)
Q3 = Cell F16 Formula:=QUARTILE.INC(B4:B16,3)
IQR = Cell F17 Formula:=F16-F14
SIQR = Cell F18 Formula:=(F16-F14)/2
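
For readers working outside Excel, the same quantities can be sketched with Python's standard library. The data are hypothetical; `method="inclusive"` treats the minimum and maximum as the 0th and 100th percentiles, which appears to be the closest counterpart of QUARTILE.INC, though that correspondence is stated here as an assumption.

```python
import statistics

# Hypothetical data set
data = [2, 4, 4, 5, 7, 9, 10, 12, 15]

data_range = max(data) - min(data)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1       # range of the middle 50% of the data
siqr = iqr / 2      # semi-interquartile range

print(data_range, q1, q2, q3, iqr, siqr)
```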

Variance
The variance is a measure of variability
that utilizes all the data.
It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).

Population variance: $\sigma^2 = \dfrac{\sum (x_i - \mu)^2}{N}$

Sample variance: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1}$
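
The only difference between the two formulas is the divisor, which the following sketch makes explicit on a small hypothetical data set and cross-checks against Python's `statistics` module.

```python
import statistics

# Hypothetical data set with mean 5
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)
squared_deviations = sum((x - mean) ** 2 for x in values)

population_variance = squared_deviations / len(values)    # divide by N
sample_variance = squared_deviations / (len(values) - 1)  # divide by n - 1

print(population_variance, sample_variance)
print(statistics.pvariance(values), statistics.variance(values))
```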

Variance

$\text{Variance} = \dfrac{\sum (X - \bar{X})^2}{N} = \dfrac{\sum X^2}{N} - (\bar{X})^2$

Excel Function Method
varp = Cell F20 Formula:=VAR.P(B4:B16)
sdp = Cell F21 Formula:=STDEV.P(B4:B16)

Variance
For a frequency distribution use a slightly different formula:

$\text{Variance} = \dfrac{\sum f X^2}{\sum f} - (\bar{X})^2$

For grouped data use the class midpoint as the value of X
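
A brief sketch of the frequency-distribution form, read here as sum(f X^2)/sum(f) minus the squared mean (a population-style variance; this reading of the slide's formula is an assumption). The (X, f) pairs are hypothetical.

```python
def frequency_variance(pairs):
    """pairs: list of (X, f). Variance = sum(f * X^2) / sum(f) - mean^2."""
    total_f = sum(f for _, f in pairs)
    mean = sum(f * x for x, f in pairs) / total_f
    return sum(f * x * x for x, f in pairs) / total_f - mean ** 2

# Hypothetical frequency distribution: value X occurs f times
freq_var = frequency_variance([(2, 1), (4, 3), (5, 2), (7, 1), (9, 1)])
print(freq_var)
```

For grouped data the same function applies with each X replaced by its class mid-point.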

Sample Variance

Standard deviation
The standard deviation of a data set is the positive
square root of the variance.
It is measured in the same units as the data, making it
more easily comparable, than the variance, to the mean.
If the data set is a sample, the standard deviation is
denoted s.
If the data set is a population, the standard deviation is
denoted σ (sigma).

$SD = \sqrt{\text{Var}}$

Use of EXCEL

Coefficient of Variation
The coefficient of variation indicates how large the
standard deviation is in relation to the mean.
If the data set is a sample, the coefficient of variation is computed as follows:

$CV = \dfrac{s}{\bar{x}} \times 100$

If the data set is a population, the coefficient of variation is computed as follows:

$CV = \dfrac{\sigma}{\mu} \times 100$
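
Both versions can be sketched with the standard library; the data are hypothetical and `coefficient_of_variation` is an illustrative helper that switches between the sample and population standard deviation.

```python
import statistics

def coefficient_of_variation(data, population=False):
    """Standard deviation as a percentage of the mean."""
    sd = statistics.pstdev(data) if population else statistics.stdev(data)
    return sd / statistics.mean(data) * 100

# Hypothetical data: mean 5, population SD 2
observations = [2, 4, 4, 4, 5, 5, 7, 9]
cv_population = coefficient_of_variation(observations, population=True)
cv_sample = coefficient_of_variation(observations)
print(cv_population, round(cv_sample, 2))
```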

Measure of Shape - Skewness

Skewness
Skewness - is a measure of the degree of
asymmetry of a distribution
Pearson's coefficient of skewness
PCS =

Excel uses Fisher's measure of skewness
FS =
Critical value $= 2\sqrt{6/N}$

Excel Function Method
Fisher's skew = Cell E7 Formula:=SKEW(B4:B16) = 0.4410
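
Since the formulas themselves are missing from the text, the sketch below uses commonly documented forms as assumptions: Pearson's median-based coefficient 3(mean - median)/s, the adjusted Fisher skewness n/((n-1)(n-2)) x sum(((x - mean)/s)^3) that Excel's SKEW is documented to compute, and a rule-of-thumb critical value of 2 x sqrt(6/N). The data are hypothetical.

```python
import math
import statistics

def pearson_skew(data):
    """Assumed form: 3 * (mean - median) / sample standard deviation."""
    return 3 * (statistics.mean(data) - statistics.median(data)) / statistics.stdev(data)

def fisher_skew(data):
    """Assumed form: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

def critical_skew(n):
    """Rule-of-thumb threshold: twice the approximate standard error sqrt(6/n)."""
    return 2 * math.sqrt(6 / n)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(fisher_skew(symmetric), fisher_skew(right_skewed), critical_skew(24))
```

A skewness whose absolute value exceeds the critical value is usually read as a noticeable departure from symmetry.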

Measure of Shape - Kurtosis

Kurtosis
Kurtosis is a measure of whether the data are peaked or
flat relative to a normal distribution.
Mesokurtic (bell shaped) (ZERO)
Leptokurtic (peaked) (POSITIVE)
Platykurtic (flat) (NEGATIVE)

Fisher's Kurtosis
FS =
Critical value $= 2\sqrt{24/N}$
Excel Function Method
Fisher's kurtosis = Cell E10 Formula:=KURT(B4:B16) = -0.4253
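
As with skewness, the kurtosis formula is missing from the text. The sketch assumes the sample excess kurtosis that Excel's KURT is documented to compute, n(n+1)/((n-1)(n-2)(n-3)) x sum(((x - mean)/s)^4) - 3(n-1)^2/((n-2)(n-3)), together with a rule-of-thumb critical value of 2 x sqrt(24/N). The data are hypothetical.

```python
import math
import statistics

def fisher_kurtosis(data):
    """Assumed sample excess kurtosis: zero for a normal curve, needs at least 4 values."""
    n = len(data)
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    fourth = sum(((x - mean) / s) ** 4 for x in data)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

def critical_kurtosis(n):
    """Rule-of-thumb threshold: twice the approximate standard error sqrt(24/n)."""
    return 2 * math.sqrt(24 / n)

flat = [1, 2, 3, 4, 5]   # evenly spread values give a negative (platykurtic) result
kurt = fisher_kurtosis(flat)
print(round(kurt, 4), critical_kurtosis(24))
```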
