ECON 230 - Statistics and Data Analysis - Lecture 1
Course Outline
Instructor: Amin Hussain
Contact:
Email: [email protected]
Room 254
Course Information:
Introduction to elementary probability and
statistics
Application and use of statistics in Economics
Evaluation and Interpretation of statistics
Preparation for Econometrics
Course Outline - II
Books
Ross, Sheldon. A First Course in Probability
Moore, David S. and McCabe, George P.,
Introduction to the Practice of Statistics
Newbold, P., Carlson, W. and Thorne, B.
Statistics for Business and Economics
Grading
Quizzes - 20
Assignments - 20
Mid - 25
Final - 35
Topics
Introduction
Probability
Random Variables
Sampling
Estimation
Hypothesis Testing
Regression
Applications
Big Data
Quality Control
Operations Research
Medicine
Law
Etc.
Key Definitions
A population is the collection of all items of
interest or under investigation
N represents the population size
A sample is an observed subset of the
population
n represents the sample size
A parameter is a specific characteristic of a
population
A statistic is a specific characteristic of a
sample
(Statistics for Business and Economics, 6e, 2007 Pearson Education)
[Figure: a sample drawn as a subset of the population]
Examples of Populations
Names of all registered voters in
Pakistan
Incomes of all families living in
Islamabad
Annual returns of all stocks traded on
the New York Stock Exchange
Grade point averages of all the students in
LUMS
Random Sampling
Simple random sampling is a procedure in
which
each member of the population is chosen strictly by
chance,
each member of the population is equally likely to be
chosen,
and
every possible sample of n objects is equally likely to be
chosen
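The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the population of ID numbers 1 to 1000, the sample size of 10, and the seed are all assumptions chosen for the example. The standard library function `random.sample` draws without replacement, so every possible set of n members is equally likely.

```python
import random

# Hypothetical population: ID numbers 1..N (an assumption for illustration)
N = 1000
population = list(range(1, N + 1))

n = 10                     # sample size
rng = random.Random(230)   # fixed seed so the draw can be reproduced

# Draw n distinct members; every subset of size n is equally likely
sample = rng.sample(population, n)

print(sorted(sample))
```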
Inferential statistics
provide the bases for predictions,
forecasts, and estimates that are used to
transform information into knowledge
Descriptive Statistics
Collect data
e.g., Survey
Present data
e.g., Tables and graphs
Summarize data
e.g., Sample mean
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Inferential Statistics
Estimation
e.g., Estimate the
population mean weight
using the sample mean
weight
Hypothesis testing
e.g., Test the claim that the
population mean weight is
120 pounds
Inference is the process of drawing conclusions or making
decisions about a population based on sample results
Types of Data
Data
  Categorical (defined categories or groups)
    Examples: Marital Status, Are you registered to vote?, Eye Color
  Numerical
    Discrete
      Examples: Number of Children
    Continuous
      Examples: Weight
Measurement Levels
Quantitative Data
  Ratio Data: differences between measurements, true zero exists
  Interval Data: differences between measurements but no true zero
Qualitative Data
  Ordinal Data: ordered categories (rankings, order, or scaling)
  Nominal Data: categories (no ordering or direction)
Graphical
Presentation of Data
Data in raw form are usually not
easy to use for decision making
Some type of organization is
needed
Table
Graph
Graphical
Presentation of Data
Techniques reviewed in this
chapter:
Categorical
Variables
Frequency distribution
Bar chart
Pie chart
Pareto diagram
Numerical
Variables
Line chart
Frequency distribution
Histogram and ogive
Stem-and-leaf display
Scatter plot
Tabulating Data: Frequency Distribution Table
Graphing Data: Bar Chart, Pie Chart, Pareto Diagram
The Frequency
Distribution Table
Summarize data by category
Example: Hospital Patients by Unit (variables are categorical)

Hospital Unit     Number of Patients   Percent of Total
Cardiac Care      1,052                11.93
Emergency         2,245                25.46
Intensive Care    340                  3.86
Maternity         552                  6.26
Surgery           4,630                52.50
Total             8,819                100.00

[Figures: bar chart and pie chart of hospital patients by unit; percentages in the chart labels are rounded to the nearest percent]
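As a rough check of the hospital example, the percentages can be recomputed from the patient counts. The sketch below uses only the counts given on the slide; the variable names are illustrative.

```python
# Patient counts by hospital unit, taken from the example above
patients = {
    "Cardiac Care": 1052,
    "Emergency": 2245,
    "Intensive Care": 340,
    "Maternity": 552,
    "Surgery": 4630,
}

total = sum(patients.values())

# Percentage of all patients in each unit, rounded to two decimals
percent = {unit: round(100 * count / total, 2) for unit, count in patients.items()}

for unit, count in patients.items():
    print(f"{unit:15s} {count:6,d} {percent[unit]:6.2f}")
print(f"{'Total':15s} {total:6,d}")
```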
Pareto Diagram
Used to portray categorical data
A bar chart, where categories are
shown in descending order of
frequency
A cumulative polygon is often shown
in the same graph
Used to separate the vital few from
the trivial many
Example: causes of 400 manufacturing defects

Manufacturing Error   Number of defects
Bad Weld              34
Poor Alignment        223
Missing Part          25
Paint Flaw            78
Electrical Short      19
Cracked case          21
Total                 400

Sorted in descending order of frequency:

Manufacturing Error   Number of defects   % of Total Defects
Poor Alignment        223                 55.75
Paint Flaw            78                  19.50
Bad Weld              34                  8.50
Missing Part          25                  6.25
Cracked case          21                  5.25
Electrical Short      19                  4.75
Total                 400                 100%

[Figure: Pareto diagram; bar graph shows % of defects in each category, line graph shows cumulative %]
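The numbers behind a Pareto diagram can be sketched as follows: sort the categories by frequency, convert to percentages, and accumulate. This is a minimal illustration using the defect counts from the example; it computes the values only and does not draw the chart.

```python
from itertools import accumulate

# Defect counts from the example above
defects = {
    "Bad Weld": 34,
    "Poor Alignment": 223,
    "Missing Part": 25,
    "Paint Flaw": 78,
    "Electrical Short": 19,
    "Cracked case": 21,
}

total = sum(defects.values())

# Categories in descending order of frequency (the bars of the diagram)
ordered = sorted(defects.items(), key=lambda item: item[1], reverse=True)

# Percentage for each bar and the running total (the cumulative line)
percents = [100 * count / total for _, count in ordered]
cumulative = list(accumulate(percents))

for (name, count), pct, cum in zip(ordered, percents, cumulative):
    print(f"{name:17s} {count:4d} {pct:6.2f} {cum:7.2f}")
```

In this example the first two categories (Poor Alignment and Paint Flaw) already account for about 75% of all defects, which is the "vital few" idea.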
Graphs to Describe
Numerical Variables
Numerical
Data
Frequency Distributions
and
Cumulative Distributions
Histogram
Ogive
Stem-and-Leaf
Display
Frequency Distributions
What is a Frequency Distribution?
A frequency distribution is a list or a
table
containing class groupings
(categories or ranges within which
the data fall) ...
and the corresponding frequencies
with which data fall within each class grouping
Class Intervals
and Class Boundaries
Each class grouping has the same width
Determine the width of each interval by
$w = \text{interval width} = \frac{\text{largest number} - \text{smallest number}}{\text{number of desired intervals}}$
Frequency Distribution
Example
Example: A manufacturer of insulation
randomly selects 20 winter days and
records the daily high temperature
24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
32, 13, 12, 38, 41, 43, 44, 27, 53, 27
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46,
53, 58
Find range: 58 - 12 = 46
Select number of classes: 5 (the frequency table for this example uses five intervals of width 10, from 10 to 60)
Frequency Distribution
Example (continued)
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41,
43, 44, 46, 53, 58

Interval              Frequency   Relative Frequency   Percentage
10 but less than 20   3           .15                  15
20 but less than 30   6           .30                  30
30 but less than 40   5           .25                  25
40 but less than 50   4           .20                  20
50 but less than 60   2           .10                  10
Total                 20          1.00                 100
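The frequency table for the temperature example can be reproduced with a short script. This is a sketch under the assumptions used in the example: five classes, the raw width 46/5 = 9.2 rounded up to 10, and a first class starting at 10. Each class includes its lower endpoint and excludes its upper endpoint ("10 but less than 20").

```python
# Daily high temperatures from the example (20 winter days)
temps = [24, 35, 17, 21, 24, 37, 26, 46, 58, 30,
         32, 13, 12, 38, 41, 43, 44, 27, 53, 27]

n_classes = 5
data_range = max(temps) - min(temps)   # 58 - 12 = 46
raw_width = data_range / n_classes     # 9.2, rounded up to a convenient 10
width = 10                             # assumed rounded interval width
lower = 10                             # assumed lower endpoint of the first class

# Count the observations in each class [lower + k*width, lower + (k+1)*width)
freq = [0] * n_classes
for t in temps:
    k = (t - lower) // width
    freq[k] += 1

rel_freq = [f / len(temps) for f in freq]

for k, (f, r) in enumerate(zip(freq, rel_freq)):
    lo = lower + k * width
    print(f"{lo} but less than {lo + width}: {f:2d}  {r:.2f}  {100 * r:.0f}%")
```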
Histogram
A graph of the data in a frequency
distribution is called a histogram
The interval endpoints are shown on
the horizontal axis
The vertical axis is either frequency,
relative frequency, or percentage
Bars of the appropriate heights are used
to represent the number of observations
within each class
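Histogram Example

Interval              Frequency
10 but less than 20   3
20 but less than 30   6
30 but less than 40   5
40 but less than 50   4
50 but less than 60   2

[Figure: histogram of the daily high temperatures; horizontal axis: Temperature in Degrees (10 to 60); vertical axis: Frequency; no gaps between bars]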
The Cumulative
Frequency Distribution
Data in ordered array:
12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41,
43, 44, 46, 53, 58

Class                 Frequency   Percentage   Cumulative Frequency   Cumulative Percentage
10 but less than 20   3           15           3                      15
20 but less than 30   6           30           9                      45
30 but less than 40   5           25           14                     70
40 but less than 50   4           20           18                     90
50 but less than 60   2           10           20                     100
Total                 20          100
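The cumulative columns are running totals of the class frequencies. A minimal sketch, using the class frequencies 3, 6, 5, 4, 2 from the temperature example:

```python
from itertools import accumulate

# Class frequencies for the intervals 10-20, 20-30, 30-40, 40-50, 50-60
freq = [3, 6, 5, 4, 2]

# Running total of frequencies, then expressed as a percentage of all observations
cum_freq = list(accumulate(freq))
n_obs = cum_freq[-1]
cum_pct = [100 * c / n_obs for c in cum_freq]

print(cum_freq)
print(cum_pct)
```

Plotting `cum_pct` against the upper interval endpoints gives the ogive.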
The Ogive
[Figure: ogive for the daily high temperature data; horizontal axis: interval endpoints (10 to 60); vertical axis: cumulative percentage, plotted at each upper interval endpoint]
Distribution Shape
The shape of the distribution is said
to be symmetric if the
observations are balanced, or
evenly distributed, about the
center.
[Figure: histogram of a symmetric distribution; vertical axis: Frequency]
Distribution Shape
(continued)
[Figures: two further example frequency distributions]
Relationships Between
Variables
Graphs illustrated so far have involved only
a single variable
When two variables exist other techniques
are used:
Categorical (Qualitative) Variables: Cross tables
Numerical (Quantitative) Variables: Scatter plots
Scatter Diagrams
Scatter Diagrams are used for
paired observations taken from
two numerical variables
The Scatter Diagram:
one variable is measured on the
vertical axis and the other variable
is measured on the horizontal axis
Example data (paired observations):

X     Cost per day
23    125
26    140
29    146
33    160
38    167
42    170
50    188
55    195
60    200
Graphing
Multivariate Categorical Data
(continued)
[Figure: bar chart comparing Investor A, Investor B, and Investor C; axis ticks from 10 to 60]
Statistics for
Business and Economics
6th Edition
Chapter 3
Describing Data: Numerical
Statistics for Business and
Economics, 6e 2007 Pearson
Education, Inc.
Chapter Topics
Measures of central tendency,
variation, and shape
Mean, median, mode, geometric mean
Quartiles
Range, interquartile range, variance and
standard deviation, coefficient of
variation
Symmetric and skewed distributions
Variation
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median: midpoint of ranked values
Mode: most frequently observed value
Arithmetic Mean
The arithmetic mean (mean) is the
most common measure of central
tendency
For a population of N values:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
($x_i$: population values; N: population size)
For a sample of size n:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
($x_i$: observed values; n: sample size)
Arithmetic Mean
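(continued)
Values 1, 2, 3, 4, 5:
$\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$, so Mean = 3
Values 1, 2, 3, 4, 10:
$\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$, so Mean = 4
Replacing a single value (5 by 10) moves the mean from 3 to 4: the mean is affected by extreme values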
Median
In an ordered list, the median is the
middle number (50% above, 50%
below)
[Figure: two number lines from 0 to 10; Median = 3 in both data sets]
Note that $\frac{n+1}{2}$ is the position of the median in the ranked data, not its value
Mode
[Figure: two number lines; one data set with Mode = 9, one data set (values 0 to 6) with No Mode]
Review Example
Five houses on a hill by the beach
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Review Example:
Summary Statistics
House Prices:
$2,000,000
500,000
300,000
100,000
100,000
Sum 3,000,000
Mean: ($3,000,000/5)
= $600,000
Median: middle value of ranked
data
= $300,000
Mode: most frequent value
= $100,000
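The three summary statistics for the house price example can be checked with Python's standard `statistics` module. This is a small illustrative sketch; only the five prices come from the example.

```python
import statistics

# House prices from the review example
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(prices)      # sum of values divided by their number
median_price = statistics.median(prices)  # middle value of the ranked data
mode_price = statistics.mode(prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The single $2,000,000 house pulls the mean up to twice the median, which is why the median is often preferred for skewed data such as prices and incomes.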
Shape of a Distribution
Describes how data are distributed
Measures of shape
Symmetric or skewed
Left-Skewed: typically Mean < Median
Symmetric: Mean = Median
Right-Skewed: typically Median < Mean
Measures of Variability
Variation
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
[Figure: two distributions with the same center, different variation]
Range
Simplest measure of variation
Difference between the largest and
the smallest observations:
$\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$
Example:
[Figure: data plotted on a number line from 0 to 14; smallest value 1, largest value 14]
Range = 14 - 1 = 13
Disadvantages of the
Range
Ignores the way in which data are
distributed
[Figure: two differently distributed data sets on number lines from 7 to 12; both have Range = 12 - 7 = 5]
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
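The outlier example can be reproduced directly. The sketch below defines a small helper, `value_range` (a name introduced here for illustration), and applies it to the two data sets from the slide.

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

# Data from the example: eleven 1s, eight 2s, four 3s, then 4 and 5
data = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]

# Same data with the largest value replaced by an outlier of 120
with_outlier = data[:-1] + [120]

print(value_range(data))          # range without the outlier
print(value_range(with_outlier))  # range with the outlier
```

Changing one observation out of 25 moves the range from 4 to 119, even though the bulk of the data is unchanged.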
Interquartile Range
Can eliminate some outlier problems by
using the interquartile range
Eliminate high- and low-valued
observations and calculate the range of
the middle 50% of the data
Interquartile range = 3rd quartile - 1st
quartile
$\text{IQR} = Q_3 - Q_1$
Interquartile Range
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
(25% of the data lie in each of the four segments)
Interquartile range
= 57 - 30 = 27
Quartiles
Quartiles split the ranked data into 4
segments with an equal number of values
per segment
[Figure: ranked data split by Q1, Q2, and Q3 into four segments of 25% each]
Quartile Formulas
Find a quartile by determining the value in
the appropriate position in the ranked data,
where
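First quartile position: Q1 = 0.25(n+1)
Second quartile position: Q2 = 0.50(n+1) (the median position)
Third quartile position: Q3 = 0.75(n+1)
where n is the number of observed values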
Quartiles
Example: Q1 falls at position 2.5 of the ranked data (between the 2nd and 3rd values),
so use the value half way between the 2nd and 3rd values,
so
Q1 = 12.5
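The position formulas can be turned into a small function. This is a sketch, not the textbook's own procedure: the data set below is hypothetical (the slide's data did not survive), chosen so that Q1 comes out at 12.5, and the linear interpolation between neighbouring ranked values is an assumption that reduces to "half way between" when the position ends in .5.

```python
def quartile(ranked, p):
    """Value at position p*(n+1) in ranked data (1-based positions).

    Assumption: positions between two ranked values are resolved by
    linear interpolation; positions outside 1..n are clamped to the ends.
    """
    n = len(ranked)
    pos = p * (n + 1)
    lower = int(pos)      # rank of the value at or just below the position
    frac = pos - lower    # how far the position lies toward the next value
    if lower < 1:
        return ranked[0]
    if lower >= n:
        return ranked[-1]
    return ranked[lower - 1] + frac * (ranked[lower] - ranked[lower - 1])

# Hypothetical ranked sample (n = 9), not from the original slides
ranked = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])

q1 = quartile(ranked, 0.25)   # position 2.5: half way between 12 and 13
q2 = quartile(ranked, 0.50)   # position 5: the median
q3 = quartile(ranked, 0.75)   # position 7.5: half way between 18 and 21
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Other quartile conventions exist (for example, the default methods in spreadsheet and statistics packages), so results for small samples can differ slightly from this one.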
Population Variance
Average of squared deviations of
values from the mean
Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Where
$\mu$ = population mean
N = population size
Sample Variance
Average (approximately) of squared
deviations of values from the mean
Sample variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Where
$\bar{x}$ = arithmetic mean
n = sample size
Population Standard
Deviation
Most commonly used measure of
variation
Shows variation about the mean
Has the same units as the original
data
Population standard
deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample standard
deviation:
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Calculation Example:
Sample Standard
Deviation
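Sample Data ($x_i$): 10 12 14 15 17 18 18 24
n = 8
Mean = $\bar{x}$ = 16
$s = \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}} = \sqrt{\frac{130}{7}} = 4.3095$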
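The calculation can be verified step by step. The sketch below works through the sample standard deviation formula by hand for the eight data values and compares the result with `statistics.stdev` from the standard library, which also divides by n - 1. The squared deviations from the mean of 16 sum to 130.

```python
import math
import statistics

# Sample data from the calculation example
data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
mean = sum(data) / n                                # 128 / 8 = 16

# Sum of squared deviations from the sample mean
sum_sq_dev = sum((x - mean) ** 2 for x in data)     # 36+16+4+1+1+4+4+64 = 130

sample_variance = sum_sq_dev / (n - 1)              # divide by n - 1, not n
sample_sd = math.sqrt(sample_variance)

print(round(sample_variance, 4), round(sample_sd, 4))
```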
Measuring variation
Small standard deviation
Large standard deviation
Comparing Standard
Deviations
[Figure: three dot plots on a common axis from 11 to 21]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570
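Grouped data: $f_i$ is the frequency and $m_i$ the midpoint of class $i$, for $K$ classes
For a population of N observations, the mean is
$\mu = \frac{\sum_{i=1}^{K} f_i m_i}{N}$ where $N = \sum_{i=1}^{K} f_i$
For a sample of n observations, the mean is
$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$ where $n = \sum_{i=1}^{K} f_i$
For a population of N observations, the variance is
$\sigma^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \mu)^2}{N}$
For a sample of n observations, the variance is
$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$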
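As an illustration of the grouped-data formulas, the sketch below applies them to the temperature frequency distribution from earlier in the lecture (classes 10-20 through 50-60 with frequencies 3, 6, 5, 4, 2). Treating that table as a sample and using class midpoints are assumptions made for this example; the results approximate, but do not equal, the statistics of the raw data.

```python
# Class boundaries and frequencies from the temperature example
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freq = [3, 6, 5, 4, 2]

# Midpoint m_i of each class
midpoints = [(lo + hi) / 2 for lo, hi in classes]

n_obs = sum(freq)   # n = sum of the class frequencies

# Grouped sample mean: sum of f_i * m_i divided by n
grouped_mean = sum(f * m for f, m in zip(freq, midpoints)) / n_obs

# Grouped sample variance: sum of f_i * (m_i - mean)^2 divided by n - 1
grouped_var = sum(f * (m - grouped_mean) ** 2
                  for f, m in zip(freq, midpoints)) / (n_obs - 1)

print(grouped_mean, round(grouped_var, 3))
```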
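Population covariance:
$\text{Cov}(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$
Sample covariance:
$\text{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$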
Interpreting Covariance
Covariance between two variables:
Cov(x,y) > 0: x and y tend to move in the same direction
Cov(x,y) < 0: x and y tend to move in opposite directions
Cov(x,y) = 0: x and y have no linear relationship
Coefficient of Correlation
Measures the relative strength of the linear
relationship between two variables
Population correlation coefficient:
$\rho = \frac{\text{Cov}(x, y)}{\sigma_X \sigma_Y}$
Sample correlation coefficient:
$r = \frac{\text{Cov}(x, y)}{s_X s_Y}$
Features of
Correlation Coefficient, r
Unit free
Ranges between -1 and 1
The closer to -1, the stronger the negative linear
relationship
The closer to 1, the stronger the positive linear
relationship
The closer to 0, the weaker the linear
relationship
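The sample covariance and correlation formulas can be tried on the paired observations from the scatter diagram example earlier in the lecture. This is a minimal sketch: the function names are introduced here for illustration, and the label of the first variable is not known from the slides, so it is simply called `x`.

```python
import math

# Paired observations from the scatter diagram example; y is "Cost per day"
x = [23, 26, 29, 33, 38, 42, 50, 55, 60]
y = [125, 140, 146, 160, 167, 170, 188, 195, 200]

def sample_cov(a, b):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

def sample_sd(a):
    """Sample standard deviation (covariance of a variable with itself, square-rooted)."""
    return math.sqrt(sample_cov(a, a))

cov_xy = sample_cov(x, y)
r = cov_xy / (sample_sd(x) * sample_sd(y))

print(round(cov_xy, 2), round(r, 3))
```

The covariance is positive and r is close to 1, which matches the strong upward pattern in the scatter diagram. Unlike the covariance, r does not change if either variable is rescaled, which is what "unit free" means.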
[Figure: scatter plots of Y against X for r = -1, r = -.6, r = 0, r = +.3, and a second r = 0 case]