
Unit 1

Descriptive Statistics

Dr. Avanti Sethi


University of Texas at Dallas

4.1
Overview
These are the final scores of the students in the Statistics class.

90 100 90 99 86 100 84 94 80 92 97 82 88
84 94 100 86 81 86 93 97 95 84 86 81 88
90 86 96 88 84 86 86 82 96 98 81 89 98
97 92 91 89 83 80 87 94 86 89 94 86 82
80 86 80 93 93 97 93 82 94 88 93 85 95
44 52 65 65 69 46 33 96 79 95 38 52 30
98 82 81 82 90 80 91 93 96 85 100 87 99
51 78 82 94 78 96 87 84 80 88 83 69 62
82 92 83 85 83 97 89 90 94 84 90 84 85
79 70 70 69 62 76 81 90 81 66 86 69 76
94 83 100 94 93 100 89 92 83 92 83 89 83
69 79 91 95 65 89 62 42 87 50 72 53 69
81 85 82 94 93 94 91 86 95 89 93 92 100
98 86 89 86 84 96 85 99 81 89 93 98 91
85 97 93 96 81 85 94 92 98 95 82 99 92
70 72 93 71 73 52 74 65 62 100 100 71 62
71 96 74 81 62 80 67 95 61 97 96 100 87
87 98 89 98 92 80 88 95 95 76 91 60 93
94 82 88 90 94 91 83 92 85 83 83 87 89
95 83 72 91 94 89 100 82 72 89 95 90 72
94 92 87 90 95 92 83 88 89 82 89 98 99
99 77 98 80 74 93 87 89 65 76 64 76 80
91 88 81 92 81 95 80 86 100 84 81 97 94
99 97 93 93 98 92 88 80 83 89 84 94 89

How do you make sense of these numbers?
What do you tell the others about how your class did?
How do you compare this class with the other class?

4.2
Numerical Descriptive Techniques…

Measures of Central Location
Mean, Median, Mode

Measures of Variability
Range, Standard Deviation, Variance, Coefficient of Variation

Measures of Relative Standing
Percentiles, Quartiles

Measures of Linear Relationship
Covariance, Correlation, Determination, Least Squares Line

4.3
Measures of Central Location…
The median is calculated by placing all the observations in order;
the observation that falls in the middle is the median.

Data: {0, 7, 12, 5, 14, 8, 0, 9, 22} N = 9 (odd)
Sort them from smallest to largest and find the middle observation:
0 0 5 7 8 9 12 14 22
median = 8

Data: {0, 7, 12, 5, 14, 8, 0, 9, 22, 33} N = 10 (even)
Sort them from smallest to largest; the middle is the simple average
of the two central observations, 8 and 9:
0 0 5 7 8 9 12 14 22 33
median = (8 + 9) ÷ 2 = 8.5

Sample and population medians are computed the same way.
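
The two cases above can be checked with a minimal sketch using Python's standard `statistics` module. The variable names are only illustrative.

```python
from statistics import median

odd_data = [0, 7, 12, 5, 14, 8, 0, 9, 22]        # N = 9 (odd)
even_data = [0, 7, 12, 5, 14, 8, 0, 9, 22, 33]   # N = 10 (even)

# median() works on a sorted copy of the data and averages the two
# middle observations when the number of observations is even.
odd_median = median(odd_data)
even_median = median(even_data)

print(sorted(odd_data), "-> median", odd_median)    # median 8
print(sorted(even_data), "-> median", even_median)  # median 8.5
```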


4.5
Mean, Median, Mode: Which Is Best?

With three measures from which to choose, which one should we use?

The mean is generally our first selection. However, there are several
circumstances (like presence of extreme values) when the median is
better.

One advantage the median holds is that it is not as sensitive to extreme
values as the mean is.

The mode of a set of observations is the value that occurs most
frequently. It is seldom the best measure of central location.
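
As a rough illustration of that sensitivity, the sketch below takes the nine observations from the median slide and appends one extreme value. The value 1000 is hypothetical, chosen only to exaggerate the effect.

```python
from statistics import mean, median, mode

data = [0, 7, 12, 5, 14, 8, 0, 9, 22]
with_outlier = data + [1000]  # hypothetical extreme value, not from the slides

for label, values in (("original", data), ("with outlier", with_outlier)):
    print(f"{label}: mean={mean(values):.2f}, "
          f"median={median(values)}, mode={mode(values)}")
```

The mean jumps from about 8.56 to 107.7, while the median only moves from 8 to 8.5 and the mode stays at 0.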

4.6
Mean, Median, Mode…
If a distribution is asymmetrical, say skewed to the left or to the right,
the three measures may differ.

[Figure: a skewed distribution with the mode, median, and mean marked at different positions]

Mean or Median?
The median home value in Plano is $334,300. Plano home
values have gone up 8.4% over the past year and Zillow
predicts they will rise 6.2% within the next year.

The median list price per square foot in Plano is $150, which
is higher than the Dallas-Fort Worth-Arlington Metro
average of $138.

Why are we comparing median to average?
Plano: Income by Population Percent

https://www.bestplaces.net/economy/city/texas/plano
Geometric Mean
The arithmetic mean is the single most popular and useful measure of central
location.

However, there is another circumstance where neither the mean nor the
median is the best measure.

When the variable is a growth rate or rate of change, such as the value of an
investment over periods of time or growth of bacteria, we need another
measure – the geometric mean.

4.10
Geometric Mean
Let R_i denote the rate of return (in decimal form) in period i (i = 1, 2, …, n). The
geometric mean R_g of the returns is defined such that

$$(1 + R_g)^n = (1 + R_1)(1 + R_2)\cdots(1 + R_n)$$

Solving for R_g, we obtain the following formula:

$$R_g = \sqrt[n]{(1 + R_1)(1 + R_2)\cdots(1 + R_n)} - 1$$

Question: If your investment gained 9.4% the 1st year, but you lost 4.1% the 2nd year, what
was your average return over the two years?
Ans: 2.4278%
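
The answer can be reproduced with a short sketch. The function name `geometric_mean_return` is an illustrative choice, not something from the slides. Note that the arithmetic mean of the two returns, (9.4% − 4.1%) ÷ 2 = 2.65%, overstates the growth actually achieved.

```python
import math

def geometric_mean_return(returns):
    """Return R_g such that (1 + R_g)**n equals the product of (1 + R_i)."""
    growth = math.prod(1 + r for r in returns)
    return growth ** (1 / len(returns)) - 1

returns = [0.094, -0.041]            # +9.4% in year 1, -4.1% in year 2
r_g = geometric_mean_return(returns)
print(f"{r_g:.4%}")                  # 2.4278%
```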

4.11
Measures of Variability…
Measures of central location fail to tell the whole story about the
distribution; that is, how much are the observations spread out around
the mean value?

For example, two sets of class grades are shown. The mean (= 50) is the
same in each case…

But the red class has greater variability than the blue class.

4.12
Range…

The range is the simplest measure of variability, calculated as:

Range = Largest observation – Smallest observation

E.g.
Data: {4, 4, 4, 4, 50} Range = 46
Data: {4, 8, 15, 24, 39, 50} Range = 46
The range is the same in both cases,
but the data sets have very different distributions…
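
A minimal sketch of the range calculation for the two data sets above. `value_range` is a hypothetical helper name.

```python
def value_range(data):
    """Range = largest observation - smallest observation."""
    return max(data) - min(data)

a = [4, 4, 4, 4, 50]
b = [4, 8, 15, 24, 39, 50]

print(value_range(a), value_range(b))  # 46 46
```

Because the range uses only the two extreme observations, it cannot distinguish between these two very different data sets.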

4.13
Variance…

Variance and its related measure, standard deviation, are arguably the
most important statistics. Used to measure variability, they also play a
vital role in almost all statistical inference procedures.

Population variance is denoted by σ²

Sample variance is denoted by s²

4.14
Variance…

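The variance of a population is:

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

The variance of a sample is:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

The standard deviation is simply the square root of the variance:
$\sigma = \sqrt{\sigma^2}$ and $s = \sqrt{s^2}$.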
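
The sketch below computes both versions with Python's `statistics` module, using the standard definitions: the population variance divides the sum of squared deviations by N, and the sample variance divides by n − 1. The data set is borrowed from the Range slide purely for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [4, 8, 15, 24, 39, 50]

pop_var = pvariance(data)   # divides by N
samp_var = variance(data)   # divides by n - 1
pop_sd = pstdev(data)       # square root of the population variance
samp_sd = stdev(data)       # square root of the sample variance

print(f"population: variance={pop_var:.2f}, sd={pop_sd:.2f}")
print(f"sample:     variance={samp_var:.2f}, sd={samp_sd:.2f}")
```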

4.15
Interpreting Standard Deviation…

The standard deviation can be used to compare the variability of several
distributions and make a statement about the general shape of a
distribution. If the histogram is bell shaped, we can use the Empirical
Rule, which states:

1) Approximately 68% of all observations fall within one
standard deviation of the mean.
2) Approximately 95% of all observations fall within two
standard deviations of the mean.
3) Approximately 99.7% of all observations fall within three
standard deviations of the mean.
4.16
The Empirical Rule…

Approximately 68% of all observations fall within one standard deviation of the mean.

Approximately 95% of all observations fall within two standard deviations of the mean.

Approximately 99.7% of all observations fall within three standard deviations of the mean.
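
The three percentages come from the Normal (bell-shaped) distribution. As a quick check, this sketch uses `statistics.NormalDist` (Python 3.8+) to compute the exact share of a Normal distribution within k standard deviations of its mean. `share_within` is an illustrative helper name.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a Normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {share_within(k):.2%}")
```

Real data that are only roughly bell shaped will match these figures approximately, not exactly.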
4.17
Chebysheff’s Theorem…
A more general interpretation of the standard deviation is derived from
Chebysheff’s Theorem, which applies to all shapes of histograms (not
just bell shaped).

The proportion of observations in any sample that lie
within k standard deviations of the mean is at least:

$$1 - \frac{1}{k^2} \quad \text{for } k > 1$$

For k = 2 (say), the theorem states that at least 3/4 of all observations
lie within 2 standard deviations of the mean. This is a “lower bound”
compared to the Empirical Rule’s approximation (95%).

4.18
Chebysheff’s theorem - example
There are 5,000 numbers whose standard deviation is 18. At least how many of
the numbers will be between 55 and 105?

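Assume the interval is centered on the mean:
Mean = (55 + 105)/2 = 80

The interval extends 25 on either side of the mean (55 to 80 to 105), so
k = 25/18 = 1.3889
That is, we are 1.3889 σ's around the mean.

$$1 - \frac{1}{k^2} = 1 - \frac{1}{1.3889^2} = 0.4816 = 48.16\%$$

Numbers in the 55 – 105 range = 0.4816 × 5000 = 2408 (at least this many)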
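
The same calculation as a sketch, assuming (as the worked example does) that the interval 55 to 105 is centered on the mean. The helper name `chebyshev_lower_bound` is illustrative.

```python
import math

def chebyshev_lower_bound(k):
    """Minimum share of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

n, sigma = 5000, 18
low, high = 55, 105
mean = (low + high) / 2        # 80, assuming a symmetric interval
k = (high - mean) / sigma      # 25 / 18 = 1.3889
bound = chebyshev_lower_bound(k)

# The tiny tolerance guards against floating-point error before rounding up.
at_least = math.ceil(bound * n - 1e-9)
print(f"k = {k:.4f}, bound = {bound:.2%}, at least {at_least} numbers")
```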
4.19
Coefficient of Variation…

The coefficient of variation of a set of observations is the standard deviation of
the observations divided by their mean, that is:

Population coefficient of variation: $CV = \sigma / \mu$
Sample coefficient of variation: $cv = s / \bar{x}$

• Remarks
• Allows comparison of the variability of populations with significantly different mean values
• Distributions with CV < 1 could be considered low-variance
• Distributions with CV > 1 could be considered high-variance
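
A minimal sketch of the population form (standard deviation divided by mean), applied to the two data sets from the Range slide. Treating them as populations is an assumption made only for this illustration.

```python
from statistics import mean, pstdev

def coefficient_of_variation(data):
    """Population CV: standard deviation divided by the mean."""
    return pstdev(data) / mean(data)

a = [4, 4, 4, 4, 50]
b = [4, 8, 15, 24, 39, 50]

cv_a = coefficient_of_variation(a)
cv_b = coefficient_of_variation(b)
print(f"CV(a) = {cv_a:.3f}, CV(b) = {cv_b:.3f}")
```

Under the rule of thumb in the remarks, the first set (CV above 1) would count as high-variance and the second (CV below 1) as low-variance, even though both have the same range.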

4.20
Analysis of Shape

• Skewness

• Kurtosis

21
Skewness
Skewness measures the asymmetry of the distribution of a random variable. It is not
always easy to judge skewness simply by looking at the graph, because both the length
of the tail and its fatness influence the skewness.

• Skewness = 0: Symmetric around the mean
• Skewness < 0: Skewed to the left (Negatively skewed)
• Skewness > 0: Skewed to the right (Positively skewed)

In Excel, use the SKEW function.
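
For readers working outside Excel, here is a sketch of the adjusted sample skewness, n / ((n − 1)(n − 2)) × Σ((x − x̄) / s)³, which to my understanding is the formula Excel documents for SKEW. The function name and the small data sets are illustrative.

```python
from statistics import mean, stdev

def sample_skewness(data):
    """Adjusted sample skewness; needs at least 3 observations."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in data)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [4, 4, 4, 4, 50]             # long tail to the right
left_skewed = [-x for x in right_skewed]    # mirror image

for name, values in (("symmetric", symmetric),
                     ("right-skewed", right_skewed),
                     ("left-skewed", left_skewed)):
    print(f"{name}: {sample_skewness(values):+.3f}")
```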

4.22
Skewness

Plano: Skewness = Positive? Negative?

4.23
Kurtosis

Kurtosis measures the “weight” of the tails relative to the center. Thus, a
distribution with heavier tails tends to have a higher kurtosis than a
distribution with lighter tails.

Kurtosis is measured with reference to the Normal distribution. A distribution that
has a significant share of its data outside the 3-sigma limits will have a higher
kurtosis. The t-distribution with few degrees of freedom falls in this category. In
contrast, the Uniform distribution, which has no tails, has a low kurtosis.
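
One common way to put a number on this is excess kurtosis: the fourth central moment divided by the squared variance, minus 3, so that the Normal distribution scores 0. The sketch below uses that population-moment definition as an assumption. Spreadsheet and statistics packages often apply small-sample adjustments, so their figures can differ. Both data sets are hypothetical.

```python
from statistics import fmean

def excess_kurtosis(data):
    """Population excess kurtosis: m4 / m2**2 - 3 (Normal distribution = 0)."""
    m = fmean(data)
    m2 = fmean([(x - m) ** 2 for x in data])
    m4 = fmean([(x - m) ** 4 for x in data])
    return m4 / m2 ** 2 - 3

heavy_tails = [0] * 18 + [-10, 10]   # most values at the center, two far out
flat = list(range(1, 101))           # evenly spread values, no tails

heavy_k = excess_kurtosis(heavy_tails)
flat_k = excess_kurtosis(flat)
print(f"heavy tails: {heavy_k:+.2f}, flat: {flat_k:+.2f}")
```

The heavy-tailed set scores well above 0 and the evenly spread set scores below 0, in line with the description above.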

4.24
[Figure panels: Positive (High) Kurtosis Example; Negative (Low) Kurtosis Example]
Figure shows a dataset with more weight in the tails. The kurtosis of
this dataset is 1.86.

4.25
Measures of Relative Standing & Box Plots

Measures of relative standing are designed to provide information about
the position of particular values relative to the entire data set.

Percentile: the Pth percentile is the value for which P percent of the
observations are less than that value and (100 – P)% are greater than that value.

If you scored in the 60th percentile on the GMAT, then 60% of
the other scores were below yours, while 40% of the scores were above yours.
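
A small sketch of the idea, counting the share of scores strictly below a given value. The ten scores are hypothetical and `percent_below` is an illustrative helper. Published percentiles may handle ties and interpolation differently.

```python
def percent_below(scores, value):
    """Percentage of observations strictly less than `value`."""
    below = sum(1 for s in scores if s < value)
    return 100 * below / len(scores)

scores = [52, 61, 65, 70, 74, 78, 81, 85, 90, 96]  # hypothetical test scores

print(percent_below(scores, 81))  # 60.0
```

Here six of the ten scores are below 81, so 81 sits at about the 60th percentile of this small data set.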

4.26
Quartiles…

We have special names for the 25th, 50th, and 75th percentiles, namely
quartiles.

The first or lower quartile is labeled Q1 = 25th percentile.

The second quartile, Q2 = 50th percentile (which is also the median).

The third or upper quartile, Q3 = 75th percentile.

4.27
UT Dallas Admission stats for 2016

• University of Texas-Dallas Acceptance Rate: 68 percent

• Test Scores: 25th / 75th Percentile
• SAT Critical Reading: 550 / 670
• SAT Math: 590 / 710
• SAT Writing: 520 / 650
• ACT Composite: 25 / 31
• ACT English: 24 / 32
• ACT Math: 26 / 32
• ACT Writing: - / -

4.28
5-number Summary

• It includes
• The smallest number
• The 1st quartile
• The 2nd quartile
• The 3rd quartile
• The largest number

Thus, S, Q1, Q2, Q3, L will be a 5-number summary. This is used in creating
box plots.

{45, 60, 73, 90, 123}


{ S, Q1, Q2, Q3, L}
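
A sketch of how a 5-number summary might be computed with `statistics.quantiles` (Python 3.8+). Quartile conventions vary between textbooks and software. This uses the module's default "exclusive" method, which is an assumption rather than necessarily the convention used in the course. The data are the nine observations from the median slide.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (S, Q1, Q2, Q3, L) for the data."""
    q1, q2, q3 = quantiles(data, n=4)  # default method="exclusive"
    return (min(data), q1, q2, q3, max(data))

data = [0, 7, 12, 5, 14, 8, 0, 9, 22]
print(five_number_summary(data))  # (0, 2.5, 8.0, 13.0, 22)
```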

4.29
Interquartile Range…

The quartiles can be used to create another measure of variability, the
interquartile range, which is defined as follows:
Interquartile Range (IQR) = Q3 – Q1
The interquartile range measures the spread of the middle 50% of the
observations.
Large values of this statistic mean that the 1st and 3rd quartiles are far apart,
indicating a high level of variability.

• Impact of outlier data:
• IQR is largely unaffected
• Range is directly affected
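
To illustrate the effect of an outlier, the sketch below replaces the largest of nine observations with a hypothetical extreme value (220) and compares the two measures. The quartile convention is again Python's default "exclusive" method, which is an assumption.

```python
from statistics import quantiles

def iqr(data):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1

def value_range(data):
    return max(data) - min(data)

base = [0, 0, 5, 7, 8, 9, 12, 14, 22]
with_outlier = [0, 0, 5, 7, 8, 9, 12, 14, 220]  # 22 replaced by a hypothetical 220

print("IQR:  ", iqr(base), "->", iqr(with_outlier))
print("Range:", value_range(base), "->", value_range(with_outlier))
```

In this example the IQR stays at 10.5 while the range grows from 22 to 220.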

4.30
Box Plots…
The box plot is a technique that graphs five statistics (the 5-number summary):
• the minimum and maximum observations, and
• the first, second, and third quartiles.

[Figure: box plot with two whiskers, each at most 1.5 × (Q3 – Q1) long]

The lines extending to the left and right of the box are called whiskers. Any points
that lie outside the whiskers are called outliers. Each whisker extends outward from
the box to the most extreme point that is not an outlier, and is never longer than
1.5 times the interquartile range.
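
The whisker rule can be sketched as follows, assuming the usual 1.5 × IQR limits measured from Q1 and Q3 and Python's default quartile convention. The data, including the value 220, are hypothetical, and the function name is illustrative.

```python
from statistics import quantiles

def whiskers_and_outliers(data):
    """Return ((lower whisker end, upper whisker end), outliers)."""
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    inside = [x for x in data if lower_limit <= x <= upper_limit]
    outliers = [x for x in data if x < lower_limit or x > upper_limit]
    # Whiskers stop at the most extreme observations that are not outliers.
    return (min(inside), max(inside)), outliers

data = [0, 0, 5, 7, 8, 9, 12, 14, 220]
whisker_ends, outliers = whiskers_and_outliers(data)
print(whisker_ends, outliers)  # (0, 14) [220]
```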
4.31
Example 4.15

A large number of fast-food restaurants with drive-through windows offer
drivers and their passengers the advantages of quick service. To measure how
good the service is, an organization called QSR planned a study wherein the
amount of time taken by a sample of drive-through customers at each of five
restaurants was recorded. Compare the five sets of data using a box plot and
interpret the results.
Dataset Xm04-15

4.32
Box Plots…

Wendy’s service time is shortest and least variable.

Hardee’s has the greatest variability.

Jack-in-the-Box has the longest service times.

4.33
Stroop Interference

Naming the ink color of color words can be difficult. For example,
naming the ink color of the word "blue" printed in red ink is difficult because
the answer (red) conflicts with the word "blue." This interference is called
"Stroop Interference" after the researcher who first discovered the
phenomenon.
There were 31 female and 16 male students in an intro stat class who
were part of this experiment.

Variable Description
Gender 1 for female, 2 for male
Words Time in seconds to read 60 color words
Colors Time in seconds to name 60 color rectangles
Interfer Time in seconds to name colors of conflicting words

4.34
4.35
Guys' eyes are more sensitive to small details and moving objects, while women are more
perceptive of color changes.

They found that the guys required a slightly longer wavelength of a color to experience
the same shade as women, and the men were less able to tell the difference between
hues.

4.36
Stroop Interference (time for colors)

http://onlinestatbook.com/2/graphing_distributions/boxplots.html
4.38
Measures of Linear Relationship…

We now present three numerical measures of linear relationship that
provide information as to the strength & direction of a linear
relationship between two variables (if one exists).

They are the covariance, the coefficient of correlation, and the
coefficient of determination.

4.39
Beer vs blood alcohol
Student   Beers   BAC
1         5       0.1
2         2       0.03
3         9       0.19
4         8       0.12
5         3       0.04
6         7       0.095
7         3       0.07
8         5       0.06
9         3       0.02
10        5       0.05
11        4       0.07
12        6       0.1
13        5       0.085
14        7       0.09
15        1       0.01
16        4       0.05

Covariance… (Generally speaking)

When two variables move in the same direction (both increase or
both decrease), the covariance will be a large positive number.

When two variables move in opposite directions, the covariance
is a large negative number.

When there is no particular pattern, the covariance is a small
number.

However, it is often difficult to determine whether a particular
covariance is large or small. The next parameter/statistic
addresses this problem.
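
The sketch below illustrates both points with the usual sample covariance, Σ(x − x̄)(y − ȳ) / (n − 1). That formula is assumed here because it does not appear in this part of the slides. The small data sets are hypothetical.

```python
def sample_covariance(x, y):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]                  # moves with x
y_down = [10, 8, 6, 4, 2]                # moves against x
y_up_rescaled = [100 * v for v in y_up]  # same relationship, different units

print(sample_covariance(x, y_up))           # 5.0
print(sample_covariance(x, y_down))         # -5.0
print(sample_covariance(x, y_up_rescaled))  # 500.0
```

The sign follows the direction of the relationship, but the size changes with the units of measurement, which is why it is hard to say whether a given covariance is large or small.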
4.42
Coefficient of Correlation…

The advantage of the coefficient of correlation over covariance is
that it has a fixed range from -1 to +1, thus:

If the two variables are very strongly positively related, the
coefficient value is close to +1 (strong positive linear relationship).

If the two variables are very strongly negatively related, the
coefficient value is close to -1 (strong negative linear relationship).

No straight-line relationship is indicated by a coefficient close to
zero.
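
As an illustration, this sketch computes the coefficient of correlation for the beer and blood alcohol data listed earlier (values in student order 1 to 16), using the standard formula r = S_xy / √(S_xx · S_yy). The function name is an illustrative choice.

```python
import math

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]

def correlation(x, y):
    """Coefficient of correlation between two equal-length lists."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

r = correlation(beers, bac)
r_squared = r ** 2  # coefficient of determination
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")

# Unlike covariance, r does not change when BAC is expressed in other units.
r_percent_units = correlation(beers, [100 * b for b in bac])
```

A value of roughly 0.89 indicates a strong positive linear relationship between the number of beers and blood alcohol content.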

4.43
Parameters and Statistics
                              Population    Sample
Size                          N             n
Mean                          μ             x̄
Variance                      σ²            s²
Standard Deviation            σ             s
Coefficient of Variation      CV            cv
Covariance                    σxy           sxy
Coefficient of Correlation    ρ             r
4.44
