Notes Chapter 1

The document discusses various statistical measures used to summarize quantitative data, including the mean, median, quartiles, and interquartile range. It provides examples and explanations of each measure using a sample dataset of executive salaries. The mean is the average and is calculated by summing all values and dividing by the total number. The median is the middle value when values are ordered from lowest to highest. Quartiles divide a dataset into four equal parts, with the first and third quartiles being the 25th and 75th percentiles. The interquartile range describes the spread of the middle 50% of values by taking the difference between the third and first quartiles.

Uploaded by

Michael Sheng
Copyright
© All Rights Reserved
Displaying and Describing Quantitative Data
1.2 Histograms in Descriptive Statistics, 1.3 Measures of Location, 1.4 Measures of Variability

Thought Question
1. If you were to read the results of a study showing
that daily use of a certain exercise machine resulted
in an average 10-pound weight loss, what more
would you want to know about the numbers in
addition to the average?

(Hint: Do you think everyone who used the machine
lost 10 pounds?)

Sample Distribution of Men's Height in U.S.

[Figure: histogram of men's height (horizontal axis, 60 to 82) against frequency (vertical axis, 0 to 30).]
Sample size n = 200
Sample mean (x̄) = 69.65
Sample standard deviation (s) = 3.36

We use a theoretical distribution instead of the sample distribution. This allows us to answer
certain questions using only the mean and standard deviation of the data.

Histogram for quantitative variable

The height values are grouped into bins. The x-axis plots the bins.
The y-axis plots the frequency (aka counts).
The height of each bar represents the number of subjects whose heights fall into a specific
bin, e.g., between 170 cm and 173 cm.

Displaying Measurement (Quantitative) Data


Histograms
Most commonly used tool in descriptive
statistics.
Histogram for discrete data:
Determine the frequency and relative frequency of each x value.
Mark possible x values on a horizontal scale.
Above each value, draw a rectangle whose height is the relative frequency (or the frequency) of that value.
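
The counting step for discrete data can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the observations are hypothetical and the function name `frequency_table` is an assumption made for the example.

```python
from collections import Counter

# Hypothetical discrete observations (for example, a count recorded per subject).
observations = [0, 2, 1, 2, 3, 2, 1, 0, 2, 1]


def frequency_table(values):
    """Return {x: (frequency, relative frequency)} ordered by x."""
    counts = Counter(values)
    n = len(values)
    return {x: (counts[x], counts[x] / n) for x in sorted(counts)}


table = frequency_table(observations)
for x, (freq, rel_freq) in table.items():
    # A text "rectangle" whose length is the frequency of x.
    print(f"{x}: {'#' * freq}  freq={freq}  rel_freq={rel_freq:.2f}")
```

Each printed row plays the role of one rectangle; plotting the relative frequencies instead of the counts changes only the scale of the vertical axis.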

Displaying Measurement (Quantitative) Data


Histograms
Histogram for continuous data:
Divide the range of the data into classes (5-10) of
equal width.
Determine the frequency and relative frequency for
each class.
Mark the class boundaries on a horizontal
measurement axis.
Above each class interval, draw a rectangle whose
height is the corresponding frequency or relative
frequency (frequency / sample size).
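
For continuous data, the same steps can be sketched as follows. The heights are hypothetical, the helper name `class_frequencies` is an assumption, and the sketch assumes the values are not all equal (otherwise the class width would be zero).

```python
def class_frequencies(values, num_classes):
    """Split the data range into equal-width classes and count each class."""
    low, high = min(values), max(values)
    width = (high - low) / num_classes  # assumes high > low
    counts = [0] * num_classes
    for v in values:
        # The maximum value would index one past the end, so keep it in the last class.
        index = min(int((v - low) / width), num_classes - 1)
        counts[index] += 1
    boundaries = [low + i * width for i in range(num_classes + 1)]
    rel_freqs = [c / len(values) for c in counts]
    return boundaries, counts, rel_freqs


# Hypothetical heights in cm.
heights = [160, 162, 165, 167, 168, 170, 171, 171, 172, 174, 175, 176, 178, 180, 185]
boundaries, counts, rel_freqs = class_frequencies(heights, 5)
print(boundaries)  # class boundaries for the horizontal axis
print(counts)      # frequency of each class
print(rel_freqs)   # frequency / sample size
```

Here each class includes its lower boundary and excludes its upper boundary, except the last class, which also includes the maximum. Other conventions are possible.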

Bin width for histogram


Narrower bins, width < 0.5 cm

Wider bins, width = 20 cm

Histograms: Displaying the Distribution of Earthquake Magnitudes

A histogram plots the bin counts as the heights of bars (like a bar chart).
It displays the distribution at a glance.
Here is a histogram of earthquake magnitudes:

Histograms: Displaying the Distribution of Earthquake Magnitudes (cont.)

A relative frequency histogram displays the percentage of cases in each bin instead of the count.

Here is a relative frequency histogram of earthquake magnitudes:

Use of histogram

The shape of the histogram tells how quantitative values are distributed.

1. What ranges of values occur more frequently and what ranges of values occur less frequently?
2. Whether the distribution is symmetric
3. The modality of the distribution: is there one major hump
(unimodal) or more (bimodal or multimodal)
4. Check outliers: any data points far from the rest? Oddballs or errors?

Examining Distributions: Various Shapes of Histograms

Symmetry

The (usually) thinner ends of a distribution are called the tails. If one tail stretches out
farther than the other, the histogram is said to be skewed to the side of the longer tail.
In the figure below, the histogram on the left is said to be
skewed left, while the histogram on the right is said to be
skewed right.

Statistical Methods - A workman is known by his tools. - Anonymous proverb

There are many hundreds of useful tools (statistical methods) for analyzing data and drawing conclusions.

Like all tools, the effectiveness of the statistical methods depends on using them appropriately.

They are often concerned with summarizing data so that we can draw some conclusions
without looking at the data in detail.

Examples of such tools of summarization are the mean, the median, and the standard deviation
(a measure of the scatter, or dispersion, of the data).

Quantitative Data: Typical Value

The most commonly used statistical summary measure is a typical value for a set of data.

Why would someone want a typical value for a set of data?

An athlete might want to know the typical time for a particular knee injury to heal.

A researcher might want to know the average cholesterol reduction of a particular drug.

An investor might want to know the typical annual return of mutual funds in an industry sector.

We also think of a typical value as a measure of central tendency, showing where the data tend to cluster.

Central Tendency: The Mean

The data below are the annual salaries of 10 business executives (in thousands of
dollars):

890
1,110
1,460
1,420
2,000
1,430
1,520
1,110
2,400
1,680

The arithmetic mean, usually called the mean or the average, is the sum of all data
values divided by the number of such values.
In this case, the total for all the salaries is about $15 million ($15,020,000); divided by 10
you get a mean executive salary of about $1.5 million ($1,502,000).
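
As a quick check of the arithmetic, here is a minimal Python sketch of the mean for the ten salaries listed above. The function name `mean` is an illustrative choice. The exact total is 15,020 (thousand dollars), which rounds to the $15 million quoted in the text.

```python
# Annual salaries of the 10 executives, in thousands of dollars (from the list above).
salaries = [890, 1110, 1460, 1420, 2000, 1430, 1520, 1110, 2400, 1680]


def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)


total = sum(salaries)
average = mean(salaries)
print(total)    # 15020
print(average)  # 1502.0
```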

Central Tendency: The Mean

The arithmetic mean has the most meaning when the values
are closely centered, with few exceptional values and tending
to symmetry about the mean.
Salary Example
But suppose that the one executive who earned $1,460,000
had a profit-sharing bonanza one year and earned $5
million more, for a total salary of $6,460,000 instead of
$1,460,000.

While most of the executive salaries are still around $1.5 million and only one other makes
more than $2 million, the mean has jumped from $1.5 million to $2 million, an increase
in the value of the mean of more than 30%.

Mean - Continued

The mean: pros and cons

Pros:
Easy to understand and calculate
Uses all observations
Stable across different samples from the same population

Cons:
Not applicable to qualitative variables
Problems when some observations are missing
VERY MUCH AFFECTED BY EXTREME VALUES

Central Tendency: The Median

The median is the value such that about half the population
have values below it and half have values above it.
Salary Example
To get the value of the median, take all the numbers
you have collected, and order them by increasing
value.
Once the numbers have been ordered, the median is
the middle value (if the number of values is odd) or
the average of the two middle values (if the number
of values is even).

Central Tendency: The Median

To get the median of the salaries, order the values as shown below:
890
1,110
1,110
1,420
1,430
1,460
1,520
1,680
2,000
2,400

Then find the middle value (or, as in this case, the average of the
middle two values) to get a median executive salary of $1,445,000
(the sum of $1,430,000 and $1,460,000, divided by 2).
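
The ordering-and-middle-value procedure can be sketched in Python as follows. The function name `median` is an illustrative choice; the data are the ten salaries from the text, in thousands of dollars.

```python
def median(values):
    """Middle value of the ordered data, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [890, 1110, 1460, 1420, 2000, 1430, 1520, 1110, 2400, 1680]
print(median(salaries))  # 1445.0, i.e. $1,445,000
```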

Central Tendency: The Median

Salary Example
Note that in the original data set, the median of
$1,445,000 is only a little less than the arithmetic
mean of $1.5 million.

But when the one executive's $1,460,000 salary is increased to $6,460,000, the median barely
changes: the two middle values become $1,430,000 and $1,520,000, so the median moves only
to $1,475,000.

At $1,475,000, the median is still typical of the executive salaries. The mean does change, however,
and the new mean of $2 million is not a typical value.
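
Recomputing both measures for the salary example shows how differently they react to the one very large salary. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

# Salaries in thousands of dollars, from the text.
salaries = [890, 1110, 1460, 1420, 2000, 1430, 1520, 1110, 2400, 1680]

# Replace the 1,460 salary with 6,460 (the profit-sharing year).
with_bonus = [6460 if s == 1460 else s for s in salaries]

mean_change_pct = 100 * (mean(with_bonus) - mean(salaries)) / mean(salaries)

print(mean(salaries), mean(with_bonus))      # 1502 2002
print(median(salaries), median(with_bonus))  # 1445.0 1475.0
print(round(mean_change_pct, 1))             # 33.3
```

The mean rises by about a third, while the median shifts only from 1,445 to 1,475, because the two middle values of the ordered data become 1,430 and 1,520.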

The Median

The median: pros and cons

Pros:
Easily defined
Easy to calculate
Stable, not affected by outliers

Cons:
Stability has its downside
The median is not based on all observations

Trimmed Mean

To explicitly remove the impact of outliers, trimmed means can be used.

A 10% trimmed mean, for example, would be computed by eliminating the smallest 10% and
largest 10% of the sample and then averaging what remains.

In sports like gymnastics and diving, trimmed means are used to calculate the scores of athletes.
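
A trimmed mean can be sketched as follows. The function name `trimmed_mean` and its integer `trim_percent` parameter are assumptions made for this illustration. The sketch also assumes that the number of observations dropped from each end is rounded down when the percentage does not give a whole number, and that `trim_percent` is below 50.

```python
def trimmed_mean(values, trim_percent):
    """Drop the smallest and largest trim_percent% of the values, then average the rest."""
    ordered = sorted(values)
    k = len(ordered) * trim_percent // 100  # observations dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Executive salaries (thousands of dollars), with the 6,460 profit-sharing salary.
with_bonus = [890, 1110, 6460, 1420, 2000, 1430, 1520, 1110, 2400, 1680]
print(trimmed_mean(with_bonus, 10))  # 1583.75
```

With ten values, a 10% trim drops one value from each end, so the 6,460 salary no longer affects the result.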

What Is Middle Class in Manhattan?


NY Times, Jan 18th, 2013
The average sale price of a home in Manhattan last
year was $1.46 million, according to a recent Douglas
Elliman report, while the average sale price for a new
home in the United States was just under $230,000.
The average of $1.46 million they were referring to was
the mean sales price.
The median sales price of a home in Manhattan was
$837,500.

F.D.A. Revokes Approval of Avastin for Use as Breast Cancer Drug
NY Times, Nov. 18th, 2011

The commissioner of the Food and Drug Administration on Friday revoked the approval of the
drug Avastin as a treatment for breast cancer, ruling on
an emotional issue that pitted the hopes of some
desperate patients against the statistics of clinical trials.

Many breast cancer specialists say that Avastin does appear to work very well for some patients,
and some advocates have said the drug should be left on the
market for the sake of those patients.

How Spread Out is the Distribution?

Variation matters, and Statistics is about variation.

Are the values of the distribution tightly clustered around
the center or more spread out?

The range of the data is the difference between the maximum and minimum values:
Range = max - min

A disadvantage of the range is that a single extreme value can make it very large and, thus,
not representative of the data overall.
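
A short Python sketch of the range, applied to the executive salaries from earlier in these notes (in thousands of dollars), illustrates that sensitivity. The function name `data_range` is an illustrative choice.

```python
def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)


salaries = [890, 1110, 1460, 1420, 2000, 1430, 1520, 1110, 2400, 1680]
with_bonus = [6460 if s == 1460 else s for s in salaries]

print(data_range(salaries))    # 1510
print(data_range(with_bonus))  # 5570
```

Changing a single salary more than triples the range, even though the other nine values are untouched.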

Quartiles

Quartiles

Percentiles (Quantiles)
The first quartile and third quartile are two particular percentiles:
the 25th percentile and the 75th percentile, respectively.

The Interquartile Range (Fourth Spread)

The interquartile range (IQR) lets us ignore extreme data values and concentrate on the
middle of the data.

To find the IQR, we first need to know what quartiles are.

The difference between the quartiles is the interquartile range (IQR), so
IQR = upper quartile - lower quartile

The Interquartile Range (Fourth Spread)

The lower and upper quartiles are the 25th and 75th
percentiles of the data, so the IQR contains the middle 50% of the values of the
distribution, as shown in the figure:

Example
Find Lower Quartile(Q1) and Upper Quartile(Q3):
Data: 850, 900, 1400, 1200, 1050, 1000, 750, 1250, 1050, 565
Order dataset: 565, 750, 850, 900, 1000, 1050, 1050, 1200, 1250, 1400
Q1: use left part of the data 565, 750, 850, 900, 1000
median of this part = Q1 = 850

Q3: use right part of the data 1050, 1050, 1200, 1250, 1400
median of this part = Q3 = 1200
IQR = 1200 - 850 = 350
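
The "median of each half" procedure used in this example can be sketched in Python. The function names are illustrative. The example has an even number of values, so the treatment of an odd number of values here (leaving the overall median out of both halves) is an assumption, not something the notes specify. Statistical software may use other quartile definitions and give slightly different results.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves (needs at least 2 values)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]   # left part of the ordered data
    upper = ordered[-half:]  # right part of the ordered data
    return median(lower), median(upper)


data = [850, 900, 1400, 1200, 1050, 1000, 750, 1250, 1050, 565]
q1, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q3, iqr)  # 850 1200 350
```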

Box Plot

The Five-Number Summary

Example: Systolic Blood Pressure

Max      200
Q3       138
Median   132
Q1       121
Min      108

The five-number summary of a distribution reports its median, quartiles, and extremes
(maximum and minimum).

Systolic Blood Pressure (SBP) - BOXPLOT

An outlier is an observation that is located more than 1.5*IQR
from the closest quartile (Q1 or Q3). An outlier is extreme if it is
more than 3*IQR from the closest quartile (Q1 or Q3).

[Figure: boxplot of SBP (axis from 100 to 220) marking Q1, the median, and Q3. Lower marker: Q1-1.5*IQR or the min of the data. Upper markers: Q3+1.5*IQR or the max of the data, and Q3+3*IQR or the max of the data.]
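
The outlier rule can be sketched in Python using the quartiles from the systolic blood pressure five-number summary (Q1 = 121, Q3 = 138). The function name `classify` and the value 170 are hypothetical, added only for illustration.

```python
def classify(value, q1, q3):
    """Label a value using the 1.5*IQR and 3*IQR rules."""
    iqr = q3 - q1
    # Distance beyond the closest quartile (zero when the value lies between Q1 and Q3).
    distance = max(q1 - value, value - q3, 0)
    if distance > 3 * iqr:
        return "extreme outlier"
    if distance > 1.5 * iqr:
        return "outlier"
    return "not an outlier"


q1, q3 = 121, 138             # from the five-number summary, so IQR = 17
print(classify(200, q1, q3))  # extreme outlier (the maximum: 62 above Q3, more than 3*17 = 51)
print(classify(170, q1, q3))  # outlier (hypothetical value: 32 above Q3, more than 25.5)
print(classify(108, q1, q3))  # not an outlier (the minimum: 13 below Q1)
```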

Comparing Histograms and Boxplots


Compare the histogram and boxplot for daily wind speeds:

Comparing Groups - Internal Radiation Exposure After the Fukushima Nuclear Power Plant Disaster

Comparing Groups - Relationship of Collegiate Football Experience and Concussion With
Hippocampal Volume and Cognitive Outcomes

Alternative Measure of Spread: The Standard Deviation

A more powerful measure of spread than the IQR is the standard deviation, which takes into
account how far each data value is from the mean.

A deviation is the distance that a data value is from the mean.
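
In symbols, the sample variance and sample standard deviation built from these deviations are usually written as follows. The notation is the standard one and is an assumption here; the divisor n - 1 matches the one used in the metabolic rates example in these notes.

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},
\qquad
s = \sqrt{s^2}
```

Here each x_i is a data value, x̄ is the sample mean, and n is the sample size.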

Example: Metabolic Rates

The following data consist of the metabolic rates (cal./24hr.)
of 7 men from a dieting study:
1792, 1666, 1362, 1614, 1460, 1867, 1439

First, compute the sample mean:

$$\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11{,}200}{7} = 1600$$

Example: Metabolic Rates

Observations   Deviations            Squared deviations
1792           1792 - 1600 = 192     (192)^2 = 36,864
1666           1666 - 1600 = 66      (66)^2 = 4,356
1362           1362 - 1600 = -238    (-238)^2 = 56,644
1614           1614 - 1600 = 14      (14)^2 = 196
1460           1460 - 1600 = -140    (-140)^2 = 19,600
1867           1867 - 1600 = 267     (267)^2 = 71,289
1439           1439 - 1600 = -161    (-161)^2 = 25,921
               sum = 0               sum = 214,870

Example: Metabolic Rates

$$s^2 = \frac{214{,}870}{7 - 1} = 35{,}811.67$$

$$s = \sqrt{35{,}811.67} = 189.24 \text{ calories}$$
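
The same calculation can be sketched in Python, following the steps of the example: mean, deviations, squared deviations, then division by n - 1. The function name `sample_variance` is an illustrative choice.

```python
import math

# Metabolic rates (cal./24hr.) of the 7 men in the example.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


variance = sample_variance(rates)
sd = math.sqrt(variance)
print(round(variance, 2))  # 35811.67
print(round(sd, 2))        # 189.24
```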

Short Cut Method for Calculating Variance
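
The formula for this slide did not survive in these notes. A common shortcut, assumed here to be the one intended, avoids computing individual deviations: s^2 = (sum of x^2 - (sum of x)^2 / n) / (n - 1). The sketch below applies it to the metabolic rates data; the function name `shortcut_variance` is hypothetical.

```python
def shortcut_variance(values):
    """Sample variance from the sum of the values and the sum of their squares."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
print(sum(x * x for x in rates))           # 18134870
print(round(shortcut_variance(rates), 2))  # 35811.67
```

Here the sum of squares is 18,134,870 and (11,200)^2 / 7 is 17,920,000. Their difference is 214,870, the same sum of squared deviations found in the worked example.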

SAS Windowing Environment

SAS - PROC UNIVARIATE


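ods graphics on;
title "Demonstrating PROC UNIVARIATE";
/* Assumes the libref example is already assigned */
proc univariate data=example.Blood_Pressure;
   id Subj;                              /* label extreme observations by subject */
   var SBP DBP;                          /* analysis variables */
   histogram;                            /* one histogram per analysis variable */
   probplot / normal(mu=est sigma=est);  /* normal probability plot, parameters estimated from the data */
run;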

PROC UNIVARIATE - HISTOGRAM

SAS - PROC UNIVARIATE


title "Demonstrating MIDPOINTS= Histogram Option";
proc univariate data=example.Blood_Pressure;
   id Subj;
   var SBP;
   /* Bins centered at 100, 105, ..., 170, with a fitted normal curve overlaid */
   histogram / normal midpoints=100 to 170 by 5;
   probplot / normal(mu=est sigma=est);
run;

PROC UNIVARIATE - HISTOGRAM
