
DATAENG Lesson 6a Descriptive Statistics (Self Study) Handout

Uploaded by

kitsgrageda

6/26/2020

DATAENG
(Engineering Data Analysis)

Lesson 6a: Descriptive Statistics


(For Self Study – NO NEED TO BE DISCUSSED IN CLASS)
Numerical Summaries of Data
Stem-and-Leaf Diagrams
Frequency Distributions and Histograms
Box Plots
Time Sequence Plots
Probability Plots

Copyright © 2014 John Wiley & Sons, Inc. All rights reserved.

Applied Statistics and Probability for Engineers
Sixth Edition
Douglas C. Montgomery, George C. Runger

Chapter 6
Descriptive Statistics


Numerical Summaries of Data


• Data are the numeric observations of a
phenomenon of interest. The totality of all
observations is a population. A portion used
for analysis is a sample; when it is selected
at random, it is a random sample.
• We gain an understanding of this collection,
possibly massive, by describing it numerically
and graphically, usually with the sample data.
• We describe the collection in terms of shape,
outliers, center, and spread (SOCS).
• The center is measured by the mean.
• The spread is measured by the variance.

Sec 6-1 Numerical Summaries of Data

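Sample Mean
If the n observations in a sample are denoted by x1, x2, …, xn, the sample mean is
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$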


Example 6-1: Sample Mean


Consider 8 observations (xi) of pull-off force from
engine connectors as shown in the table.

$$\bar{x} = \frac{\sum_{i=1}^{8} x_i}{8} = \frac{12.6 + 12.9 + \cdots + 13.1}{8} = \frac{104}{8} = 13.0 \text{ pounds}$$

   i       xi
   1      12.6
   2      12.9
   3      13.4
   4      12.3
   5      13.6
   6      13.5
   7      12.6
   8      13.1
average   13.00   = AVERAGE($B2:$B9)

Figure 6-1 The sample mean is the balance point.
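
As a quick check of Example 6-1, the sample mean can be reproduced in a few lines of Python. This is only a minimal sketch; the variable names are illustrative and are not part of the original handout.

```python
# Pull-off force observations (pounds) from Example 6-1.
pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

n = len(pull_off_force)
sample_mean = sum(pull_off_force) / n  # x-bar = (sum of x_i) / n

print(f"n = {n}, sample mean = {sample_mean:.2f} pounds")
```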



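Variance Defined
If the n observations in a sample are denoted by x1, x2, …, xn, the sample variance is
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
For a finite population of N values with mean μ, the population variance is
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$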


Standard Deviation Defined


• The standard deviation is the square root
of the variance.
• σ is the population standard deviation
symbol.
• s is the sample standard deviation symbol.


Example 6-2: Sample Variance


Table 6-1 displays the quantities needed to calculate the
sample variance and sample standard deviation.

   i      xi     xi - xbar   (xi - xbar)^2
   1     12.6      -0.4          0.16
   2     12.9      -0.1          0.01
   3     13.4       0.4          0.16
   4     12.3      -0.7          0.49
   5     13.6       0.6          0.36
   6     13.5       0.5          0.25
   7     12.6      -0.4          0.16
   8     13.1       0.1          0.01
 sums   104.00      0.0          1.60

xbar = 104.00 / 8 = 13.00
variance = 1.60 / 7 = 0.2286
standard deviation = 0.48

Dimension of:
xi is pounds.
Mean is pounds.
Variance is pounds².
Standard deviation is pounds.

Desired accuracy is generally accepted to be one more place
than the data.

Table 6-1
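
The columns of Table 6-1 map directly onto code. The sketch below follows the definitional calculation (deviations from the mean, squared, summed, divided by n - 1); the variable names are illustrative assumptions.

```python
import math

# Pull-off force observations (pounds) from Example 6-1.
pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]

n = len(pull_off_force)
x_bar = sum(pull_off_force) / n

# Deviations from the mean; they sum to zero up to rounding error.
deviations = [x - x_bar for x in pull_off_force]
sum_sq = sum(d ** 2 for d in deviations)

sample_variance = sum_sq / (n - 1)   # divide by n - 1, not n
sample_std = math.sqrt(sample_variance)

print(f"sum of squared deviations = {sum_sq:.2f}")
print(f"s^2 = {sample_variance:.4f} pounds^2, s = {sample_std:.2f} pounds")
```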


Computation of s²
The prior calculation is definitional and tedious. A
shortcut is derived here and involves just 2 sums.
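
The algebra behind the shortcut is brief. A sketch of the derivation, using only the definition of the sample variance and the identity that the sum of the x_i equals n times x-bar:

```latex
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n},
\qquad\text{so}\qquad
s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1}.
\end{aligned}
```

The two sums needed are therefore the sum of the observations and the sum of their squares.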


Example 6-3: Variance by Shortcut


$$s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2 / n}{n-1} = \frac{1{,}353.60 - (104.0)^2 / 8}{7} = \frac{1.60}{7} = 0.2286 \text{ pounds}^2$$

$$s = \sqrt{0.2286} = 0.48 \text{ pounds}$$

   i      xi       xi^2
   1     12.6     158.76
   2     12.9     166.41
   3     13.4     179.56
   4     12.3     151.29
   5     13.6     184.96
   6     13.5     182.25
   7     12.6     158.76
   8     13.1     171.61
 sums   104.0   1,353.60
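
The shortcut in Example 6-3 can be checked against a library routine. The sketch below compares it with `statistics.variance` from the Python standard library, which also uses the n - 1 divisor; the variable names are illustrative.

```python
import statistics

# Pull-off force observations (pounds) from Example 6-1.
pull_off_force = [12.6, 12.9, 13.4, 12.3, 13.6, 13.5, 12.6, 13.1]
n = len(pull_off_force)

sum_x = sum(pull_off_force)                      # about 104.0
sum_x_sq = sum(x ** 2 for x in pull_off_force)   # about 1,353.60

# Shortcut: only the two sums above are needed.
shortcut_variance = (sum_x_sq - sum_x ** 2 / n) / (n - 1)

# Reference value from the standard library (sample variance, n - 1 divisor).
library_variance = statistics.variance(pull_off_force)

print(f"shortcut s^2 = {shortcut_variance:.4f}")
print(f"library  s^2 = {library_variance:.4f}")
```

One caution on the design choice: the shortcut subtracts two large, nearly equal numbers, so in floating-point arithmetic it can lose precision when the mean is large relative to the spread. The definitional form is usually the safer one to program.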


What is this “n–1”?


• The population variance is calculated with N,
the population size. Why isn’t the sample
variance calculated with n, the sample size?
• The true variance is based on data deviations
from the true mean, μ.
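• The sample calculation is based on the data
deviations from x-bar, not μ. X-bar is an
estimator of μ; close but not the same.
Because x-bar is computed from the same data,
the squared deviations from x-bar tend to be
smaller than those from μ, so dividing by n
would underestimate the variance on average.
The n-1 divisor compensates for this.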
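
The effect of the divisor can be seen exactly on a toy case. The sketch below uses a hypothetical three-value population (not from the handout) and enumerates every equally likely sample of size 2 drawn with replacement, then averages the two candidate variance formulas over all samples.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3]  # hypothetical toy population, for illustration only
N = len(population)
mu = Fraction(sum(population), N)
sigma_sq = sum((x - mu) ** 2 for x in population) / N  # population variance


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample mean (exact arithmetic)."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample)


n = 2
samples = list(product(population, repeat=n))  # all 9 equally likely samples

# Average each estimator over every possible sample.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(f"population variance     = {sigma_sq}")
print(f"average with divisor n   = {avg_div_n}")
print(f"average with divisor n-1 = {avg_div_n_minus_1}")
```

In this example the n - 1 version averages to the population variance, while the n version falls short of it.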

Degrees of Freedom
• The sample variance is calculated with the
quantity n-1.
• This quantity is called the “degrees of
freedom”.
• Origin of the term:
– There are n deviations from x-bar in the sample.
– The sum of the deviations is zero.
– n-1 of the deviations can be freely determined,
but the nth deviation is fixed to maintain the
zero sum.


Sample Range
If the n observations in a sample are denoted
by x1, x2, …, xn, the sample range is:

r = max(xi) – min(xi) (6-6)

It is the largest observation in the sample minus
the smallest observation.

From Example 6-3:
r = 13.6 – 12.3 = 1.3 pounds
Note that: population range ≥ sample range


Stem-and-Leaf Diagrams
• Dot diagrams (dotplots) are useful for small
data sets. Stem & leaf diagrams are better
for large sets.
• Steps to construct a stem-and-leaf diagram:
1) Divide each number (xi) into two parts: a stem,
consisting of the leading digits, and a leaf,
consisting of the remaining digit.
2) List the stem values in a vertical column.
3) Record the leaf for each observation beside its
stem.
4) Write the units for the stems and leaves on the
display.

Sec 6-2 Stem-And-Leaf Diagrams

Example 6-4: Alloy Strength


To illustrate the construction of
a stem-and-leaf diagram,
consider the alloy compressive
strength data in Table 6-2.

Table 6-2 Compressive Strength (psi) of Aluminum-Lithium Specimens
105 221 183 186 121 181 180 143
97 154 153 174 120 168 167 141
245 228 174 199 181 158 176 110
163 131 154 115 160 208 158 133
207 180 190 193 194 133 156 123
134 178 76 167 184 135 229 146
218 157 101 171 165 172 158 169
199 151 142 163 145 171 148 158
160 175 149 87 160 237 150 135
196 201 200 176 150 170 118 149

Figure 6-4 Stem-and-leaf diagram for Table 6-2 data.
Center is about 155 and most of the data lie
between 110 and 200. Leaves are unordered.
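
The construction steps can be sketched in code for the Table 6-2 data. This is an illustrative sketch, not the handout's own procedure: it assumes non-negative integer data, takes the tens and hundreds digits as the stem and the ones digit as the leaf, and leaves the leaves unordered as in Figure 6-4. The function name is hypothetical.

```python
from collections import defaultdict

# Table 6-2: compressive strength (psi) of 80 aluminum-lithium specimens.
strengths = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]


def stem_and_leaf(values):
    """Group non-negative integers by stem (leading digits) and leaf (last digit)."""
    groups = defaultdict(list)
    for x in values:
        stem, leaf = divmod(x, 10)   # step 1: split each number
        groups[stem].append(leaf)    # step 3: record the leaf beside its stem
    return dict(groups)


diagram = stem_and_leaf(strengths)

# Step 2: list the stems in a vertical column, including any empty stems.
for stem in range(min(diagram), max(diagram) + 1):
    leaves = "".join(str(leaf) for leaf in diagram.get(stem, []))
    print(f"{stem:>3} | {leaves}")

# Step 4: state the units.
print("Stem: tens of psi. Leaf: ones of psi.")
```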

Quartiles
• The three quartiles partition the data into four equally sized counts
or segments.
– First or lower quartile : 25% of the data is less than q1.
– Second quartile : 50% of the data is less than q2, the median.
– Third or upper quartile : 75% of the data is less than q3.

• For the Table 6-2 data:

(The index is f(n + 1) with n = 80; the quartile is interpolated between the
ith and (i+1)th ordered values.)

  f      Index    ith value   (i+1)th value   Quartile
 0.25    20.25       143           145          143.50
 0.50    40.50       160           163          161.50
 0.75    60.75       181           181          181.00
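
The quartile table can be reproduced with a short function. The sketch below assumes the rule implied by the table's index column: locate position f(n + 1) in the ordered data and interpolate linearly between the neighboring ordered values. The function name is hypothetical. For the Table 6-2 data this rule gives a first quartile of 143.5, which agrees with the Q1 that Minitab reports for the same data.

```python
# Table 6-2: compressive strength (psi) of 80 aluminum-lithium specimens.
strengths = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]


def quantile_n_plus_1(sorted_data, f):
    """Quantile at fraction f, using index f * (n + 1) and linear interpolation."""
    n = len(sorted_data)
    index = f * (n + 1)                  # 1-based position, possibly fractional
    if not 1 <= index <= n:
        raise ValueError("f is too extreme for this sample size")
    i = int(index)                       # whole part: the i-th ordered value
    frac = index - i                     # fractional part used to interpolate
    lower = sorted_data[i - 1]
    upper = sorted_data[i] if i < n else lower   # the (i+1)-th ordered value
    return lower + frac * (upper - lower)


ordered = sorted(strengths)
q1 = quantile_n_plus_1(ordered, 0.25)
q2 = quantile_n_plus_1(ordered, 0.50)
q3 = quantile_n_plus_1(ordered, 0.75)

print(f"q1 = {q1:.2f}, median = {q2:.2f}, q3 = {q3:.2f}")
```

Other software may use a different interpolation rule, so quartiles from different packages can differ slightly.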


Percentiles and Interquartile Range


• Quartiles are a special case of percentiles: q1, q2, and q3 are the
25th, 50th, and 75th percentiles.
• Percentiles partition the data into 100 segments.

• The interquartile range (IQR) is defined as:
IQR = q3 – q1.
• For the Table 6-2 data (q1 = 143.50, q3 = 181.00):
IQR = 181.00 – 143.50 = 37.50
• Impact of outlier data:
– IQR is largely unaffected, because it depends only on the
middle 50% of the data.
– Range is directly affected.
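
A small experiment illustrates the point about outliers. The sketch below uses hypothetical data (not from the handout) and `statistics.quantiles` from the Python standard library with `method="exclusive"`, which interpolates at positions based on n + 1.

```python
import statistics


def iqr_and_range(values):
    """Return (IQR, range) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1, max(values) - min(values)


clean = [10, 12, 13, 15, 16, 18, 20]          # hypothetical data
with_outlier = [10, 12, 13, 15, 16, 18, 200]  # largest value replaced by an outlier

iqr_clean, range_clean = iqr_and_range(clean)
iqr_outlier, range_outlier = iqr_and_range(with_outlier)

print(f"clean:        IQR = {iqr_clean}, range = {range_clean}")
print(f"with outlier: IQR = {iqr_outlier}, range = {range_outlier}")
```

Here the range jumps from 10 to 190 while the IQR stays the same, because the outlier does not change which values sit at the quartile positions.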


Minitab Descriptives
• The Minitab selection menu:
Stat > Basic Statistics > Display Descriptive Statistics
calculates the descriptive statistics for a data
set.
• For the Table 6-2 data, Minitab produces:
Variable    N    Mean     StDev    Min     Q1       Median   Q3       Max
Strength    80   162.66   33.77    76.00   143.50   161.50   181.00   245.00

(Min, Q1, Median, Q3, and Max form the 5-number summary.)


Frequency Distributions
• A frequency distribution is a compact
summary of data, expressed as a table,
graph, or function.
• The data are gathered into bins or cells,
defined by class intervals.
• The number of classes, multiplied by the
class width, should exceed the range of the
data. The square root of the sample size is a
rough guide to the number of classes.
• The boundaries of the class intervals should
be convenient values, as should the class
width.

Sec 6-3 Frequency Distributions And Histograms

Frequency Distribution Table


Table 6-4 Frequency Distribution of Table 6-2 Data

                                 Relative    Cumulative Relative
Class             Frequency     Frequency        Frequency
 70 ≤ x <  90         2           0.0250           0.0250
 90 ≤ x < 110         3           0.0375           0.0625
110 ≤ x < 130         6           0.0750           0.1375
130 ≤ x < 150        14           0.1750           0.3125
150 ≤ x < 170        22           0.2750           0.5875
170 ≤ x < 190        17           0.2125           0.8000
190 ≤ x < 210        10           0.1250           0.9250
210 ≤ x < 230         4           0.0500           0.9750
230 ≤ x < 250         2           0.0250           1.0000
Total                80           1.0000

Considerations:
Range = 245 – 76 = 169
Sqrt(80) = 8.9
Trial class width = 169 / Sqrt(80) = 18.9

Decisions:
Number of classes = 9
Class width = 20
Range of classes = 20 * 9 = 180
Starting point = 70
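
The considerations and decisions above can be traced in code. The sketch below recomputes the range, the square-root guide, and the trial class width, then tallies the Table 6-2 data into the nine chosen classes; the variable names are illustrative.

```python
import math

# Table 6-2: compressive strength (psi) of 80 aluminum-lithium specimens.
strengths = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

n = len(strengths)
data_range = max(strengths) - min(strengths)   # 245 - 76
suggested_classes = math.sqrt(n)                # guide to the number of classes
trial_width = data_range / suggested_classes    # before rounding to a convenient value

# Decisions from the slide: 9 classes of width 20, starting at 70.
start, width, num_classes = 70, 20, 9

counts = [0] * num_classes
for x in strengths:
    counts[(x - start) // width] += 1           # class with lower <= x < upper

cumulative = 0.0
for k, count in enumerate(counts):
    lower = start + k * width
    relative = count / n
    cumulative += relative
    print(f"{lower:>3} <= x < {lower + width:<3}  {count:>2}  "
          f"{relative:.4f}  {cumulative:.4f}")
```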


Histograms
• A histogram is a visual display of a frequency
distribution, similar to a bar chart or a stem-and-leaf
diagram.

• Steps to construct a histogram with equal bin widths:
1) Label the bin boundaries on the horizontal scale.
2) Mark & label the vertical scale with the frequencies
or relative frequencies.
3) Above each bin, draw a rectangle whose height is
equal to the frequency (or relative frequency)
corresponding to that bin.


Histogram of the Table 6-2 Data

Figure 6-7 Histogram of compressive strength of 80 aluminum-lithium alloy
specimens. Note these features: (1) horizontal scale bin boundaries & labels with
units, (2) vertical scale measurements and labels, (3) histogram title at top or in
legend.


Histograms with Unequal Bin Widths


• If the data is tightly clustered in some
regions and scattered in others, it is
visually helpful to use narrow class widths
in the clustered region and wide class
widths in the scattered areas.
• In this approach, the rectangle area, not
the height, must be proportional to the
class frequency.
$$\text{Rectangle height} = \frac{\text{bin frequency}}{\text{bin width}}$$
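
The height rule can be sketched as a small helper. The example regroups the Table 6-4 counts into unequal bins by merging the two lowest classes and the three highest classes; this regrouping is an illustrative assumption and does not come from the slides, and the function name is hypothetical.

```python
def density_heights(edges, frequencies):
    """Rectangle heights = bin frequency / bin width, so area tracks frequency."""
    if len(edges) != len(frequencies) + 1:
        raise ValueError("need exactly one more edge than frequencies")
    return [f / (hi - lo) for f, lo, hi in zip(frequencies, edges, edges[1:])]


# Table 6-4 counts regrouped into unequal bins (illustrative regrouping):
# 70-110 holds 2 + 3, and 190-250 holds 10 + 4 + 2.
edges = [70, 110, 130, 150, 170, 190, 250]
frequencies = [5, 6, 14, 22, 17, 16]

heights = density_heights(edges, frequencies)
areas = [h * (hi - lo) for h, lo, hi in zip(heights, edges, edges[1:])]

for lo, hi, h in zip(edges, edges[1:], heights):
    print(f"{lo:>3} to {hi:<3}  height = {h:.3f}")
```

Each rectangle's area equals its bin frequency, so the wide end bins are not visually exaggerated.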


Poor Choices in Drawing Histograms

Figure 6-8 Histogram of compressive strength of 80 aluminum-lithium
alloy specimens. Errors: too many bins (17) create a
jagged shape, horizontal scale not at class boundaries,
horizontal axis label does not include units.

Cumulative Frequency Plot

Figure 6-10 Cumulative histogram of compressive strength of 80
aluminum-lithium alloy specimens. Comment: Easy to see
cumulative probabilities, hard to see distribution shape.


Shape of a Frequency Distribution

Figure 6-11 Histograms of symmetric and skewed distributions.

(b) A symmetric, unimodal distribution has identical mean, median, and mode.

(a & c) Skewed distributions are positive or negative, depending on the
direction of the long tail. Their measures occur in alphabetical order
(mean, median, mode) as the distribution is approached from the long tail.


Histograms for Categorical Data


• Categorical data is of two types:
– Ordinal: categories have a natural order, e.g.,
year in college, military rank.
– Nominal: Categories are simply different, e.g.,
gender, colors.
• Histogram bars are for each category, are of
equal width, and have a height equal to the
category’s frequency or relative frequency.
• A Pareto chart is a histogram in which the
categories are sequenced in decreasing
order of frequency. This approach emphasizes
the most and least important categories.


Example 6-6: Categorical Data Histogram

Figure 6-12 Airplane production in 1985. (Source: Boeing
Company) Comment: Illustrates nominal data in spite of the
numerical names; categories are shown at the bin’s midpoint; it is a
Pareto chart since the categories are in decreasing order.


Box Plot or Box-and-Whisker Chart


• A box plot is a graphical display showing center,
spread, shape, and outliers (SOCS).
• It displays the 5-number summary: min, q1,
median, q3, and max.

Figure 6-13 Description of a box plot.


Sec 6-4 Box Plots

Box Plot of Table 6-2 Data

Figure 6-14 Box plot of compressive strength of 80 aluminum-lithium
alloy specimens. Comment: A box plot may be shown vertically or
horizontally; the data reveal three outliers and no extreme outliers.
Lower outlier limit is: 143.5 – 1.5*(181.0 – 143.5) = 87.25.
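
The outlier count in the comment can be verified directly. The sketch below takes the quartiles reported by Minitab for the Table 6-2 data and applies limits at 1.5 IQR beyond the box edges. The 3 IQR limit used for extreme outliers is stated here as an assumption about the convention in use, since the handout text does not spell it out.

```python
# Table 6-2: compressive strength (psi) of 80 aluminum-lithium specimens.
strengths = [
    105, 221, 183, 186, 121, 181, 180, 143,
    97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149, 87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

q1, q3 = 143.5, 181.0   # quartiles reported by Minitab for these data
iqr = q3 - q1

# Outlier limits: 1.5 IQR beyond the box edges.
inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
# Extreme outlier limits: 3 IQR beyond the box edges (assumed convention).
outer_low, outer_high = q1 - 3.0 * iqr, q3 + 3.0 * iqr

outliers = sorted(x for x in strengths if x < inner_low or x > inner_high)
extreme = sorted(x for x in strengths if x < outer_low or x > outer_high)

print(f"outlier limits: {inner_low} to {inner_high}")
print(f"outliers: {outliers}")
print(f"extreme outliers: {extreme}")
```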

Time Sequence Plots


• A time series plot shows the data value, or statistic, on
the vertical axis with time on the horizontal axis.
• A time series plot reveals trends, cycles or other time-oriented
behavior that could not otherwise be seen in the data.

Figure 6-16 Company sales (a) by year and (b) by quarter.

Sec 6-5 Time Sequence Plots

Digidot Plot
Combining a time series plot with some of the other graphical displays
considered previously is sometimes very helpful. The stem-and-leaf plot
combined with a time series plot forms a digidot plot.

Figure 6-17 A digidot plot of the compressive strength data in Table 6-2.


Constructing a Probability Plot


• To construct a probability plot:
– Sort the data observations in ascending order:
x(1), x(2),…, x(n).
– The observed value x(j) is plotted against the
observed cumulative frequency (j – 0.5)/n.
– The paired numbers are plotted on the
probability paper of the proposed distribution.
• If the paired numbers form a straight line,
then the hypothesized distribution
adequately describes the data.

Sec 6-6 Probability Plots

Example 6-7: Battery Life


The effective service life (Xj in minutes) of batteries used in a laptop is given in
the table. We hypothesize that battery life is adequately modeled by a normal
distribution. To test this hypothesis, first arrange the observations in ascending
order, calculate their cumulative frequencies, and plot them.

Table 6-6 Calculations for Constructing a Normal Probability Plot

  j    x(j)   (j - 0.5)/10   100(j - 0.5)/10
  1    176       0.05               5
  2    183       0.15              15
  3    185       0.25              25
  4    190       0.35              35
  5    191       0.45              45
  6    192       0.55              55
  7    201       0.65              65
  8    205       0.75              75
  9    214       0.85              85
 10    220       0.95              95

Figure 6-22 Normal probability plot for battery life.


Probability Plot on Standardized Normal Scores


A normal probability plot can be plotted on ordinary axes using z-values. The
normal probability scale is not used.

Table 6-6 Calculations for Constructing a Normal Probability Plot

  j    x(j)   (j - 0.5)/10     zj
  1    176       0.05        -1.64
  2    183       0.15        -1.04
  3    185       0.25        -0.67
  4    190       0.35        -0.39
  5    191       0.45        -0.13
  6    192       0.55         0.13
  7    201       0.65         0.39
  8    205       0.75         0.67
  9    214       0.85         1.04
 10    220       0.95         1.64

Figure 6-23 Normal probability plot obtained
from standardized normal scores. This is
equivalent to Figure 6-22.
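
The z-values in the table are standard normal quantiles of the cumulative frequencies (j - 0.5)/n. A minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later) reproduces them; the variable names are illustrative.

```python
from statistics import NormalDist

# Battery life observations (minutes) from Example 6-7.
battery_life = [176, 183, 185, 190, 191, 192, 201, 205, 214, 220]

ordered = sorted(battery_life)      # x(1) <= x(2) <= ... <= x(n)
n = len(ordered)
standard_normal = NormalDist()      # mean 0, standard deviation 1

rows = []
for j, x in enumerate(ordered, start=1):
    p = (j - 0.5) / n                    # cumulative frequency for x(j)
    z = standard_normal.inv_cdf(p)       # standardized normal score z_j
    rows.append((j, x, p, round(z, 2)))

for j, x, p, z in rows:
    print(f"{j:>2}  {x}  {p:.2f}  {z:>5.2f}")
```

Plotting z_j against x(j) on ordinary axes and checking for an approximately straight line is then equivalent to using normal probability paper.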

Probability Plot Variations

Figure 6-24 Normal probability plots indicating a non-normal distribution.


(a) Light tailed distribution
(b) Heavy tailed distribution
(c) Right skewed distribution


Probability Plots with Minitab


• Obtained using the Minitab menu: Graph > Probability Plot. 14 different
distributions can be used.
• The curved bands provide guidance on whether the proposed distribution is
acceptable: if all observations fall within the bands, the fit is acceptable.

Figure: Probability Plot of Battery Life (Normal - 95% CI). Vertical axis: Percent;
horizontal axis: Battery Life (x) in Hours.
Legend: Mean 195.7, StDev 14.03, N 10, AD 0.257, P-Value 0.636.


Important Terms & Concepts of Chapter 6


Box plot
Frequency distribution & histogram
Median, quartiles & percentiles
Multivariate data
Normal probability plot
Pareto chart
Population: Mean, Standard deviation, Variance
Probability plot
Relative frequency distribution
Sample: Mean, Standard deviation, Variance
Stem-and-leaf diagram
Time series plots