Methods and Principles of Statistical Analysis: 2.1 Recommended Textbooks On Statistics
Standard deviation (SD):

$$\mathrm{SD} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$
If we exclude the extreme pH value in sample 10 (regarded as an outlier), then
the new mean of our remaining eight data points on pH is estimated to be 4.28 and
the standard deviation is estimated as
$$\mathrm{SD} = \sqrt{\frac{(4.41-4.28)^2 + (4.21-4.28)^2 + \dots + (4.22-4.28)^2}{8-1}} = 0.09$$
Fortunately, most spreadsheets, such as Excel or OpenOffice, and statistical software
can estimate standard deviations and other statistics efficiently, which lessens the
need to know the exact estimation formulas and computing techniques. If we assume
that our data are more or less normally distributed, then the interval within one
standard deviation of the mean will contain approximately 68% of our data, and the
interval within two standard deviations of the mean will contain approximately 95%.
This is the main reason why continuous data are often described using the mean and
standard deviation.
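As a minimal sketch of the estimation formula, the following Python snippet computes the sample standard deviation (dividing by n − 1) and compares it with the standard library function statistics.stdev. The pH values are hypothetical and made up for illustration; they are not the data set used in the text.

```python
import math
import statistics

# Hypothetical pH measurements (illustration only, not the book's data).
ph_values = [4.41, 4.21, 4.35, 4.18, 4.30, 4.25, 4.32, 4.22]


def sample_sd(values):
    """Sample standard deviation with the n - 1 denominator."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


sd_by_hand = sample_sd(ph_values)
sd_library = statistics.stdev(ph_values)  # also uses the n - 1 denominator

print(f"mean = {statistics.mean(ph_values):.2f}")
print(f"SD (formula) = {sd_by_hand:.3f}, SD (statistics.stdev) = {sd_library:.3f}")
```

Both values should agree, which is a quick way to check that a hand-written formula matches what the software reports.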
If a data set has a skewed distribution or contains many outliers or extreme values,
it is more common to describe the data using the median, with the spread represented
by the minimum and maximum values. To reduce the effect of extreme values, the
so-called interquartile range is an alternative measure of spread in data. It is equal
to the difference between the third and first quartiles. It can be found by ranking
all the observations in ascending order. For the sake of simplicity, let us assume one
has 100 observations. The lower boundary of the interquartile range is at the border
of the first 25% of observations (in this example, observation 25 in the ranking).
The upper boundary of the interquartile range is at the border of the first 75% of
observations (in this example, observation 75 in the ranking).
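The ranking procedure for 100 observations can be sketched in a few lines of Python. The data are hypothetical (simply the integers 1 to 100 in shuffled order), and the quartiles are read off at ranks 25 and 75 exactly as in the simplified description. Statistical software typically interpolates between neighboring observations, so its quartiles may differ slightly.

```python
import random

# Hypothetical data: the integers 1..100 in random order (illustration only).
random.seed(1)
observations = list(range(1, 101))
random.shuffle(observations)

ranked = sorted(observations)      # rank the observations in ascending order

first_quartile = ranked[25 - 1]    # observation 25 (Python indexes from 0)
third_quartile = ranked[75 - 1]    # observation 75
interquartile_range = third_quartile - first_quartile

print(first_quartile, third_quartile, interquartile_range)
```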
2.4 Descriptive Plots
The adage "a picture is worth a thousand words" refers to the idea that a complex
idea can be conveyed with just a single still image. Some attribute this quote to
the Emperor Napoleon Bonaparte, who allegedly said, "Un bon croquis vaut mieux
qu'un long discours" (a good sketch is better than a long speech). We might venture
to rephrase Napoleon to describe data: "A good plot is worth more than a thousand
data." Plots are very useful for describing the properties of data. It is recommended
that these be explored before further formal statistical analysis is conducted. Some
examples of descriptive plots are given in Fig. 2.1.

Fig. 2.1 Typically used descriptive plots. The plots are (a) bar chart, (b) box plot, (c) line plot,
(d) histogram, and (e) scatterplot. All plots were made using data in Box 5.1 in Chap. 5
(Figure with five panels, a to e; vertical axis label: "Measured value".)
2.4.1 Bar Chart
A bar chart or bar graph is a chart with rectangular bars whose lengths are proportional
to the values they represent. The bars can be plotted vertically or horizontally. For
categorical data the length of the bar is usually the number of observations or the
percentage distribution, and for discrete or continuous data the length of the bar is
usually the mean or median, with error lines sometimes representing the variation
expressed as, for example, the standard deviation or minimum and maximum values. Bar
charts are very useful for presenting data in a comprehensible way to a nonstatistical
audience and are therefore often used in the mass media to describe data.
2.4.2 Histograms
Sometimes it is useful to know more about the exact spread and distribution of a
data set. Are there many outliers, or are the data evenly spread out? To know more
about this, one could make a histogram, which is a simple graphical way of presenting
a complete set of observations in which the number (or percentage frequency) of
observations is plotted for intervals of values.
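The counting behind a histogram can be sketched as follows. The measurements and the interval width of 0.5 are hypothetical choices made for illustration; plotting software normally selects the intervals automatically.

```python
from collections import Counter

# Hypothetical measurements (illustration only).
measurements = [4.1, 4.3, 4.4, 4.6, 4.7, 4.7, 4.9, 5.2, 5.3, 6.8]
bin_width = 0.5  # assumed interval width


def bin_start(value, width):
    """Lower edge of the interval [start, start + width) containing value."""
    return (value // width) * width


counts = Counter(bin_start(x, bin_width) for x in measurements)

# Text histogram: one '#' per observation in each non-empty interval.
for start in sorted(counts):
    print(f"{start:.1f} to {start + bin_width:.1f}: {'#' * counts[start]}")
```

Empty intervals are simply omitted in this sketch. The single observation far from the others (6.8) shows up as an isolated bar, which is the kind of feature a histogram is meant to reveal.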
2.4.3 Box Plots
A box plot (also known as a box-and-whisker diagram) is a very efficient way of
describing numerical data. It is often used in applied statistical analysis but is not as
intuitive for nonstatistical readers. The plot is based on a five-number summary of a
data set: the smallest observation (minimum), the lower quartile (cutoff value of the
lowest 25% of observations if ranked in ascending order), the median, the upper
quartile (cutoff value of the lowest 75% of observations if ranked in ascending order),
and the largest observation (maximum). Often the whiskers instead indicate the
2.5% and 97.5% values, with outliers and extreme values indicated by individual
dots. Box plots provide more information about the distribution than bar charts.
If the line indicating the median is not in the middle of the box, then this is usually
a sign of a skewed distribution.
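The five-number summary behind a box plot can be computed with Python's standard library, as in the sketch below. The values are hypothetical and deliberately right-skewed. Note that statistics.quantiles interpolates between observations by default, so the quartiles may differ slightly from those of other software.

```python
import statistics

# Hypothetical, right-skewed measurements (illustration only).
values = [2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.6, 4.9, 7.2]

# With n=4, statistics.quantiles returns the three quartile cut points.
lower_quartile, median, upper_quartile = statistics.quantiles(values, n=4)

five_number_summary = {
    "minimum": min(values),
    "lower quartile": lower_quartile,
    "median": median,
    "upper quartile": upper_quartile,
    "maximum": max(values),
}

for name, value in five_number_summary.items():
    print(f"{name}: {value:.2f}")

# A median that is off-center within the box hints at a skewed distribution.
box_center = (lower_quartile + upper_quartile) / 2
print("median below center of box:", median < box_center)
```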
2.4.4 Scatterplots
Scatterplots are very useful for displaying the relationship between two numerical
variables. These plots are also sometimes called XY-scatter or XY-plots in certain
software. A scatterplot is a simple graph in which the values of one variable are
plotted against those of the other. These plots are often the first step in the statistical
analysis of the correlation between variables and subsequent regression analysis.
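As a sketch of the correlation analysis that often follows a scatterplot, the snippet below computes the Pearson correlation coefficient directly from its definition. The paired measurements are hypothetical and chosen to be nearly linear.

```python
import math

# Hypothetical paired measurements (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 3.8, 5.2, 5.9, 7.1]


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return covariation / (spread_x * spread_y)


r = pearson_correlation(x, y)
print(f"Pearson r = {r:.3f}")
```

A value close to 1 or −1 corresponds to points lying close to a straight line in the scatterplot; a value near 0 indicates no linear relationship.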
2.4.5 Line Plots
A line plot or graph displays information as a series of data points connected by
lines. Depending on what is to be illustrated, the data points can be single observations
or statistical estimates such as the mean, median, or sum. As with the bar chart,
vertical lines representing data variation, for example the standard deviation, may
then be added. Line plots are often used if one is dealing with repeated measurements
over a given time span.
2.5 Statistical Inference (the p-Value Stuff)
Descriptive statistics are used to present and summarize findings. This may form the
basis for decision making and conclusions in, for example, scientific and academic
reports, recommendations to governmental agencies, or advice for industrial production
and food development. However, what if the findings were just due to a
coincidence? If the experiment were repeated and new data collected, a different
conclusion might be reached. Statistical methods are needed to assess
whether findings are due to randomness and coincidence or are representative of the
true or underlying effect. One set of tools is called statistical tests (or inference)
and forms the basis of p-values and confidence intervals.
The basis is a hypothesis that could be rejected in relation to an alternative hypothesis
given certain conditions. In statistical sciences these hypotheses are known as the null
hypothesis (typically a conservative hypothesis of no real difference between samples,
no correlation, etc.) and the alternative hypothesis (i.e., that the null hypothesis is not in
reality true). The principle is to assume that the null hypothesis is true. Methods based
on mathematical statistics have been developed to estimate the probability of outcomes
that are at least as extreme as the observed outcomes, given the assumption that the null
hypothesis is true. This probability is the well-known p-value. If this probability is small
(typically less than 5%), then the null hypothesis is typically rejected in favor of the
alternative hypothesis. The threshold below which a p-value leads to rejection of the
null hypothesis is called the significance level (often denoted α).
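As a small worked illustration of this definition, the sketch below computes an exact two-sided p-value for a hypothetical experiment: a coin is tossed 20 times and lands heads 16 times, and the null hypothesis is that the coin is fair. The scenario and the numbers are assumptions chosen for illustration, not an example from the text.

```python
from math import comb

n_tosses = 20        # hypothetical number of tosses
observed_heads = 16  # hypothetical observed outcome
alpha = 0.05         # significance level


def binomial_probability(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Outcomes at least as extreme as the observed one, in either direction:
# at least 16 heads or at most 4 heads (symmetric under the null hypothesis).
distance = abs(observed_heads - n_tosses / 2)
p_value = sum(
    binomial_probability(k, n_tosses)
    for k in range(n_tosses + 1)
    if abs(k - n_tosses / 2) >= distance
)

print(f"p-value = {p_value:.4f}")
print("reject null hypothesis" if p_value < alpha else "do not reject null hypothesis")
```

Under these assumptions the p-value is about 0.012, which is below the 5% significance level, so the null hypothesis of a fair coin would be rejected.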
The relationship between the (unknown) reality, i.e., whether the null hypothesis is
true or not, and the decision to accept or reject the null hypothesis is shown in Table 2.3.
Two types of error can be made: Type I and Type II errors. The significance level α
is the probability of a Type I error and is typically set low (e.g., 5%) because Type I
errors, from a methodological point of view, are regarded as being more serious than
Type II errors. The null hypothesis is usually very conservative and assumes, for
example, no difference between groups or no correlation. The probability of a Type II
error is denoted by β. The statistical power is the ability of a test to detect a true
effect, i.e., to reject the null hypothesis if the alternative hypothesis is true. Thus,
it is the complement of the Type II error probability and consequently equal to 1 − β.
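The two error rates can be illustrated by simulation. The sketch below repeats a hypothetical experiment many times, once with the null hypothesis true and once with a true effect, and counts how often the null hypothesis is rejected. The test used (a two-sided z-test with known standard deviation 1), the sample size, the effect size of 0.5, and the number of simulated experiments are all assumptions chosen for simplicity.

```python
import random
from statistics import NormalDist, mean

random.seed(42)
alpha = 0.05
sample_size = 25
n_experiments = 2000
standard_normal = NormalDist()


def z_test_p_value(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known SD of 1."""
    z = mean(sample) * len(sample) ** 0.5
    return 2 * (1 - standard_normal.cdf(abs(z)))


def rejection_rate(true_mean):
    """Share of simulated experiments in which H0 is rejected."""
    rejections = 0
    for _ in range(n_experiments):
        sample = [random.gauss(true_mean, 1) for _ in range(sample_size)]
        if z_test_p_value(sample) < alpha:
            rejections += 1
    return rejections / n_experiments


type_1_error_rate = rejection_rate(true_mean=0.0)  # H0 true: should be near alpha
power = rejection_rate(true_mean=0.5)              # H1 true: estimate of 1 - beta

print(f"Type I error rate = {type_1_error_rate:.3f}")
print(f"power = {power:.3f}, Type II error rate = {1 - power:.3f}")
```

When the null hypothesis is true, the rejection rate should land close to the chosen significance level; when a true effect exists, the rejection rate estimates the power of the test.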
2.6 Overview of Classical Statistical Tests
Classical statistical tests are pervasive in research literature. More complex and general
statistical models can often express the same information as these tests. Table 2.4
presents a list of some common statistical tests. It goes beyond the scope of this
brief text to explain the statistical and mathematical foundations of these tests, but
they are covered in several of the recommended textbooks. Modern software often
has menu-based dialogs to help one determine the correct test. However, a basic
understanding of their properties is still important.
2.7 Overview of Statistical Models
Generally speaking, so-called linear statistical models state that your outcome of
interest (or a mathematical transformation of it) can be predicted by a linear combination
of explanatory variables, each of which is multiplied by a parameter (sometimes
called a coefficient and often denoted β). To avoid having the outcome be
estimated as zero if all explanatory variables are zero, a constant intercept (often
denoted β₀) is included. The outcome variable of interest is often called the dependent
variable, while the explanatory variables that can predict the outcome are called
independent variables.
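For a single explanatory variable, such a model can be written as y = β₀ + β₁x. The sketch below estimates the intercept and coefficient by ordinary least squares. The data are hypothetical and were constructed to follow y = 2 + 3x exactly, so that the estimates are easy to verify.

```python
# Hypothetical data generated from y = 2 + 3x (illustration only).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0, 5.0, 8.0, 11.0, 14.0]


def fit_simple_linear_model(xs, ys):
    """Least-squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / sum(
        (a - mean_x) ** 2 for a in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope


beta_0, beta_1 = fit_simple_linear_model(x, y)


def predict(value):
    """Predicted outcome: intercept plus coefficient times the explanatory variable."""
    return beta_0 + beta_1 * value


print(f"intercept = {beta_0:.2f}, slope = {beta_1:.2f}")
print(f"prediction at x = 0: {predict(0.0):.2f}")  # equals the intercept, not zero
```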
The terminology in statistics and experimental design may sometimes be somewhat
confusing. In all practical applications, models like linear regression, analysis of
covariance (ANCOVA), analysis of variance (ANOVA), or general linear models
(GLM) are very similar. Their different terminology is due as much to the historical
tradition in statistical science as to differences in methodology. Many of these models
with their different names and terminologies can be expressed within the framework
of generalized linear models. Historically, it was common to develop mathematical
methods to estimate parameter values and p-values that could be calculated by hand
with the help of statistical tables. Most graduates in statistics are familiar with such
methods for simple regression and ANOVA. However, recent innovations in
mathematical statistics, and not least computers and software, have in an applied sense
replaced such manual methods. These computer-assisted methods are usually based
on the theory of so-called likelihood functions and involve finding their maximum
values by using iterations. In other words, these are methods where computer software
is needed for most applied circumstances. The theory behind maximum-likelihood
estimation is covered in several of the more advanced recommended textbooks.

Table 2.3 Two types of statistical errors: Type I and Type II errors and their relationship to the
significance level α and the statistical power (1 − β)

                          Null hypothesis (H0) is true    Alternative hypothesis (H1) is true
Accept null hypothesis    Correct decision                Type II error: β
Reject null hypothesis    Type I error: α                 Correct decision

Table 2.4 Proposed statistical tests or models depending on the properties of the outcome and
explanatory variable. A nonparametric alternative is given in brackets for use if assumptions of
normal distributions are not valid. The number of tests mentioned is limited, and recommendations
may vary depending on the nature of the data and the purpose of the analysis

Purpose of the statistical analysis, with the proposed test or model by type of outcome data:

Against a specific null hypothesis about an expected mean or proportion
    Nominal:     Chi-squared test
    Binary:      Binomial test
    Ordinal:     Chi-squared test
    Discrete:    One-sample t-test
    Continuous:  One-sample t-test

Relationship with a continuous explanatory variable
    Nominal:     Use a statistical model
    Binary:      Use a statistical model
    Ordinal:     Spearman correlation
    Discrete:    Pearson (Spearman) correlation
    Continuous:  Pearson (Spearman) correlation

Difference in expected mean or proportions between two groups
    Nominal:     Chi-squared test for crosstabs
    Binary:      Chi-squared test for crosstabs
    Ordinal:     Chi-squared test for crosstabs
    Discrete:    Two-sample t-test (Mann-Whitney U test)
    Continuous:  Two-sample t-test (Mann-Whitney U test)

Difference in mean or proportions between more than two groups
    Nominal:     Chi-squared test for crosstabs
    Binary:      Chi-squared test for crosstabs
    Ordinal:     Chi-squared test for crosstabs
    Discrete:    Analysis of variance (Kruskal-Wallis H test)
    Continuous:  Analysis of variance (Kruskal-Wallis H test)

Analyzed as a linear statistical model
    Nominal:     Multinomial logistic regression
    Binary:      Binary logistic regression
    Ordinal:     Ordinal logistic regression
    Discrete:    Linear regression/general linear model
    Continuous:  Linear regression/general linear model

Two clustered or repeated measurements
    Nominal:     McNemar-Bowker test
    Binary:      McNemar test
    Ordinal:     McNemar-Bowker test
    Discrete:    Paired-sample t-test (Wilcoxon signed-rank test)
    Continuous:  Paired-sample t-test (Wilcoxon signed-rank test)

Statistical model for clustered or repeated measurements
    Nominal:     Mixed multinomial logistic regression or GEE
    Binary:      Mixed binary logistic regression or GEE
    Ordinal:     Mixed ordinal logistic regression or GEE
    Discrete:    Linear mixed model or GEE
    Continuous:  Linear mixed model or GEE

GEE: generalized estimating equations
Linear statistical models are often described within the framework of generalized
linear models. The type of model is determined by the properties of the outcome
variable. A dependent variable with continuous data is usually expressed with an
identity link and is often referred to by more traditional terms such as linear regression
or analysis of variance. If the dependent variable is binary, then it is usually expressed
by a logit link and is often referred to by the more traditional term logistic regression.
Count data use a log link, and the statistical model is traditionally referred to as
Poisson regression (e.g., Dobson and Barnett 2008).
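The three link functions mentioned here can be written out explicitly. The sketch below defines each link and shows how the same linear predictor β₀ + β₁x translates into a continuous outcome, a probability, or an expected count. The coefficient values are made up for illustration, and the function names are not taken from any particular statistical package.

```python
import math


def identity_link(mean):
    """Identity link: used for continuous outcomes (linear regression)."""
    return mean


def logit_link(probability):
    """Logit link: used for binary outcomes (logistic regression)."""
    return math.log(probability / (1 - probability))


def log_link(expected_count):
    """Log link: used for count outcomes (Poisson regression)."""
    return math.log(expected_count)


def inverse_logit(linear_predictor):
    """Map a linear predictor back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-linear_predictor))


# Hypothetical linear predictor b0 + b1 * x with made-up coefficients.
beta_0, beta_1, x = -1.0, 0.8, 2.0
linear_predictor = beta_0 + beta_1 * x

print(f"continuous outcome (identity): {linear_predictor:.3f}")
print(f"probability (inverse logit):   {inverse_logit(linear_predictor):.3f}")
print(f"expected count (inverse log):  {math.exp(linear_predictor):.3f}")
```

The inverse of each link guarantees a sensible range for the outcome: any real number for the identity link, a value between 0 and 1 for the logit link, and a positive value for the log link.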
References
Agresti A (2002) Categorical data analysis, 2nd edn. Wiley, Hoboken
Cleophas TJ, Zwinderman AH (2012) Statistics applied to clinical studies, 5th edn. Springer, Dordrecht
Devore JL, Berk KN (2012) Modern mathematical statistics with applications, 2nd edn. Springer, New York
Dobson AJ, Barnett A (2008) An introduction to generalized linear models, 3rd edn. CRC Press, London
Ibrahim JG, Molenberghs G (2009) Missing data methods in longitudinal studies: a review. Test 18:1–43. doi:10.1007/s11749-009-0138-x
Kaltenbach HM (2011) A concise guide to statistics. Springer, New York
Kleinbaum DG, Sullivan K, Barker N (2007) A pocket guide to epidemiology. Springer, New York
Lehmann EL (2011) Fisher, Neyman, and the creation of classical statistics. Springer, New York
Madsen B (2011) Statistics for non-statisticians. Springer, Heidelberg
Marques de Sá JP (2007) Applied statistics using SPSS, STATISTICA, MATLAB and R, 2nd edn. Springer, Berlin
Shahbaba B (2012) Biostatistics with R: an introduction to statistics through biological data. Springer, New York
Song PXK (2007) Missing data in longitudinal studies. In: Correlated data analysis: modeling, analytics, and applications. Springer, New York
Tamime AY, Robinson RK (2007) Tamime and Robinson's yoghurt: science and technology, 3rd edn. CRC Press, Cambridge
R Development Core Team (2012) The R project for statistical computing. http://www.r-project.org. Accessed 30 Apr 2012
Vittinghoff E, Glidden DV, Shiboski SC, McCulloch CE (2012) Regression methods in biostatistics: linear, logistic, survival and repeated measures models, 2nd edn. Springer, New York
http://www.springer.com/978-1-4614-5009-2