
Foundations of Applied Statistical Methods Full Book Download

The document is an introduction to the book 'Foundations of Applied Statistical Methods' by Hang Lee, aimed at researchers and students needing a solid understanding of applied statistics. It addresses gaps in traditional statistics education by focusing on foundational concepts without overwhelming mathematical complexity. The book covers various statistical methods, descriptive statistics, and probability models, making it suitable for both self-study and as a textbook for non-statistics majors.

Foundations of Applied Statistical Methods



Hang Lee

Foundations of Applied
Statistical Methods
Hang Lee
Department of Biostatistics
Massachusetts General Hospital
Boston, MA, USA

ISBN 978-3-319-02401-1 ISBN 978-3-319-02402-8 (eBook)


DOI 10.1007/978-3-319-02402-8
Springer Cham Heidelberg New York Dordrecht London

Library of Congress Control Number: 2013951231

© Springer International Publishing Switzerland 2014


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this
publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s
location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

Researchers who design and conduct experiments or sample surveys, perform
statistical inference, and write scientific reports need adequate knowledge of applied
statistics. To build adequate and sturdy knowledge of applied statistical methods,
a firm foundation is essential. I have come across many researchers who had studied
statistics in the past but are still far from ready to apply what they learned
to their problem solving, and others who have forgotten what they had learned. This
could be partly because the mathematical technicality of the study material
was above their mathematics proficiency, or because the worked examples they studied
often failed to address the essential fundamentals of the applied methods. This book is
written to fill the gap between traditional textbooks, which involve ample amounts of
technically challenging mathematical expressions, and worked-example-oriented
data analysis guidebooks, which often underemphasize fundamentals.
The chapters of this book are dedicated to spelling out and demonstrating, not merely
explaining, the necessary foundational ideas so that motivated readers can learn to fully
appreciate the fundamentals of the commonly applied methods and revive their
forgotten knowledge of the methods without having to deal with complex
mathematical derivations or attempt to generalize oversimplified worked examples of
plug-and-play techniques. Detailed mathematical expressions are exhibited only if
they are definitional or intuitively comprehensible. Data-oriented examples are
illustrated only to aid the demonstration of fundamental ideas. This book can be
used as a self-review guidebook for applied researchers or as a textbook for an
introductory statistical methods course for students not majoring in statistics.

Boston, MA, USA Hang Lee

Contents

1 Warming Up: Descriptive Statistics and Essential
Probability Models .................................................................................. 1
1.1 Types of Data.................................................................................... 1
1.2 Description of Data Pattern .............................................................. 2
1.2.1 Distribution ......................................................................... 2
1.2.2 Description of Categorical Data Distribution ..................... 3
1.2.3 Description of Continuous Data Distribution ..................... 3
1.2.4 Stem-and-Leaf .................................................................... 5
1.3 Descriptive Statistics ........................................................................ 8
1.3.1 Statistic ............................................................................... 8
1.3.2 Central Tendency Descriptive Statistics
for Quantitative Outcomes.................................................. 8
1.3.3 Dispersion Descriptive Statistics
for Quantitative Outcomes.................................................. 9
1.3.4 Variance .............................................................................. 9
1.3.5 Standard Deviation ............................................................. 11
1.3.6 Property of Standard Deviation After Data
Transformations .................................................................. 11
1.3.7 Other Descriptive Statistics for Dispersion ........................ 13
1.3.8 Dispersions Among Multiple Data Sets ............................. 14
1.3.9 Caution to CV Interpretation .............................................. 15
1.3.10 Box and Whisker Plot......................................................... 16
1.4 Descriptive Statistics for Describing Relationships
Between Two Outcomes................................................................... 18
1.4.1 Linear Correlation Between Two Continuous Outcomes ... 18
1.4.2 Contingency Table to Describe an Association
Between Two Categorical Outcomes.................................. 19
1.4.3 Odds Ratio .......................................................................... 20


1.5 Two Useful Probability Distributions ............................................... 21


1.5.1 Gaussian Distribution ........................................................... 21
1.5.2 Density Function of Gaussian Distribution .......................... 21
1.5.3 Application of Gaussian Distribution ................................... 22
1.5.4 Standard Normal Distribution .............................................. 23
1.5.5 Binomial Distribution ........................................................... 25
1.6 Study Questions................................................................................ 29
Bibliography .............................................................................................. 29
2 Statistical Inference Focusing on a Single Mean .................................. 31
2.1 Population and Sample ..................................................................... 31
2.1.1 Sampling and Non-sampling Errors ..................................... 31
2.1.2 Sample- and Sampling Distributions .................................... 32
2.1.3 Standard Error ...................................................................... 33
2.2 Statistical Inference .......................................................................... 35
2.2.1 Data Reduction and Related Nomenclature ......................... 35
2.2.2 Central Limit Theorem ......................................................... 35
2.2.3 The t-Distribution ................................................................. 37
2.2.4 Testing Hypotheses............................................................... 39
2.2.5 Accuracy and Precision ........................................................ 48
2.2.6 Interval Estimation and Confidence Interval ........................ 50
2.2.7 Bayesian Inference ............................................................... 54
2.2.8 Study Design and Its Impact to Accuracy and Precision ..... 56
2.3 Study Questions................................................................................ 61
Bibliography .............................................................................................. 62
3 t-Tests for Two Means Comparisons ..................................................... 63
3.1 Independent Samples t-Test for Comparing Two
Independent Means .......................................................................... 63
3.1.1 Independent Samples t-Test When Variances
Are Unequal ......................................................................... 66
3.1.2 Denominator Formulae of the Test Statistic
for Independent Samples t-Test ............................................ 67
3.1.3 Connection to the Confidence Interval ................................. 67
3.2 Paired Sample t-Test for Comparing Paired Means ......................... 68
3.3 Use of Excel for t-Tests .................................................................... 71
3.4 Study Questions................................................................................ 71
Bibliography .............................................................................................. 74
4 Inference Using Analysis of Variance for Comparing
Multiple Means........................................................................................ 75
4.1 Sums of Squares and Variances ........................................................ 75
4.2 F-Test................................................................................................ 77
4.3 Multiple Comparisons and Increased Type-1 Error ......................... 81
4.4 Beyond Single-Factor ANOVA ........................................................ 82
4.4.1 Multi-factor ANOVA ............................................................ 82
4.4.2 Interaction............................................................................. 82

4.4.3 Repeated Measures ANOVA ................................................ 84


4.4.4 Use of Excel for ANOVA ..................................................... 85
4.5 Study Questions................................................................................ 85
Bibliography .............................................................................................. 86
5 Linear Correlation and Regression ....................................................... 87
5.1 Inference of a Single Pearson’s Correlation Coefficient .................. 87
5.1.1 Q & A Discussion ................................................................ 88
5.2 Linear Regression Model with One Independent Variable:
Simple Regression Model ................................................................ 88
5.3 Simple Linear Regression Analysis ................................................. 89
5.4 Linear Regression Models with Multiple
Independent Variables ...................................................................... 94
5.5 Logistic Regression Model with One Independent Variable:
Simple Logistic Regression Model .................................................. 95
5.6 Consolidation of Regression Models ............................................... 98
5.6.1 General and Generalized Linear Models .............................. 98
5.6.2 Multivariate Analyses and Multivariate Model .................... 99
5.7 Application of Linear Models with Multiple
Independent Variables ...................................................................... 100
5.8 Worked Examples of General and Generalized Linear Models ........ 101
5.8.1 Worked Example of a General Linear Model....................... 101
5.8.2 Worked Example of a Generalized Linear Model
(Logistic Model) Where All Multiple Independent
Variables Are Dummy Variables .......................................... 102
5.9 Study Questions................................................................................ 103
Bibliography .............................................................................................. 104
6 Normal Distribution Assumption-Free Nonparametric Inference ..... 105
6.1 Comparing Two Proportions Using 2×2 Contingency Table ........... 105
6.1.1 Chi-Square Test for Comparing Two Independent
Proportions ........................................................................... 106
6.1.2 Fisher’s Exact Test................................................................ 109
6.1.3 Comparing Two Proportions in Paired Samples .................. 110
6.2 Normal Distribution Assumption-Free Rank-Based Methods
for Comparing Distributions of Continuous Outcomes ................... 112
6.2.1 Permutation Test ................................................................... 114
6.2.2 Wilcoxon’s Rank Sum Test .................................................. 115
6.2.3 Kruskal–Wallis Test.............................................................. 116
6.2.4 Wilcoxon’s Signed Rank Test .............................................. 117
6.3 Linear Correlation Based on Ranks.................................................. 118
6.4 About Nonparametric Methods ........................................................ 118
6.5 Study Questions................................................................................ 119
Bibliography .............................................................................................. 119

7 Methods for Censored Survival Time Data .......................................... 121


7.1 Censored Observations ..................................................................... 121
7.2 Probability of Survival Longer Than Certain Duration .................... 121
7.3 Statistical Comparison of Two Survival Distributions
with Censoring ................................................................................. 122
7.4 Study Question ................................................................................. 124
Bibliography .............................................................................................. 124
8 Sample Size and Power ........................................................................... 125
8.1 Sample Size for Interval Estimation of a Single Mean .................... 125
8.2 Sample Size for Hypothesis Tests .................................................... 126
8.2.1 Sample Size for Comparing Two Means Using
Independent Samples z- and t-Tests ..................................... 126
8.2.2 Sample Size for Comparing Two Proportions ...................... 130
8.3 Study Questions................................................................................ 132
Bibliography .............................................................................................. 132
9 Review Exercise Problems...................................................................... 133
9.1 Review Exercise 1 ............................................................................ 133
9.2 Review Exercise 2 ............................................................................ 139
9.2.1 Part A (30 Points): Questions 1–15 “True/False”
Questions, Please Explain/Criticize Why If You Chose to
Answer False (2 Points Each) .............................................. 139
9.2.2 Part B (15 Points): Questions 16.1–16.3 .............................. 141
9.2.3 Part C (15 Points): Questions 17–19 .................................... 142
9.2.4 Part D (10 Points): Questions 20–21 .................................... 143
9.2.5 Part E (5 Points): Question 22 .............................................. 143
9.2.6 Part F (20 Points): Questions 23–26..................................... 144
10 Probability Distribution of Standard Normal Distribution ................ 145
11 Percentiles of t-Distributions .................................................................. 149
12 Upper 95th and 99th Percentiles of Chi-Square Distributions ........... 151
13 Upper 95th Percentiles of F-Distributions ............................................ 153
14 Upper 99th Percentiles of F-Distributions ............................................ 155
15 Sample Sizes for Independent Samples t-Tests ..................................... 157

Index ................................................................................................................. 159


Chapter 1
Warming Up: Descriptive Statistics
and Essential Probability Models

This chapter shows how to make sense of gathered data before performing
formal statistical inference. The topics covered are types of data, how to visualize
data, how to summarize data into a few descriptive statistics (i.e., condensed
numerical indices), and an introduction to some useful probability models.

1.1 Types of Data

Data arising from clinical studies mostly fall into one of the following
categories.
Nominal categorical data contain qualitative information and appear as discrete
values that are codified into numbers or characters (e.g., 1 = case with a disease
diagnosis, 0 = control; M = male, F = female).
Ordinal categorical data are semi-quantitative and discrete, and the numeric
coding scheme orders the values, such as 1 = mild, 2 = moderate, and 3 = severe.
Note that a value of 3 (severe) is not necessarily three times as severe
as 1 (mild).
Count (number of events) data are quantitative and discrete (i.e., 0, 1, 2, …).
Interval scale data are quantitative and continuous. There is no absolute 0, and the
reference value is arbitrary. Examples of such data are temperature values
in °C and °F.
Ratio scale data are quantitative and continuous, and there is an absolute 0.
Examples of such data are body weight and height.
In most cases, data fall into the classification scheme
shown in Table 1.1, in which data types are classified as either quantitative
or qualitative, and as either discrete or continuous. Nonetheless, the definitions of
some data types may not be clear; in particular, the similarity and dissimilarity
between the ratio scale and the interval scale need further clarification.

H. Lee, Foundations of Applied Statistical Methods, DOI 10.1007/978-3-319-02402-8_1, 1



Table 1.1 Classifications of data types

             Qualitative                 Quantitative
Discrete     Nominal categorical         Ordinal categorical (e.g., 1=mild, 2=moderate, 3=severe)
             (e.g., M=male, F=female)    Count (e.g., number of incidences 0, 1, 2, 3, …)
Continuous   N/A                         Interval scale (e.g., temperature)
                                         Ratio scale (e.g., weight)

Ratio scale: Two distinct values of the ratio scale are ratio-able, i.e., their ratio is
meaningful. For example, the ratio of two distinct values of a ratio scale variable x,
x1/x2 = 2 for x1 = 200 and x2 = 100, can be interpreted as “twice as large.” Blood
cholesterol level, measured as the total volume of cholesterol molecules in a certain
unit, is such an example: if the ratio of person A's cholesterol level to person B's
cholesterol level is 2, then we can say that person A's cholesterol level is twice as
high as that of person B. Other such examples are lung volume, age, and disease duration.
Interval scale: If two distinct values of quantitative data are not ratio-able, then
such data are interval scale data. Temperature is a good example. Of the three
temperature systems (Fahrenheit, Celsius, and Kelvin), only the Kelvin system
has an absolute 0 (there is no negative temperature in the Kelvin system); the zero
points of the Fahrenheit and Celsius systems are arbitrary. For example,
200 °F is not twice as hot as 100 °F. We can only say that
200 °F is higher by 100 °F (i.e., the displacement between 200° and 100° is 100° on the
Fahrenheit measurement scale).
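
A short numeric check may make the distinction concrete. The sketch below is only an illustration: the helper name `fahrenheit_to_kelvin` is an assumption and not part of the book. It converts both Fahrenheit readings to Kelvin and compares the naive ratio with the ratio on the scale that has an absolute 0.

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert a Fahrenheit reading to Kelvin (hypothetical helper for illustration)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Naive ratio of the two Fahrenheit readings; the Fahrenheit zero is arbitrary,
# so this number has no physical interpretation.
naive_ratio = 200.0 / 100.0

# Ratio of the same two temperatures expressed in Kelvin, which has an absolute 0.
kelvin_ratio = fahrenheit_to_kelvin(200.0) / fahrenheit_to_kelvin(100.0)

# The difference (displacement) is meaningful on an interval scale.
difference_f = 200.0 - 100.0

print(f"Ratio of Fahrenheit readings: {naive_ratio:.2f}")
print(f"Ratio on the Kelvin scale:    {kelvin_ratio:.2f}")
print(f"Difference in Fahrenheit:     {difference_f:.0f} degrees")
```

The Kelvin ratio comes out well below 2, which is consistent with the statement that 200 °F is not twice as hot as 100 °F, while the 100° difference remains interpretable.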

1.2 Description of Data Pattern

1.2.1 Distribution

A distribution is a complete description of how large the chance (i.e.,
probability) of occurrence is for each unique datum or for a certain range of data.
The following two examples will help you grasp the concept. If you keep on rolling
a fair die, you expect to observe 1, 2, 3, 4, 5, or 6 with equal likelihood, i.e., the
probability of each unique outcome value is 1/6. We say “a probability of 1/6 is
distributed to the value of 1, 1/6 is distributed to 2, 1/6 to 3, 1/6 to 4, 1/6 to 5,
and 1/6 to 6, respectively.” Another example
is that if you keep on rolling a die many times, and each time you say “a success” if
the observed outcome is 5 or 6 and “a failure” otherwise, then your expected
chance of observing a success is 1/3 and that of a failure is 2/3. We say “a probability
of 1/3 is distributed to the success and 2/3 is distributed to the failure.” In real life,
there are many distributions that cannot be verbalized as simply as these two
examples and that require descriptions using sophisticated mathematical expressions.
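
The two die examples can be written out as a small sketch. The variable names and the simulated number of rolls below are assumptions chosen for illustration. The exact distributions are tabulated with fractions, and a seeded simulation is then compared with the exact success probability.

```python
from fractions import Fraction
import random

faces = [1, 2, 3, 4, 5, 6]

# Exact distribution of a fair die: probability 1/6 is distributed to each face.
die_distribution = {face: Fraction(1, 6) for face in faces}


def outcome_label(face: int) -> str:
    """Label a roll as a success (5 or 6) or a failure (1 to 4)."""
    return "success" if face in (5, 6) else "failure"


# Exact distribution of success/failure, obtained by adding up face probabilities.
success_failure = {"success": Fraction(0), "failure": Fraction(0)}
for face, prob in die_distribution.items():
    success_failure[outcome_label(face)] += prob

# Simulated rolls (seeded so the run is repeatable); the observed proportion
# of successes should be close to the exact probability.
rng = random.Random(0)
n_rolls = 60_000
rolls = [rng.choice(faces) for _ in range(n_rolls)]
observed_success = sum(outcome_label(r) == "success" for r in rolls) / n_rolls

print(success_failure)
print(round(observed_success, 3))
```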

Fig. 1.1 Frequency table and bar chart for describing nominal categorical data

Let’s discuss how to describe the distributions arising from various types of data.
One way to describe a set of collected data is to describe the distribution of
relative frequency for the observed individual values (e.g., which values are
more common and which are less common). Graphs, simple
tables, or a few summary numbers are commonly used.

1.2.2 Description of Categorical Data Distribution

A simple tabulation, also known as a frequency table, lists the observed count (and
the proportion as a percentage) for each category. A bar chart (see Figs. 1.1 and 1.2) is a
good visual description in which the horizontal axis defines the categories of the
outcome and the vertical axis shows the frequency of each observed category. The
size of each bar in the figures is the actual count. It is also common to present the
relative frequency (i.e., the proportion of each category as a percentage).
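
As a rough sketch of such a tabulation, the snippet below builds a frequency table with counts and percentages and prints a text bar for each category. The ten observations are hypothetical and are not the data behind Figs. 1.1 and 1.2.

```python
from collections import Counter

# Hypothetical nominal categorical data (M = male, F = female); not from the book.
observations = ["M", "F", "F", "M", "F", "F", "M", "F", "F", "F"]

counts = Counter(observations)
total = sum(counts.values())

# Frequency table: category -> (count, percentage of all observations).
frequency_table = {
    category: (count, 100.0 * count / total)
    for category, count in sorted(counts.items())
}

# Text version of a bar chart: the bar length is the actual count.
for category, (count, percent) in frequency_table.items():
    bar = "#" * count
    print(f"{category}  {count:2d}  {percent:5.1f}%  {bar}")
```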

1.2.3 Description of Continuous Data Distribution

Figure 1.3 is a listing of white blood cell (WBC) counts of 31 patients diagnosed with
a certain illness listed by the patient identification number. Does this listing itself tell
us the group characteristics such as the average and the variability among patients?

Fig. 1.2 Frequency table and bar chart for describing ordinal data

Fig. 1.3 List of WBC raw data of 31 subjects

Fig. 1.4 List of 31 individual WBC values in ascending order (annotations in the figure: Minimum Value, Median Value, Maximum Value)

How can we describe the distribution of these data, i.e., how much of the
chance is distributed to WBC = 5,200, how much to WBC = 3,100, and so on?
Such a description may be very cumbersome. As depicted in Fig. 1.4, listing the full
data in ascending order can be a primitive way to look at the distribution, but it
still does not adequately describe the distribution. An option is to visualize the
relative frequencies for grouped intervals of the observed data. Such a presentation is
called a histogram. To create a histogram, one first needs to create equally spaced WBC
categories and count how many observations fall into each category. Then a bar
graph can be drawn in which the size of each bar indicates the relative frequency of
that particular WBC interval category. This may be a daunting task. Rather than covering
the techniques for creating a histogram, the next section introduces an alternative option.
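
For readers who want to see the binning step spelled out, here is a minimal sketch. The ten WBC values, the bin width, and the starting edge are all assumptions for illustration; they are not the 31 values shown in Fig. 1.3.

```python
# Hypothetical WBC counts; not the 31 values listed in the book's figures.
wbc = [3100, 4200, 4800, 5200, 5200, 5600, 6100, 6300, 7400, 8900]

bin_width = 2000  # assumed width of each equally spaced interval
bin_start = 2000  # assumed lower edge of the first interval


def bin_index(value: int) -> int:
    """Return the index of the interval [lower, upper) that contains value."""
    return (value - bin_start) // bin_width


# Count how many observations fall into each interval.
n_bins = bin_index(max(wbc)) + 1
counts = [0] * n_bins
for value in wbc:
    counts[bin_index(value)] += 1

# Relative frequency of each interval.
relative_freq = [c / len(wbc) for c in counts]

# Text histogram: one row per interval, bar length equal to the count.
for i, (c, rf) in enumerate(zip(counts, relative_freq)):
    lower = bin_start + i * bin_width
    upper = lower + bin_width
    print(f"[{lower:5d}, {upper:5d})  {rf:4.0%}  {'#' * c}")
```

Changing `bin_width` changes the shape of the picture, which is one reason building a histogram by hand takes some care.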

1.2.4 Stem-and-Leaf

The Stem-and-Leaf plot requires much less work than creating a conventional
histogram while providing the same information as the histogram does. It is
a quick and easy option for sketching the distribution of continuous data.
Let’s use a small data set for illustration, and then revisit our WBC data example
for more discussion (Fig. 1.10) after we become familiar with this method. The
following nine data points, 12, 32, 22, 28, 26, 45, 32, 21, and 85, are ages (ratio scale)
of a small group. Figures 1.5, 1.6, 1.7, 1.8, and 1.9 demonstrate how to create the
Stem-and-Leaf plot of these data.
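
The nine ages above can also be turned into a text Stem-and-Leaf plot with a few lines of code. This is a sketch under stated assumptions: the tens digit is taken as the stem, the ones digit as the leaf, and stems with no observations are printed as empty rows. The function name `stem_and_leaf` is hypothetical, and the layout may differ from the book's figures.

```python
ages = [12, 32, 22, 28, 26, 45, 32, 21, 85]


def stem_and_leaf(values):
    """Return the rows of a text Stem-and-Leaf plot (stem = tens, leaf = ones)."""
    leaves_by_stem = {}
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves_by_stem.setdefault(stem, []).append(leaf)

    rows = []
    # Include every stem between the smallest and largest, even if it is empty,
    # so that gaps in the data remain visible.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows


plot_lines = stem_and_leaf(ages)
print("\n".join(plot_lines))
```

Read sideways, the rows resemble the bars of a histogram with intervals of width 10, and the isolated value 85 stands out after the run of empty stems.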

Fig. 1.5 Step-by-step illustration of creating a Stem-and-Leaf plot

Fig. 1.6 Illustration of creating a Stem-and-Leaf plot

Fig. 1.7 Two Stem-and-Leaf plots describing the same data
