
Statistical Learning in Genetics: An Introduction Using R


Daniel Sorensen
Aarhus University
Aarhus, Denmark

Statistics for Biology and Health
ISSN 1431-8776   ISSN 2197-5671 (electronic)
ISBN 978-3-031-35850-0   ISBN 978-3-031-35851-7 (eBook)
https://doi.org/10.1007/978-3-031-35851-7

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.


To my beloved Pia
Preface

This book evolved from a set of notes written for a graduate course on Likelihood
and Bayesian Computations held at Aarhus University in 2016 and 2018. The
audience was life-science PhD students and post-docs with a background in either
biology, agriculture, medicine or epidemiology, who wished to develop analytic
skills to perform genomic research. This book is addressed to this audience of
numerate biologists, who, despite an interest in quantitative methods, lack the
formal mathematical background of the professional statistician. For this reason,
I offer considerably more detail in explanations and derivations than may be needed
for a more mathematically oriented audience. Nevertheless, some mathematical
and statistical prerequisites are needed in order to extract maximum benefit from
the book. These include introductory courses on calculus, linear algebra and
mathematical statistics, as well as a grounding in linear and nonlinear regression and
mixed models. Applied statistics and biostatistics students may also find the book
useful, but may wish to browse hastily through the introductory chapters describing
likelihood and Bayesian methods.
I have endeavoured to write in a style that appeals to the quantitative biologist,
while remaining concise and using examples profusely. The intention is to cover
ground at a good pace, facilitating learning by interconnecting theory with examples
and providing exercises with their solutions. Many exercises involve programming
with the open-source package R, statistical software that can be downloaded and
used with the free graphical user interface RStudio. Most of today’s students
are competent in R and there are many tutorials online for the uninitiated. The
R-code needed to solve the exercises is provided in all cases and is written,
with few exceptions, with the objective of being transparent rather than efficient.
The reader has the opportunity to run the codes and to modify input parameters
in an experimental fashion. This hands-on computing contributes to a better
understanding of the underlying theory.
The first objective of this introduction is to provide readers with an understanding
of the techniques used for analysis of data, with emphasis on genetic data. The
second objective is to teach them to implement these techniques. Meeting these
objectives is an initial step towards acquiring the skills needed to perform data-
driven genetics/genomics research. Despite the focus on genetic applications, the
mathematics of the statistical models and their implementation are relevant for
many other branches of quantitative methods. An appendix in the opening chapter
provides an overview of basic quantitative genomic concepts, making the book more
accessible to an audience of "non-geneticists".
I have attempted to give a balanced account of frequentist/likelihood and
Bayesian methods. Both approaches are used in classical quantitative genetic and
modern genomic analyses and constitute essential ingredients in the toolkit of the
well-trained quantitative biologist.
The book is organised in three parts. Part I (Chaps. 2–5) presents an overview
of likelihood and Bayesian inference. Chapter 2 introduces the basic elements
of the likelihood paradigm, including the likelihood function, the score and the
maximum likelihood estimator. Properties of the maximum likelihood estimator are
summarised and several examples illustrate the construction of simple likelihood
models, the derivation of the maximum likelihood estimators and their properties.
Chapter 3 provides a review of three computational methods for fitting likelihood
models: Newton-Raphson, the EM (expectation-maximisation) algorithm and gra-
dient descent. After a brief description of the methods and the essentials of their
derivation, several examples (13 in all) are developed to illustrate their imple-
mentation. Chapter 4 covers the basics of the Bayesian approach, mostly through
examples. The first set of examples illustrates the type of inferences that are possible
(joint, conditional and marginal inferences) when the posterior distributions have
known closed forms. In this case, inferences can be exact using analytical methods,
or can be approximated using Monte Carlo draws from the posterior distribution. A
number of options are available when the posterior distribution is only known up
to proportionality. After a very brief account of Bayesian asymptotics, the chapter
focuses on Markov chain Monte Carlo (McMC) methods. These are recipes for
generating approximate draws from posterior distributions. Using these draws, one
can obtain Monte Carlo estimates of the complete posterior distribution, or Monte
Carlo estimates of summaries such as the mean, variance and posterior intervals. The
chapter provides a description of the Gibbs sampling algorithm and of the joint and
single-site updating of parameters based on the Metropolis-Hastings algorithm. An
overview of the tools needed for analysis of the McMC output concludes the chapter.
An appendix provides the mathematical details underlying the magic of McMC
within the constraints imposed by the author’s limited mathematics. Chapter 5
illustrates applications of McMC. Several of the examples discussed in connection
with Newton-Raphson and the EM algorithm are revisited and implemented from a
Bayesian McMC perspective.
Part II of the book has the heading Prediction. The boundaries between Parts I
and II should not be construed as rigid. However, the heading emphasises the main
thread of Chaps. 6–11, with an important detour in Chap. 8 that discusses mul-
tiple testing. Chapter 6 introduces many important ingredients of prediction: best
predictor, best linear predictor, overfitting, bias-variance trade-off, cross-validation.
Among the topics discussed are the accuracy with which future observations can be
predicted, how this accuracy is measured, the factors affecting it and, importantly,
how a measure of uncertainty can be attached to accuracy. The body of the chapter
deals with prediction from a classical/frequentist perspective. Bayesian prediction
is illustrated in several examples throughout the book and particularly in Chap. 10.
In Chap. 6, many important ideas related to prediction are illustrated using a simple
least-squares setting, where the number of records n is larger than the number of
parameters p of the model; this is the n > p setup. However, in many modern
genetic problems, the number of parameters greatly exceeds the number of records;
the p ≫ n setup. This calls for some form of regularisation, a topic introduced
in Chap. 7 under the heading Shrinkage Methods. After an introduction to ridge
regression, the chapter provides a description of the lasso (least absolute shrinkage
and selection operator) and of a Bayesian spike and slab model. The spike and
slab model can be used both for prediction and for discovery of relevant covariates
that have an effect on the records. In a genetic context, these covariates could be
observed genetic markers and the challenge is how to find as many promising mark-
ers among the hundreds of thousands available, while incurring a low proportion
of false positives. This leads to the topic reviewed in Chap. 8: False Discovery
Rate. The subject is first presented from a frequentist perspective as introduced
by Benjamini and Hochberg in their highly acclaimed work, and is also discussed
using empirical Bayesian and fully Bayesian approaches. The latter is implemented
within an McMC environment using the spike and slab model as the driving engine.
The complete marginal posterior distribution of the false discovery rate can be
obtained as a by-product of the McMC algorithm. Chapter 9 describes some of
the technical details associated with prediction for binary data. The topics discussed
include logistic regression for the analysis of case-control studies, where the data are
collected in a non-random fashion, penalised logistic regression, lasso and spike and
slab models implemented for the analysis of binary records, area under the curve
(AUC) and prediction of a genetic disease of an individual, given information on
the disease status of its parents. The chapter concludes with an appendix providing
technical details for an approximate analysis of binary traits. The approximation
can be useful as a first step, before launching the full McMC machinery of a more
formal approach. Chapter 10 deals with Bayesian prediction, where many of the
ideas scattered in various parts of the book are brought into focus. The chapter
discusses the sources of uncertainty of predictors from a Bayesian and frequentist
perspective and how they affect accuracy of prediction as measured by the Bayesian
and frequentist expectations of the sample mean squared error of prediction. The
final part of the chapter introduces, via an example, how specific aspects of a
Bayesian model can be tested using posterior predictive simulations, a topic that
combines frequentist and Bayesian ideas. Chapter 11 completes Part II and provides
an overview of selected nonparametric methods. After an introduction of traditional
nonparametric models, such as the binned estimator and kernel smoothing methods,
the chapter concentrates on four more recent approaches: kernel methods using basis
expansions, neural networks, classification and regression trees, and bagging and
random forests.
Part III of the book consists of exercises and their solutions. The exercises
(Chap. 12) are designed to provide the reader with deeper insight into the subjects
discussed in the body of the book. A complete set of solutions, many involving
programming, is available in Chap. 13.
The majority of the datasets used in the book are simulated and intended to illustrate
important features of real-life data. The size of the simulated data is kept within the
limits necessary to obtain solutions in reasonable CPU time, using straightforward
R-code, although the reader may modify size by changing input parameters.
Advanced computational techniques required for the analysis of very large datasets
are not addressed. This subject requires a specialised treatment beyond the scope of
this book.
The book has not had the benefit of having been used as material in repeated
courses by a critical mass of students, who invariably stimulate new ideas, help with
a deeper understanding of old ones and, not least, spot errors in the manuscript and
in the problem sections. Despite these shortcomings, the book is complete and out
of my hands. I hope the critical reader will make me aware of the errors. These
will be corrected and listed on the web at https://github.com/SorensenD/SLGDS.
The GitHub site also contains most of the R-codes used in the book, which can be
downloaded, as well as notes that include comments, clarifications or additions of
themes discussed in the book.

Aarhus, Denmark Daniel Sorensen


May 2023
Acknowledgements

Many friends and colleagues have assisted in a variety of ways. Bernt Guldbrandtsen
(University of Copenhagen) has been a stable helping hand and helping mind.
Bernt has generously shared his deep biological and statistical knowledge with
me on many, many occasions, and also provided endless advice on LaTeX and
Markdown issues and on programming details, always with good spirits and patience.
I owe much to him. Ole Fredslund Christensen (Aarhus University) read several
chapters and wrote a meticulous list of corrections and suggestions. I am very
grateful to him for this effort. Gustavo de los Campos (Michigan State University)
has shared software codes and tricks and contributed with insight in many parts
of the book, particularly in Prediction and Kernel Methods. I have learned much
during the years of our collaboration. Parts of the book were read by Andres Legarra
(INRA), Miguel Pérez Enciso (University of Barcelona), Bruce Walsh (University
of Arizona), Rasmus Waagepetersen (Aalborg University), Peter Sørensen (Aarhus
University), Kenneth Enevoldsen (Aarhus University), Agustín Blasco (Universidad
Politécnica de Valencia), Jens Ledet Jensen (Aarhus University), Fabio Morgante
(Clemson University), Doug Speed (Aarhus University), Bruce Weir (University
of Washington), Rohan Fernando (retired from Iowa State University) and Daniel
Gianola (retired from the University of Wisconsin-Madison). I received many
helpful comments, suggestions and corrections from them. However, I am solely
responsible for the errors that escaped scrutiny. I would be thankful if I could be
made aware of these errors.
I acknowledge Eva Hiripi, Senior Editor, Statistics Books, Springer, for consis-
tent support during this project.
I am the grateful recipient of many gifts from my wife Pia. One has been essential
for concentrating on my task: happiness.

Contents

1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Sampling Distribution of a Random Variable . . . . . . . . . . . . . . . . . . 3
1.3 The Likelihood and the Maximum Likelihood Estimator . . . . . . . . . . 5
1.4 Incorporating Prior Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Frequentist or Bayesian? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.7 Appendix: A Short Overview of Quantitative Genomics. . . . . . . . . . . 32

Part I Fitting Likelihood and Bayesian Models


2 Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.1 A Little Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.2 Summary of Likelihood Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.3 Example: The Likelihood Function of Transformed Data. . . . . . . . . . 59
2.4 Example: Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
2.5 Example: Bivariate Normal Model with Missing Records . . . . . . . . . 63
2.6 Example: Likelihood Inferences Using Selected Records. . . . . . . . . . 66
2.7 Example: The Likelihood Function with Truncated Data . . . . . . . . . . 71
2.8 Example: The Likelihood Function of a Genomic Model . . . . . . . . . . 72
3 Computing the Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1 Newton-Raphson and the Method of Scoring . . . . . . . . . . . . . . . . . . . . . . . 77
3.2 Gradient Descent and Stochastic Gradient Descent . . . . . . . . . . . . . . . . 98
3.3 The EM Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
4 Bayesian Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.1 Example: Estimating the Mean and Variance of a Normal
Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
4.2 Posterior Predictive Distribution for a New Observation . . . . . . . . . . . 151
4.3 Example: Monte Carlo Inferences of the Joint Posterior
Distribution of Mean and Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153

4.4 Approximating a Marginal Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
4.5 Example: The Normal Linear Mixed Model . . . . . . . . . . . . . . . . . . . . . . . . 156
4.6 Example: Inferring a Variance Component from a
Marginal Posterior Distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
4.7 Example: Bayesian Learning—Inheritance of Haemophilia . . . . . . . 162
4.8 Example: Bayesian Learning—Updating Additive
Genetic Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.9 A Brief Account of Bayesian Asymptotics . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.10 An Overview of Markov Chain Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . 171
4.11 The Metropolis-Hastings Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.12 The Gibbs Sampling Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
4.13 Output Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
4.14 Appendix: A Closer Look at the McMC Machinery . . . . . . . . . . . . . . . 194
5 McMC in Practice . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.1 Example: Estimation of Gene Frequencies from ABO
Blood Group Phenotypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
5.2 Example: A Regression Model for Binary Data . . . . . . . . . . . . . . . . . . . . 213
5.3 Example: A Regression Model for Correlated Binary Data . . . . . . . . 220
5.4 Example: A Genomic Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.5 Example: A Mixture Model of Two Gaussian Components . . . . . . . 234
5.6 Example: An Application of the EM Algorithm
in a Bayesian Context—Estimation of SNP Effects . . . . . . . . . . . . . . . . 239
5.7 Example: Bayesian Analysis of the Truncated Normal Model. . . . . 244
5.8 A Digression on Model Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

Part II Prediction
6 Fundamentals of Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.1 Best Predictor and Best Linear Predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.2 Estimating the Regression Function in Practice: Least Squares . . . 263
6.3 Overview of Things to Come . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.4 The Bias-Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
6.5 Estimation of Validation MSE of Prediction in Practice . . . . . . . . . . . 280
6.6 On Average Training MSE Underestimates Validation MSE . . . . . . 284
6.7 Least Squares Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7 Shrinkage Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
7.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
7.2 The Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.3 An Extension of the Lasso: The Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . 319
7.4 Example: Prediction Using Ridge Regression and Lasso . . . . . . . . . . 319
7.5 A Bayesian Spike and Slab Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
8 Digression on Multiple Testing: False Discovery Rates. . . . . . . . . . . . . . . . . 333
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
8.2 Preliminaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
8.3 The Benjamini-Hochberg False Discovery Rate . . . . . . . . . . . . . . . . . . . . 338
8.4 A Bayesian Approach for a Simple Two-Group Mixture
Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
8.5 Empirical Bayes Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
8.6 Local False Discovery Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
8.7 Storey’s q-Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
8.8 Fully Bayesian McMC False Discovery Rate . . . . . . . . . . . . . . . . . . . . . . . 350
8.9 Example: A Two-Component Gaussian Mixture . . . . . . . . . . . . . . . . . . . 352
8.10 Example: The Spike and Slab Model with Genetic Markers . . . . . . . 361
9 Binary Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369
9.1 Prediction for Binary Observations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
9.2 Mean Squared Error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
9.3 Logistic Regression with Non-random Sampling. . . . . . . . . . . . . . . . . . . 375
9.4 Penalised Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
9.5 The Lasso with Binary Records . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
9.6 A Bayesian Spike and Slab Model for Binary Records . . . . . . . . . . . . 380
9.7 Area Under the Curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
9.8 Prediction of Disease Status of Individual Given Disease
Status of relatives. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
9.9 Appendix: Approximate Analysis of Binary Traits . . . . . . . . . . . . . . . . . 411
10 Bayesian Prediction and Model Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
10.1 Levels of Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
10.2 Prior and Posterior Predictive Distributions. . . . . . . . . . . . . . . . . . . . . . . . . 419
10.3 Bayesian Expectations of MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 428
10.4 Example: Bayesian and Frequentist Measures of Uncertainty . . . . . 430
10.5 Model Checking Using Posterior Predictive Distributions . . . . . . . . . 435
11 Nonparametric Methods: A Selected Overview . . . . . . . . . . . . . . . . . . . . . . . . . 445
11.1 Local Kernel Smoothing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
11.2 Kernel Methods Using Basis Expansions . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
11.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 489
11.4 Classification and Regression Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 511
11.5 Bagging and Random Forests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
11.6 Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533

Part III Exercises and Solutions


12 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
12.1 Likelihood Exercises I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543
12.2 Likelihood Exercises II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
12.3 Bayes Exercises I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 555
12.4 Bayes Exercises II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 556
12.5 Prediction Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562
13 Solution to Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
13.1 Likelihood Exercises I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575
13.2 Likelihood Exercises II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
13.3 Bayes Exercises I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 614
13.4 Bayes Exercises II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631
13.5 Prediction Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 651

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
Author Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
Chapter 1
Overview

1.1 Introduction

Suppose there is a set of data consisting of observations in humans on forced
expiratory volume (FEV, a measure of lung function; lung function is a predictor
of health and a low lung function is a risk factor for mortality), or on the presence or
absence of heart disease, and that there are questions that could be answered using
these data. For example, a statistical geneticist may wish to know:
1. Is there a genetic component contributing to the total variance of these traits?
A positive answer suggests that genetic factors are at play. The next step would
be to investigate the following:
2. Is the genetic component of the traits driven by a few genes located on
a particular chromosome, or are there many genes scattered across many
chromosomes? How many genes are involved and is this a scientifically sensible
question?
3. Are the genes detected protein-coding genes, or are there also noncoding genes
involved in gene regulation?
4. How is the strength of the signals captured in a statistical analysis related to the
two types of genes? What fraction of the total genetic variation is allocated to
both types of genes?
5. What are the frequencies of the genes in the sample? Are the frequencies
associated with the magnitude of their effects on the traits?
6. What is the mode of action of the genes?
7. What proportion of the genetic variance estimated in 1 can be explained by the
discovered genes?
8. Given the information on the set of genes carried by an individual, will a
genetic score constructed before observing the trait help with early diagnosis
and prevention?
9. How should the predictive ability of the score be measured?
10. Are there other non-genetic factors that affect the traits, such as smoking
behaviour, alcohol consumption, blood pressure measurements, body mass
index and level of physical exercise?
11. Could the predictive ability of the genetic score be improved by incorporation
of these non-genetic sources of information, either additively or considering
interactions? What is the relative contribution from the different sources of
information?
The first question has been the focus of quantitative genetics for many years,
long before the so-called genomic revolution, that is, before breakthroughs in
molecular biology made the sequencing of whole genomes technically and
economically possible, resulting in hundreds of thousands or millions of genetic
markers (single nucleotide polymorphisms, SNPs) for each individual in the data set.
Until the end of the twentieth century, before dense genetic marker data were available,
genetic variation of a given trait was inferred using resemblance between relatives.
This requires equating the expected proportion of genotypes shared identical
by descent, given a pedigree, with the observed phenotypic correlation between
relatives. The fitted models also retrieve “estimates of random effects”, the predicted
genetic values that act as genetic scores and are used in selection programs of farm
animals and plants.
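
To make the idea of inferring genetic variation from resemblance between relatives concrete, here is a minimal R sketch. It is not taken from the book; the heritability, sample size and variance values are invented for illustration. It simulates parent and offspring phenotypes under a purely additive model and recovers the heritability from the parent-offspring regression, whose expected slope is h2/2:

# Hedged illustration (not from the book): heritability from parent-offspring
# resemblance under a simple additive model; all parameter values are invented.
set.seed(987)
n  <- 5000                                   # number of sire-offspring pairs
h2 <- 0.5                                    # true heritability
va <- h2; ve <- 1 - h2                       # additive and environmental variances
a_s <- rnorm(n, 0, sqrt(va))                 # sire additive genetic values
a_d <- rnorm(n, 0, sqrt(va))                 # dam additive genetic values
a_o <- 0.5 * (a_s + a_d) + rnorm(n, 0, sqrt(0.5 * va))  # offspring genetic values
y_s <- a_s + rnorm(n, 0, sqrt(ve))           # sire phenotypes
y_o <- a_o + rnorm(n, 0, sqrt(ve))           # offspring phenotypes
2 * coef(lm(y_o ~ y_s))[2]                   # expected slope is h2/2; estimate of h2

Doubling the regression slope works because a parent and its offspring are expected to share half of their genes identical by descent, which is exactly the equating of expected relationship with observed phenotypic resemblance described above.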
Answers to questions 2–7 would provide insight into genetic architecture and
thereby, into the roots of many complex traits and diseases. This has important
practical implications for drug therapies targeted to particular metabolic pathways,
for personalised medicine and for improved prediction. These questions could not
be sensibly addressed before dense marker data became available (perhaps with
the exception provided by complex segregation analysis that allowed searching for
single genes).
Shortly after a timid start, when low-density genetic marker information made its
appearance, the first decade of the twenty-first century saw the construction
of large biomedical databases in which health information was collected and could
be accessed for research purposes. One such database was the British 1958 cohort
study, including medical records from approximately 3000 individuals genotyped
for one million SNPs. These data provided for the first time the opportunity to begin
addressing questions 2–7. However, a problem had to be faced: how to fit and
validate a model with one million unknowns to a few thousand records and how to
find a few promising genetic markers from the million available while avoiding a large
proportion of false positives? This resulted in a burst of activity in the fields of
computer science and statistics, leading to development of a methodology designed
to meet the challenges posed by Big Data.
In recent years, the amount of information in modern data sets has
grown and become formidable, and the challenges have not diminished. One
example is the UK Biobank, which provides a wealth of health information
from half a million UK participants. The database is regularly updated and
a team of scientists recently reported the completion of the exome sequence
(about 2% of the genome, involved in coding for proteins and
considered to be important for identifying disease-causing or rare genetic
variants). The study involved more than 150,000 individuals genotyped
for more than 500 million SNPs (Halldorsson et al. 2022). These data are
paired with detailed medical information and constitute an unparalleled
resource for linking human genetic variation to human biology and dis-
ease.
An important task for the statistical geneticist is to adapt, develop and implement
models that can extract information from these large-scale data and to contribute to
finding answers to the 11 questions posed above. This is an exercise on inference
(such as estimation of genetic variation), on gene detection (among the millions
of genetic markers that may be included in a probability model, how to screen
the “relevant” ones for further study?), on prediction (how does the quality of
prediction of future records, for example, outcome of a disease, improve with this
new knowledge about the trait?) and on how to fit the probability models. There are
several areas of expertise that must be developed in order to fulfil this data-driven
research task. An initial step is to understand the methodology that underlies the
probability models and to learn the modern computer-intensive methods required
for fitting these models. The objective of this book is to guide the reader to take this
first step.
This opening chapter gives an overview of the book’s content, omitting many
technicalities that are revealed in later chapters, and is intended to give a flavour
of the way ahead. The first part is about methodology and introduces, by means of
an example, the concepts of probability distribution, likelihood and the maximum
likelihood estimator. This is followed by a brief description of Bayesian methods
indicating how prior knowledge can be incorporated in a probability model and
how it can affect inferences. The second part of the chapter presents models
for prediction and for detection of genes using parametric and nonparametric
approaches. There is an appendix that offers a brief tour of the quantitative
genetic/genomic model. The goal is to introduce the jargon and the basic quanti-
tative genetic/genomic concepts used in the book.

1.2 The Sampling Distribution of a Random Variable

A useful starting point is to establish the distinction between a probability distribu-
tion and a likelihood function. For example, assume a random variable X that has a
Bernoulli probability distribution. This random variable can take 1 or 0 as possible
values (more generally, it can have two modalities) with probabilities θ and 1 − θ,
respectively. The mean of the distribution is

E(X | θ) = 0 × Pr(X = 0 | θ) + 1 × Pr(X = 1 | θ) = θ.
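
As a hedged illustration (not code from the book; the value of θ and the number of draws are arbitrary choices), the following R lines draw Bernoulli samples and check that the sample mean approaches θ:

# Hedged illustration (not from the book): Monte Carlo check that the mean of a
# Bernoulli random variable equals theta; parameter values are arbitrary.
set.seed(371)
theta <- 0.3                                 # Pr(X = 1 | theta)
x <- rbinom(100000, size = 1, prob = theta)  # 0/1 draws from a Bernoulli distribution
mean(x)                                      # sample mean, close to theta = 0.3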
