Statistical Learning in Genetics: An Introduction Using R
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland
AG 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
This book evolved from a set of notes written for a graduate course on Likelihood
and Bayesian Computations held at Aarhus University in 2016 and 2018. The
audience was life-science PhD students and post-docs with a background in either
biology, agriculture, medicine or epidemiology, who wished to develop analytic
skills to perform genomic research. This book is addressed to this audience of
numerate biologists, who, despite an interest in quantitative methods, lack the
formal mathematical background of the professional statistician. For this reason,
I offer considerably more detail in explanations and derivations than may be needed
for a more mathematically oriented audience. Nevertheless, some mathematical
and statistical prerequisites are needed in order to extract maximum benefit from
the book. These include introductory courses on calculus, linear algebra and
mathematical statistics, as well as a grounding in linear and nonlinear regression and
mixed models. Applied statistics and biostatistics students may also find the book
useful, but may wish to browse hastily through the introductory chapters describing
likelihood and Bayesian methods.
I have endeavoured to write in a style that appeals to the quantitative biologist,
while remaining concise and using examples profusely. The intention is to cover
ground at a good pace, facilitating learning by interconnecting theory with examples
and providing exercises with their solutions. Many exercises involve programming
with R, an open-source statistical software environment that can be downloaded
and used with the free graphical user interface RStudio. Most of today’s students
are competent in R and there are many tutorials online for the uninitiated. The
R-code needed to solve the exercises is provided in all cases and is written,
with few exceptions, with the objective of being transparent rather than efficient.
The reader has the opportunity to run the code and to modify input parameters
in an experimental fashion. This hands-on computing contributes to a better
understanding of the underlying theory.
The first objective of this introduction is to provide readers with an understanding
of the techniques used for analysis of data, with emphasis on genetic data. The
second objective is to teach them to implement these techniques. Meeting these
objectives is an initial step towards acquiring the skills needed to perform data-
how a measure of uncertainty can be attached to accuracy. The body of the chapter
deals with prediction from a classical/frequentist perspective. Bayesian prediction
is illustrated in several examples throughout the book and particularly in Chap. 10.
In Chap. 6, many important ideas related to prediction are illustrated using a simple
least-squares setting, where the number of records n is larger than the number of
parameters p of the model; this is the n > p setup. However, in many modern
genetic problems, the number of parameters greatly exceeds the number of records;
the p ≫ n setup. This calls for some form of regularisation, a topic introduced
in Chap. 7 under the heading Shrinkage Methods. After an introduction to ridge
regression, the chapter provides a description of the lasso (least absolute shrinkage
and selection operator) and of a Bayesian spike and slab model. The spike and
slab model can be used both for prediction and for discovery of relevant covariates
that have an effect on the records. In a genetic context, these covariates could be
observed genetic markers, and the challenge is to find as many promising markers
as possible among the hundreds of thousands available, while incurring a low
proportion of false positives. This leads to the topic reviewed in Chap. 8: False Discovery
Rate. The subject is first presented from a frequentist perspective as introduced
by Benjamini and Hochberg in their highly acclaimed work, and is also discussed
using empirical Bayesian and fully Bayesian approaches. The latter is implemented
within an McMC environment using the spike and slab model as the driving engine.
The complete marginal posterior distribution of the false discovery rate can be
obtained as a by-product of the McMC algorithm. Chapter 9 describes some of
the technical details associated with prediction for binary data. The topics discussed
include logistic regression for the analysis of case-control studies, where the data are
collected in a non-random fashion, penalised logistic regression, lasso and spike and
slab models implemented for the analysis of binary records, area under the curve
(AUC) and prediction of a genetic disease of an individual, given information on
the disease status of its parents. The chapter concludes with an appendix providing
technical details for an approximate analysis of binary traits. The approximation
can be useful as a first step, before launching the full McMC machinery of a more
formal approach. Chapter 10 deals with Bayesian prediction, where many of the
ideas scattered in various parts of the book are brought into focus. The chapter
discusses the sources of uncertainty of predictors from a Bayesian and frequentist
perspective and how they affect accuracy of prediction as measured by the Bayesian
and frequentist expectations of the sample mean squared error of prediction. The
final part of the chapter introduces, via an example, how specific aspects of a
Bayesian model can be tested using posterior predictive simulations, a topic that
combines frequentist and Bayesian ideas. Chapter 11 completes Part II and provides
an overview of selected nonparametric methods. After an introduction to traditional
nonparametric models, such as the binned estimator and kernel smoothing methods,
the chapter concentrates on four more recent approaches: kernel methods using basis
expansions, neural networks, classification and regression trees, and bagging and
random forests.
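The shrinkage theme of Chap. 7 can be given a small hands-on taste in base R. The sketch below (simulated data; the sample sizes and the penalty values are invented for illustration, not taken from the book) computes the closed-form ridge regression solution in a p > n setting, where ordinary least squares has no unique solution:

```r
set.seed(1)
n <- 50; p <- 200                       # more covariates than records: p > n
X <- matrix(rnorm(n * p), n, p)
beta <- c(rnorm(5), rep(0, p - 5))      # only the first 5 covariates have an effect
y <- X %*% beta + rnorm(n)

# With p > n, crossprod(X) = t(X) %*% X is singular and the normal
# equations have no unique solution. Ridge regression adds lambda * I,
# making the system solvable and shrinking all estimates towards zero.
lambda <- 10
beta_ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

# A larger penalty shrinks harder: the squared norm of the estimates decreases.
beta_ridge2 <- solve(crossprod(X) + 100 * diag(p), crossprod(X, y))
c(sum(beta_ridge^2), sum(beta_ridge2^2))
```

Increasing the penalty trades variance for bias; how to choose it in practice is part of the material of Chap. 7.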
Part III of the book consists of exercises and their solutions. The exercises
(Chap. 12) are designed to provide the reader with deeper insight into the subjects
discussed in the body of the book. A complete set of solutions, many involving
programming, is available in Chap. 13.
The majority of the datasets used in the book are simulated and intended to illustrate
important features of real-life data. The size of the simulated data is kept within the
limits necessary to obtain solutions in reasonable CPU time, using straightforward
R-code, although the reader may modify the size by changing input parameters.
Advanced computational techniques required for the analysis of very large datasets
are not addressed. This subject requires a specialised treatment beyond the scope of
this book.
The book has not had the benefit of having been used as material in repeated
courses by a critical mass of students, who invariably stimulate new ideas, help with
a deeper understanding of old ones and, not least, spot errors in the manuscript and
in the problem sections. Despite these shortcomings, the book is complete and out
of my hands. I hope the critical reader will make me aware of the errors. These
will be corrected and listed on the web at https://github.com/SorensenD/SLGDS.
The GitHub site also contains most of the R-codes used in the book, which can be
downloaded, as well as notes that include comments, clarifications or additions of
themes discussed in the book.
Many friends and colleagues have assisted in a variety of ways. Bernt Guldbrandtsen
(University of Copenhagen) has been a stable helping hand and helping mind.
Bernt has generously shared his deep biological and statistical knowledge with
me on many, many occasions, and also provided endless advice on LaTeX and
Markdown issues and on programming details, always with good spirits and patience.
I owe much to him. Ole Fredslund Christensen (Aarhus University) read several
chapters and wrote a meticulous list of corrections and suggestions. I am very
grateful to him for this effort. Gustavo de los Campos (Michigan State University)
has shared software codes and tricks and contributed with insight in many parts
of the book, particularly in Prediction and Kernel Methods. I have learned much
during the years of our collaboration. Parts of the book were read by Andres Legarra
(INRA), Miguel Pérez Enciso (University of Barcelona), Bruce Walsh (University
of Arizona), Rasmus Waagepetersen (Aalborg University), Peter Sørensen (Aarhus
University), Kenneth Enevoldsen (Aarhus University), Agustín Blasco (Universidad
Politécnica de Valencia), Jens Ledet Jensen (Aarhus University), Fabio Morgante
(Clemson University), Doug Speed (Aarhus University), Bruce Weir (University
of Washington), Rohan Fernando (retired from Iowa State University) and Daniel
Gianola (retired from the University of Wisconsin-Madison). I received many
helpful comments, suggestions and corrections from them. However, I am solely
responsible for the errors that escaped scrutiny. I would be thankful if I could be
made aware of these errors.
I acknowledge Eva Hiripi, Senior Editor, Statistics Books, Springer, for consistent
support during this project.
I am the grateful recipient of many gifts from my wife Pia. One has been essential
for concentrating on my task: happiness.
Contents
1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 The Sampling Distribution of a Random Variable . . . . . . . . . . . . . . . . . . 3
1.3 The Likelihood and the Maximum Likelihood Estimator . . . . . . . . . . 5
1.4 Incorporating Prior Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Frequentist or Bayesian? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.7 Appendix: A Short Overview of Quantitative Genomics. . . . . . . . . . . 32
Part II Prediction
6 Fundamentals of Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.1 Best Predictor and Best Linear Predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.2 Estimating the Regression Function in Practice: Least Squares . . . 263
6.3 Overview of Things to Come . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
6.4 The Bias-Variance Trade-Off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
6.5 Estimation of Validation MSE of Prediction in Practice . . . . . . . . . . . 280
6.6 On Average Training MSE Underestimates Validation MSE . . . . . . 284
6.7 Least Squares Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
7 Shrinkage Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
7.1 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
7.2 The Lasso . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
7.3 An Extension of the Lasso: The Elastic Net . . . . . . . . . . . . . . . . . . . . . . . . 319
7.4 Example: Prediction Using Ridge Regression and Lasso . . . . . . . . . . 319
7.5 A Bayesian Spike and Slab Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
8 Digression on Multiple Testing: False Discovery Rates. . . . . . . . . . . . . . . . . 333
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
8.2 Preliminaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 675
Author Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 683
Subject Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 687
Chapter 1
Overview
1.1 Introduction
10. Are there other non-genetic factors that affect the traits, such as smoking
behaviour, alcohol consumption, blood pressure measurements, body mass
index and level of physical exercise?
11. Could the predictive ability of the genetic score be improved by incorporation
of these non-genetic sources of information, either additively or considering
interactions? What is the relative contribution from the different sources of
information?
The first question was the focus of quantitative genetics for many years, long
before the so-called genomic revolution, that is, before breakthroughs in molecular
biology made the sequencing of whole genomes technically and economically
feasible, resulting in hundreds of thousands or millions of genetic markers
(single nucleotide polymorphisms (SNPs)) for each individual in the data set. Until
the end of the twentieth century before dense genetic marker data were available,
genetic variation of a given trait was inferred using resemblance between relatives.
This requires equating the expected proportion of genotypes shared identical
by descent, given a pedigree, with the observed phenotypic correlation between
relatives. The fitted models also retrieve “estimates of random effects”, the predicted
genetic values that act as genetic scores and are used in selection programs of farm
animals and plants.
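The classical approach just described, equating expected genetic resemblance with observed phenotypic resemblance, can be sketched in a few lines of R. In the toy simulation below (all parameter values are invented for illustration), phenotypes follow a simple additive model with phenotypic variance 1, and the regression of offspring phenotype on the midparent average estimates the heritability, here set to 0.5:

```r
set.seed(2)
nfam <- 5000
h2 <- 0.5                                   # true heritability (illustrative value)
# Additive model: phenotype = additive genetic value + environment
a_sire <- rnorm(nfam, 0, sqrt(h2))
a_dam  <- rnorm(nfam, 0, sqrt(h2))
y_sire <- a_sire + rnorm(nfam, 0, sqrt(1 - h2))
y_dam  <- a_dam  + rnorm(nfam, 0, sqrt(1 - h2))
# Offspring genetic value: parental average plus Mendelian sampling
a_off <- 0.5 * (a_sire + a_dam) + rnorm(nfam, 0, sqrt(h2 / 2))
y_off <- a_off + rnorm(nfam, 0, sqrt(1 - h2))

# The expected regression of offspring phenotype on midparent equals h2
midparent <- 0.5 * (y_sire + y_dam)
fit <- lm(y_off ~ midparent)
coef(fit)["midparent"]                      # close to the true value 0.5
```

The slope recovers the heritability because the covariance between offspring and midparent phenotypes is h2/2 while the variance of the midparent average is 1/2, so their ratio is h2.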
Answers to questions 2–7 would provide insight into genetic architecture and
thereby, into the roots of many complex traits and diseases. This has important
practical implications for drug therapies targeted to particular metabolic pathways,
for personalised medicine and for improved prediction. These questions could not
be sensibly addressed before dense marker data became available (perhaps with
the exception provided by complex segregation analysis that allowed searching for
single genes).
Shortly after a timid start where use of low-density genetic marker information
made its appearance, the first decade of the twenty-first century saw the construction
of large biomedical databases that could be accessed for research purposes where
health information was collected. One such database was the British 1958 cohort
study, including medical records from approximately 3000 individuals genotyped
for one million SNPs. These data provided for the first time the opportunity to begin
addressing questions 2–7. However, a problem had to be faced: how to fit and
validate a model with one million unknowns to a few thousand records and how to
find a few promising genetic markers from the million available avoiding a large
proportion of false positives? This resulted in a burst of activity in the fields of
computer science and statistics, leading to development of a methodology designed
to meet the challenges posed by Big Data.
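One of these challenges, finding a few promising markers among many while limiting false positives, is addressed by the Benjamini and Hochberg adjustment treated in Chap. 8, which is available in base R through p.adjust. The sketch below (simulated test statistics; the numbers of markers and the effect size are invented for illustration) flags markers at a nominal false discovery rate of 5% and checks the realised false discovery proportion:

```r
set.seed(3)
m  <- 10000                                  # markers tested
m1 <- 100                                    # markers with a true effect
z  <- c(rnorm(m - m1), rnorm(m1, mean = 4))  # test statistics; true nulls first
pval <- 2 * pnorm(-abs(z))                   # two-sided p-values

# Benjamini-Hochberg controls the expected proportion of false
# discoveries among the markers declared significant
padj <- p.adjust(pval, method = "BH")
discoveries <- which(padj < 0.05)

# Positions 1 to m - m1 are true nulls, so this is the realised
# false discovery proportion among the declared discoveries
fdp <- mean(discoveries <= m - m1)
c(n_discoveries = length(discoveries), fdp = fdp)
```

Chapter 8 develops the frequentist theory behind this procedure as well as its empirical and fully Bayesian counterparts.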
In recent years, the amount of information in modern data sets has grown to
formidable proportions, and the challenges have not diminished. One
example is the UK Biobank that provides a wealth of health information
from half a million UK participants. The database is regularly updated and
a team of scientists recently reported that the complete exome sequence was
completed (about 2% of the genome involved in coding for proteins and