Identifiability and Regression Analysis of Biological Systems Models Statistical and Mathematical Foundations and R Scripts, 2nd Edition Illustrated eBook Download
The idea behind this book was to provide a practical, concise but mathematically
rigorous guide to regression procedures for experimental data. Finding the right
balance between practicality, conciseness, and rigour is the greatest challenge facing
those who teach or do outreach work. It is increasingly necessary for those who
work in data science and those who simply use its methods and results to have
a sufficiently robust theoretical basis to enable a reasoned, conscious, and correct
use of the statistical tools that are applied in data science. I wrote the book
with the needs of students on Master’s courses in Data Science at
universities around the world in mind, as well as those in various scientific
fields who have to process and interpret data (e.g. doctors, biologists, physicists,
sociologists). In these times when Artificial Intelligence techniques are often used
as black boxes for learning from data and then making predictions based on them,
the need for theoretical foundations and clear practical instructions is
becoming ever greater, with statistics becoming an increasingly important part of a
data scientist’s background. Indeed, statistical analysis improves prediction, pattern
analysis, and the interpretation of data and of the conclusions drawn from them.
The two core statistical concepts
that are important in data science are descriptive statistics and inferential statistics.
This book deals with two branches of inferential statistics: (i) the analysis of
identifiability and (ii) regression analysis. The domain of application in which the
techniques of model identifiability and regression analysis are presented is that
of dynamical models in biochemistry and systems biology, but the methodologies
and the computational techniques described herein are also of interest and practical
use in other disciplines.
By definition, a model is identifiable if it is theoretically possible to learn the true
values of its parameters from an infinite number of observations of it. If a model is
identifiable from a given set of experimental data, then there exists a unique set
of parameters returning the observed data. Equivalently, if a model is identifiable,
different values of its parameters must generate different probability distributions of
the observable variables. Regression analysis is a predictive modelling technique
which investigates the causal relationship (expressed as a mathematical model)
between a dependent (target) variable and one or more independent variables (predictors).
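The definition of identifiability given above can be illustrated with a minimal sketch in base R. The toy model y = a·b·x, the parameter values, and the noise level are assumptions chosen only for illustration: two different parameter pairs produce exactly the same output, so a and b cannot be recovered separately, whereas their product k = a·b can be estimated by regression.

```r
# Toy model (assumed for illustration): y = a * b * x.
# Only the product a * b influences the output, so (a, b) is not identifiable.
model <- function(x, a, b) a * b * x

x  <- seq(0, 5, by = 0.5)
y1 <- model(x, a = 2, b = 3)
y2 <- model(x, a = 6, b = 1)

# Two different parameter sets return the same observable values.
identical_outputs <- isTRUE(all.equal(y1, y2))
print(identical_outputs)

# The reparameterised model y = k * x, with k = a * b, is identifiable:
# a regression through the origin recovers k from noisy observations.
set.seed(1)
y_obs <- y1 + rnorm(length(x), sd = 0.1)
fit   <- lm(y_obs ~ 0 + x)
k_hat <- unname(coef(fit)["x"])
print(k_hat)
```

In this sketch the estimate of k lies close to 6, while any pair (a, b) with a·b = 6 fits the data equally well.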
Chemical and biological systems of realistic size and complexity often exhibit
stiff and non-linear dynamics whose parameter identifiability is not guaranteed
and/or for which the most common and most used regression algorithms do
not converge. Consequently, biochemical and biological systems are a suitable
benchmark for identifiability and regression analysis techniques. A unique solution
for the unknown parameters that links any set of inputs to a set of outputs is a critical
requirement for any model-based analysis and may become particularly
hard to obtain for dynamical models of biochemical and, more generally, biological networks.
The size of such systems in terms of the number of interacting agents, the number
and type of interactions among them, the stiffness and non-linearity of their
dynamics, along with a suboptimal sample size of the experimental observations
(due to objective limitations of the experimental investigation of living matter)
challenges the identifiability of putative models. In turn, parameters in the model
that are not identifiable pose challenges during the regression analysis, leading to
both imprecise parameter estimation and misleading conclusions and, ultimately, to
the failure of the modelling process.
The book presents the concepts of complexity of a dynamical system and
knowledge inference (Chap. 1); deterministic and stochastic dynamical models,
stiff dynamical systems, and hybrid stochastic/deterministic simulation algorithms
(Chap. 2); theoretical and algorithmic treatment of observability, identifiability, and
distinguishability of models of complex systems (Chap. 3); the theoretical principles
and the practical formulas of multilinear regression, non-linear regression, robust
and Bayesian regression, along with the methods of predictor selection, regression
diagnostics, and outlier analysis (Chap. 4). As the spread of artificial intelligence
techniques in data science requires an increasing understanding of and familiarity
with regression approaches based on neural networks, this edition includes a
new chapter (Chap. 5) explaining, at a basic introductory level,
the concepts of neural networks and their use for parameter estimation in both
multilinear systems and differential equation systems describing the dynamics of
real-world systems (e.g. physical, biological, chemical, and social systems). The
book also provides R scripts illustrating the implementation of multilinear
regression, unsupervised model selection, non-linear regression,
an example of neural Bayesian regression, and an
example of a neural network for data regression (Chap. 6). Within the chapters, we
also point the reader to other sources (websites, blogs, and posts) in which to find
implementation solutions to regression problems. As in the first edition, at the end of
each chapter we offer a number of practical and theoretical exercises through which
the reader can test his or her understanding of the concepts. Some of the exercises
also invite the reader to go deeper into what is presented in the chapters.
The book is addressed to (i) university students in the last years of their
study courses in scientific disciplines such as chemistry, mathematics, engineering,
physics, (ii) doctoral students in courses in bioinformatics, bioengineering, systems
biology, biophysics, biochemistry, environmental sciences, experimental physics,
numerical analysis, and (iii) researchers, modellers, and practitioners in these
fields. The prerequisites necessary to understand the contents of the book are the
There are many people and work contexts that inspired the content of this book and
made its realization possible. A book is the result of study and cultural exchange
with colleagues, collaborators, and students. I am very grateful to the colleagues
of the Faculty of Computer Engineering of the University of Bolzano-Bozen (Italy)
for their advice and for their outstanding commitment to teaching and dissemination
activities. I thank them for being an example and guide and for creating a pleasant
and productive working environment around me. I also thank my collaborators at
the Department of Medicine (Division of Pathology) of the University of Verona (Italy)
with whom I have worked fruitfully and with great pleasure for several years and
who have helped me to understand the needs of doctors and biologists in the use
and understanding of statistics. I warmly thank my students at the University of
Bolzano-Bozen, as their enthusiasm and their questions have always been
the motivation and inspiration for my work.
Finally, I cannot fail to thank my family, to whom I owe a great deal for always
encouraging me and creating a family environment conducive to study, discussion,
and learning.
Contents
References
Index
Chapter 1
Complex Systems, Data, and Inference
for the analysis of complex systems does not necessarily have to be complex.
The most promising way to implement a mathematical treatment that is practical
and not too complex for users coming from various disciplines is to organize
knowledge on a complex system in a graph or in a hypergraph. In recent years, the
biological sciences have made extensive use of graphs and networks to represent
complex interacting systems composed of sets of genes, proteins, metabolites, and
functional chemical compounds of various natures and functions [1–3]. In a graph-
like representation, the agents are the vertices and the interactions are indicated by
arcs connecting interacting agents. The topology of the graph is usually derived from
qualitative and quantitative experimental observations. This type of representation
implies a new way of investigating a phenomenology, which takes the whole system
into account, and not just its individual components. In fact, a system is not just
a set of components, but two sets of information: the components and their interactions.
The graph that represents it includes both kinds of information: that on the components
and that on their interactions. The graph also facilitates the construction of a
mathematical model of the dynamics of a system since it is a data structure that
can be translated into a set of equations or computational procedures. The use of a
graph or hypergraph representation not only provides a guide for the construction of
the mathematical or computational specification of a model and for the analysis
of its properties, but it also makes it possible to identify the possible controls
of its complexity. A sensitivity and robustness analysis of the graph allows us
to identify the driver nodes of a dynamics, and the cluster of driver nodes of
stochastic/deterministic hybrid dynamics due to stiffness.
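As a minimal sketch of how a graph, seen as a data structure, can be translated into a set of equations, the following base R fragment turns a list of arcs into the right-hand side of a system of differential equations. The three-vertex chain A → B → C, the first-order kinetics, the rate constants, and the names `arcs`, `N`, and `rhs` are all assumptions made for illustration.

```r
# Hypothetical three-vertex graph: A -> B -> C, one rate constant per arc.
arcs <- data.frame(
  from = c("A", "B"),
  to   = c("B", "C"),
  k    = c(0.5, 0.2),          # assumed first-order rate constants
  stringsAsFactors = FALSE
)
vertices <- unique(c(arcs$from, arcs$to))

# Signed incidence matrix: -1 where an arc leaves a vertex, +1 where it enters.
N <- matrix(0, nrow = length(vertices), ncol = nrow(arcs),
            dimnames = list(vertices, paste0("arc", seq_len(nrow(arcs)))))
N[cbind(match(arcs$from, vertices), seq_len(nrow(arcs)))] <- -1
N[cbind(match(arcs$to,   vertices), seq_len(nrow(arcs)))] <-  1

# Right-hand side of dx/dt = N v(x), with flux v_j = k_j * x[source of arc j].
rhs <- function(x) {
  v <- arcs$k * x[arcs$from]
  drop(N %*% v)
}

x0  <- c(A = 1, B = 0, C = 0)  # assumed initial amounts
dx0 <- rhs(x0)
print(dx0)
```

Because every arc removes from its source exactly what it adds to its target, the components of the derivative sum to zero, which is a quick consistency check on the translation from graph to equations.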
Frequently the terms graph and network are used interchangeably, although their
meanings are very different. We will not give formal mathematical definitions of
“graph” and “network” here since they can be found in numerous good books and
articles in the literature. Instead, we highlight the differences between
graph and network from the point of view of the processes that we want to represent
graphically and mathematically. To emphasize from the outset that graph and network are
two different objects, we will use the terms vertex and arc when we talk about graphs
and node and edge when we talk about networks.
Graphs are combinatorial models representing relationships (arcs) between
certain agents (vertices). In biology, the vertices typically describe proteins,
metabolites, genes, or other molecular complexes, whereas the arcs represent functional
relationships or interactions between the vertices such as “activates”, “binds to”,
“catalyses”, or “is converted to” [4]. Furthermore, very often the activating action
performed by a vertex is represented as an arc coming out of the vertex and pointing
to another arc.
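A graph with labelled arcs of this kind can be stored, for instance, as a table of arcs from which an adjacency matrix is derived. The sketch below uses base R only; the vertex names and the relations assigned to them are hypothetical.

```r
# Hypothetical labelled arcs: each row is one directed arc between two vertices.
arcs <- data.frame(
  from     = c("proteinA", "proteinB", "metaboliteM"),
  to       = c("proteinB", "proteinC", "metaboliteN"),
  relation = c("activates", "binds to", "is converted to"),
  stringsAsFactors = FALSE
)

vertices <- sort(unique(c(arcs$from, arcs$to)))

# Adjacency matrix: adj[i, j] = 1 if there is an arc from vertex i to vertex j.
adj <- matrix(0L, nrow = length(vertices), ncol = length(vertices),
              dimnames = list(vertices, vertices))
adj[cbind(arcs$from, arcs$to)] <- 1L

out_degree <- rowSums(adj)   # number of arcs leaving each vertex
print(adj)
print(out_degree)
```

Note that each row of `arcs` relates exactly two vertices; an arc pointing to another arc has no place in this structure.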
1.2 Biological Systems as Graphs and Hypergraphs
Fig. 1.1 Multi-reagent binding reactions, as well as the creation of multiple reaction products, are
not representable as graphs, where only bilateral relations between nodes are contemplated
In a graph, every arc connects two vertices, and there are no arcs pointing to
other arcs. Many biological processes, however, are characterized by more than
two participating partners. Klamt et al. [4] give as an example a metabolic reaction
involving four species such as A + B → C + D or a protein complex consisting of
more than two proteins. Hence, such physico-chemical interactions between biological
entities are not amenable to a graph-like representation. As illustrated in Fig. 1.1,¹
an attempt to provide a graph-like representation may cause a loss of information
that can lead to wrong interpretations afterwards. A hypergraph is a generalization of
a graph that helps to overcome such conceptual limitations [6]. For this reason, many
databases and interaction storage formats support hyperedges of different types,
either explicitly or implicitly [6–8]. In a hypergraph an arc can join any number of
vertices. What is commonly called a “network” is indeed a hypergraph. Klamt et al.
[4] noted that although hypergraphs occur ubiquitously when dealing with cellular
networks, their notion is less known than that of graphs. This causes a suboptimal
use of the expressive potential of hypergraphs. In the online Encyclopedia of
Mathematics [9], we learn that a hypergraph is defined by a set of vertices V and a
set of arcs that are defined by subsets of vertices. We learn also that “a hypergraph
may be represented in a plane by identifying its nodes with points of the plane and
by identifying the edges with connected domains containing the vertices incident
with these edges”. For example, it is possible to represent a hypergraph H with set
of nodes
$V = \{v_1, v_2, v_3, v_4\}$
¹ The clip-art objects of “Thinking man” are taken from the free image databases publicly
Fig. 1.2 An example of a hypergraph. A hypergraph may be represented by a bipartite graph, and
conversely
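The correspondence between a hypergraph and a bipartite graph can be sketched in base R as follows. The vertex set is the four-element set V used above; the three hyperedges `e1`, `e2`, `e3` are hypothetical and are not claimed to be those of the hypergraph H shown in the figure.

```r
# Vertex set as in the text; the hyperedges below are assumed for illustration.
V <- c("v1", "v2", "v3", "v4")
E <- list(
  e1 = c("v1", "v2", "v3"),
  e2 = c("v2", "v3"),
  e3 = c("v3", "v4")
)
stopifnot(all(unlist(E) %in% V))   # every hyperedge is a subset of V

# Incidence matrix: rows are vertices, columns are hyperedges.
incidence <- sapply(E, function(e) as.integer(V %in% e))
rownames(incidence) <- V
print(incidence)

# Bipartite representation: one node set for vertices, one for hyperedges,
# and a link whenever a vertex belongs to a hyperedge.
bipartite <- data.frame(
  vertex    = unlist(E, use.names = FALSE),
  hyperedge = rep(names(E), lengths(E)),
  stringsAsFactors = FALSE
)

# Conversely, the hyperedges are recovered by grouping the bipartite links.
E_back <- split(bipartite$vertex, bipartite$hyperedge)
```

Each column of `incidence` may contain any number of ones, which is what distinguishes a hyperedge from the arc of a graph, whose column would contain exactly two.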
Table 1.1 An example of an R script to build and visualize the hypergraph H of Fig. 1.1
library(hypergraph)  # classes for hypergraphs and hyperedges
library(hyperdraw)   # drawing of hypergraphs
Fig. 1.3 The hypergraph H obtained from the code in Table 1.1
$r_1 R_1 + r_2 R_2 + \dots \longrightarrow p_1 P_1 + p_2 P_2 + \dots$
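One way to hold a reaction of this general form in code is as a directed hyperedge whose tail carries the reactants with their coefficients and whose head carries the products with theirs. The following base R sketch is an illustration under stated assumptions: the helper names `make_reaction` and `stoich_vector` and the example reaction A + 2B → C are hypothetical.

```r
# A reaction as a weighted directed hyperedge (hypothetical helper):
# 'tail' holds reactants, 'head' holds products, values are the coefficients.
make_reaction <- function(reactants, products) {
  list(tail = reactants, head = products)
}

# Assumed example reaction: A + 2 B -> C
rxn <- make_reaction(reactants = c(A = 1, B = 2), products = c(C = 1))

# Net stoichiometric vector over a given list of species:
# negative entries for consumed species, positive entries for produced ones.
stoich_vector <- function(rxn, species) {
  s <- setNames(numeric(length(species)), species)
  s[names(rxn$tail)] <- s[names(rxn$tail)] - rxn$tail
  s[names(rxn$head)] <- s[names(rxn$head)] + rxn$head
  s
}

s <- stoich_vector(rxn, species = c("A", "B", "C"))
print(s)
```

Stacking such vectors column by column for a set of reactions gives a stoichiometric matrix, which is the weighted incidence matrix of the corresponding directed hypergraph.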
Temkin et al. [11] showed that a chemical reaction can be described as a weighted
directed hyperedge in a directed hypergraph where nodes are the chemicals and
hyperedges are the reactions. However, Estrada et al. [12] noted that the lack of
a mature, well-founded theory for the structural analysis of directed hypergraphs
has led to the coexistence of two alternative, commonly used representations of a
chemical reaction. In the first representation, a chemical reaction is modelled as
a bipartite graph, in which one set of nodes represents the reactants and products
and the other set represents the reactions themselves. The other representation is
the substrate graph, in which reactants and products are nodes, and two nodes
are connected if the corresponding chemical compounds take part in the same
reaction. As sets of chemical reactions, metabolic pathways can be represented in the
form of hypergraphs as well. To give an example of a metabolic pathway
modelled as a hypergraph, we consider the amphibolic pathway of the citric acid
cycle (Krebs cycle) [13–15], involving the set of reactions reported in Table 1.2 and
converted into a graph structure by the script in Table 1.3. We then present in
Tables 1.4 and 1.5 two R scripts that can be used to generate and visualize the
hypergraph of the 25 reactions of the Krebs cycle, whereas in Table 1.6 we present
the R script to generate the hypergraph of a bimolecular reaction. Although the
network of the citric acid cycle considered in this example has only 25 reactions and
24 nodes, its graphical representation as a hypergraph turns out to be complex and not
immediately understandable, especially compared with the graph representation in
Fig. 1.4 (obtained with the R script in Table 1.4). However, the graph is missing
important information, for instance, about citrate formation, which occurs through
the reaction
Table 1.2 The citric acid cycle, also known as the Krebs cycle, is amphibolic. An amphibolic pathway
is both anabolic and catabolic in its functions, i.e. it functions in both degradative (catabolic)
and biosynthetic (anabolic) reactions (the Greek prefix “amphi” means “both”). The citric acid
cycle is a series of reactions that degrade acetyl co-enzyme A to yield carbon dioxide and energy
[13–15]
Reaction                                   Reaction index
Pyruvate → Acetyl-CoA                      R1
Acetyl-CoA → Oxaloacetate                  R2
Oxaloacetate → Citrate                     R3
Citrate → Cis-aconitate                    R4
Cis-aconitate → Isocitrate                 R5
Isocitrate → Oxalosuccinate                R6
Oxalosuccinate → alpha-ketoglutarate       R7
alpha-ketoglutarate → Succinyl-CoA         R8
Succinyl-CoA → Succinate                   R9
Succinate → Fumarate                       R10
Fumarate → Malate                          R11
Malate → Oxaloacetate                      R12
Malate → Glucose                           R13
Citrate → Cholesterol                      R14
Citrate → Fatty-Acids                      R15
Amino-acids → alpha-ketoglutarate          R16
alpha-ketoglutarate → Amino-acids          R17
Odd_Chains-Fatty-Acids → Succinyl-CoA      R18
Isoleucine → Succinyl-CoA                  R19
Methionine → Succinyl-CoA                  R20
Valine → Succinyl-CoA                      R21
Succinyl-CoA → Porphyrins                  R22
Aspartate → Malate                         R23
Phenylalanine → Malate                     R24
Tyrosine → Malate                          R25
Oxalosuccinate → Amino-acids               R26
Amino-acids → Oxalosuccinate               R27
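To make the difference between the graph and hypergraph views of Table 1.2 concrete, the following base R sketch builds the arcs of reactions R1 to R12 only, derives the bipartite (compound/reaction) representation, and compares the arcs entering citrate with a single directed hyperedge for citrate formation. It is not the script of Tables 1.3 to 1.6. The hyperedge Acetyl-CoA + Oxaloacetate → Citrate is a simplified textbook form (water and free coenzyme A omitted) assumed here for illustration, not taken from the table.

```r
# Compounds of reactions R1-R12 of Table 1.2, in order of appearance.
cycle <- c("Pyruvate", "Acetyl-CoA", "Oxaloacetate", "Citrate",
           "Cis-aconitate", "Isocitrate", "Oxalosuccinate",
           "alpha-ketoglutarate", "Succinyl-CoA", "Succinate",
           "Fumarate", "Malate")

# One arc per reaction; R12 closes the cycle (Malate -> Oxaloacetate).
arcs <- data.frame(
  from  = cycle,
  to    = c(cycle[-1], "Oxaloacetate"),
  index = paste0("R", 1:12),
  stringsAsFactors = FALSE
)

# Number of arcs entering each compound.
in_degree <- vapply(cycle, function(s) sum(arcs$to == s), integer(1))

# Bipartite representation: compound -> reaction and reaction -> compound links.
bipartite <- rbind(
  data.frame(from = arcs$from,  to = arcs$index, stringsAsFactors = FALSE),
  data.frame(from = arcs$index, to = arcs$to,    stringsAsFactors = FALSE)
)

# Assumed simplified hyperedge for citrate formation (two joint reactants).
citrate_formation <- list(tail = c("Acetyl-CoA", "Oxaloacetate"),
                          head = "Citrate")

# Reactants of the hyperedge that the graph does not link directly to citrate.
graph_sources    <- arcs$from[arcs$to == citrate_formation$head]
missing_in_graph <- setdiff(citrate_formation$tail, graph_sources)
print(missing_in_graph)
```

In the graph, acetyl-CoA and oxaloacetate appear in two separate arcs (R2 and R3), so the fact that both are needed together to form citrate is visible only in the hyperedge.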
first organization, the nodes of the network represent proteins and an edge links two
proteins that interact with each other.
Estrada et al. [16] noted that the characterization of multi-protein complexes in
the whole proteome of an organism requires that the data be organized in lists
of protein membership in protein complexes. Such a list is usually represented in
two ways. The first is the protein-protein interaction network in which the nodes
represent proteins and an edge links two proteins that interact with each other. This
representation, however, does not take into account the multi-protein complexes
[16]. The second way is an intersection graph, whose nodes represent complexes,
and a link exists between two nodes (complexes) if they have one or more proteins