Bioinformatics Notes 2020-2021
September 9, 2021
Contents
6 Systems Biology
6.1 From Network Biology to Physiology and Systems Biology
6.1.1 From omics data to circuits of interacting molecules
6.1.2 Unambiguous graphical representation of biological circuits
6.2 Mathematical modeling of biological systems
6.2.1 Mathematical representations and formalisms
6.3 Examples of mathematical models in biology
6.4 Noise in Biology
CHAPTER 1
Introduction and Resources
sential because the only possibility for researchers not to get lost in the data
and to actually be able to extract information from it was to use computers
and develop specific informatics tools and methods that could be applied to
molecular biology data.
A tremendous amount of human and clinical data has accumulated in
the last decades. This created an additional regulatory problem. The use
of these data falls under strict regulations because of confidentiality issues,
and its bioinformatics analysis often has to deal with strong computational
overheads that hinder generating knowledge from the data.
Over the long run, the second reason may also have far-reaching practical
implications.
An ontology extends the concept of classification by complementing the
“is a” relationship with other possibilities, so any classification is also an
ontology. For instance, we could take the species relationship defined above
and extend it with an “is part of” relationship. Now we can say that a head is
part of an insect, and do the same for the thorax, the abdomen, and six legs. Our
computer will then immediately know that an ant has six legs, and so does a fly.
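To make this concrete, the sketch below (in Python, with made-up class and relation names rather than entries from any real ontology) encodes the “is a” and “is part of” relationships described above and shows how a computer can infer that an ant, and also a fly, has six legs.

# Minimal sketch of "is a" / "part of" inference over a toy ontology.
# The entity and relation names are illustrative, not taken from a real ontology.

is_a = {
    "ant": "insect",
    "fly": "insect",
}

part_of = {
    "head": "insect",
    "thorax": "insect",
    "abdomen": "insect",
    "six legs": "insect",
}

def parts(entity):
    """Collect everything that is 'part of' an entity, following 'is a' links upward."""
    found = set()
    current = entity
    while current is not None:
        found.update(part for part, whole in part_of.items() if whole == current)
        current = is_a.get(current)  # climb the "is a" hierarchy
    return found

print(parts("ant"))  # {'head', 'thorax', 'abdomen', 'six legs'}
print(parts("fly"))  # the same parts are inherited through the "is a" relationship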
For the purposes of this course we will use “classification” and “ontology”
interchangeably, unless we are explicitly talking about their differences.
As an example of an ontology and its usefulness for biological research we
have the Gene Ontology (GO). GO is a controlled vocabulary with 3 categories:
Cellular component (membrane, cytoplasm, etc.)
Molecular function (kinase, oxidoreductase, etc.)
Biological process (DNA repair, cell cycle, etc.)
Each of the categories contains various entities, and the two types of relationships
mentioned above exist between them. A specific instance of
data may either be an entity (the cell membrane is a membrane) or a part of
another entity (the cell membrane is part of the cell).
For example, consider the human gene BRCA1. This gene is associated
with chromosomes, localized in the nucleus, and involved in chromosome
replication. If an ortholog of this gene is found in another animal, then one
can infer that that ortholog has a high probability of being localized in the
same compartments or involved in similar biological functions. This enables
an initial and automated annotation of the new gene that should obviously
be curated later. The GO is a good start to organize information regarding
the function of genes in an organism. However, it is far from complete and
if more specific functional information is required, GO is not the appropri-
ate classification to use. If one is interested in specific aspects of biological
systems, then it might be more appropriate to use classifications that are
developed for those aspects.
If you are interested in metabolism, a more appropriate functional classification
might be the EC classification developed specifically for
enzymes, the EC numbers. There are six classes of enzymes, each with their
own subclasses, and so on. The classification is hierarchical and
has four levels, so every enzyme is identified by four numbers separated by
points, one number per level. For example, hexokinase is EC 2.7.1.1: class 2
(transferases), subclass 7 (transfer of phosphorus-containing groups), sub-subclass 1
(alcohol group as acceptor), and serial number 1 within that sub-subclass. This
results in signatures similar to the numbers assigned to the books in a library
using the Dewey Decimal Classification system. The best repository for enzyme
information is BRENDA.
In simplified terms, the semantic web aims to annotate the information in web
pages to include the meaning of the data elements in machine-readable form.
When you open a web page in your browser, for example in Safari or
Chrome, you see a human-readable display of content and information.
However, for the browser to know how to display the content of the page,
that content is tagged. The way these tags work is very straightforward.
By indicating breaks, paragraphs, and display styles, they allow your browser
to format and show you content in a way that is, hopefully, pleasant and
easy for humans to read. The semantic web takes this concept a step further,
introducing additional tags to the content of web pages. These tags
relay semantic information regarding how the various parts of the web page
are related amongst themselves, permitting the application of simple rules to
automatically analyze web content. Taking again an example where before
you only had a sentence (say “The cat is ugly”), now you can tag the various
elements of that sentence and automatically infer how they are related to
each other: <article>The</article> <noun>cat</noun> <verb>is</verb> <adjective>ugly</adjective>.
This example has to do with language analysis. However, it can translate
directly to biological relationships, through the use of tags that are related to
biological content, function, and relationships. This emphasizes the impor-
tance of universal classifications and ontologies in the semantic web context.
The picture discussed above might be considered by some as too simplis-
tic, as is almost always the case in introductory discussions about a subject.
Why is that? Because it assumes rigid rules that always apply and makes
no concessions to uncertainty and error in creating and storing information.
However, there are now artificial intelligence and machine learning meth-
ods that can be applied to infer fuzzy relationships between entities in a
way that attempts to mimic human inference. These methods can also be
used to recognize entities described and/or tagged with a certain amount
of error associated to them. In broad terms, these methods work in the
following way. First, they require having a “golden set” of data containing
a large amount of data, where entities and relationships between them are
fully well annotated. This set is then usually divided into a training set and
a test set. The training set is used by the methods to infer statistical and/or
morphological characteristics of the entities and their relationships. These
characteristics (for example chemical names have a usage frequency of “(”
or “-” that is very different from that observed in common words and they
have terminations that are quite specific, as is the case of methyl- or propyl-)
are transformed into mathematical constructs. These constructs are used to
calculate parameters that permit the method to identify other entities with
the same morphology and/or frequency of usage of various types of marks
and signs. Once these parameters are calculated, they create a mathematical/statistical
model of how the entities and relationships might be. This
model can then be used to analyze other datasets in the hope of inferring
entities and relationships in those new datasets. However, before one can do
so, the model must be tested to see if it really learned how to identify new en-
tities and/or relationships, rather than just having memorized the entities in
the training set. This can be done by applying the model to identify the en-
tities/relationships in the test set that was derived from the gold standard
dataset. If the method is able to correctly identify the entities/relationships
in this set, then it can be used in other data sets. If not, then we know that
the method memorized the information in the training set but did not learn
the rules for identifying new entities/relationships.
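As an illustration of this training/testing workflow, the sketch below uses a tiny, made-up “gold standard” of labelled words and a single hand-rolled feature (the frequency of digits, hyphens, commas, and parentheses) in place of a real machine learning method. The point is only to show how the training set fixes the model's parameters and how the test set checks that the model generalizes to unseen examples.

# Toy version of the "gold standard -> training set + test set" workflow.
# Words, labels, and the single feature are made up for illustration; a real system
# would use a large annotated corpus and a proper machine-learning library.

chemicals = ["2-methylpropane", "N-acetyl-glucosamine", "methyl-salicylate",
             "propyl-gallate", "1,3-bisphosphoglycerate", "beta-D-glucose"]
common = ["house", "running", "beautiful", "computer", "analysis", "yellow"]

def mark_fraction(word):
    """Fraction of characters that are digits, hyphens, commas or parentheses."""
    return sum(ch.isdigit() or ch in "-()," for ch in word) / len(word)

# Split the gold standard into a training set and a test set (stratified by class).
train = [(w, 1) for w in chemicals[:4]] + [(w, 0) for w in common[:4]]
test = [(w, 1) for w in chemicals[4:]] + [(w, 0) for w in common[4:]]

# "Training": place a threshold halfway between the mean feature value of each class.
chem_mean = sum(mark_fraction(w) for w, y in train if y == 1) / 4
plain_mean = sum(mark_fraction(w) for w, y in train if y == 0) / 4
threshold = (chem_mean + plain_mean) / 2

# "Testing": check whether the learned rule generalizes to words it has never seen.
correct = sum((mark_fraction(w) > threshold) == bool(y) for w, y in test)
print(f"threshold = {threshold:.3f}, test accuracy = {correct}/{len(test)}")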
Taken together, the development of a mature semantic web, the applica-
tion of biological classifications and ontologies, and the creation of accurate
and fast machine learning/artificial intelligence methods may open a new
era for biologists in the future. The conjunction of the three might enable
automated, accurate, and fast information extraction and data integration
from the large number of datasets that are accumulating in molecular and
cellular biology.
In addition, artificial intelligence methods can now be trained to orga-
nize, integrate, and analyze data. These methods could, in principle, become
a layer of data integration and analysis that would tremendously facilitate
our handling of biological and clinical data to extract information from them.
Many databases of molecular information exist and are freely available online.
While there were, for a long time, problems with data integration between
databases and servers, resources such as NCBI, EBI, or UniProt provide
raw data stored in continuously updated databases, as well as web servers that
integrate individual tools to mine the information contained in those data.
These tools can perform functions that go from comparing sequences to ana-
lyzing gene expression, predicting protein structure or function, and creating
mathematical or statistical models. Currently, many of these databases are
linked and data integration and homogenization is much less problematic
than it was a few years ago.
Many of these tools are also available for individual download and usage
outside of the web server. More advanced users can create pipelines or
workflows in which these tools are used in succession, each processing the
output of the previous one, to create multidimensional analyses of complex
biological datasets. Such pipelines/workflows can be created using platforms
such as Galaxy or Taverna. These platforms allow users to plug in programs
that perform the individual steps of the workflow in lego-like fashion and
pipe the outputs between those programs. In order to do so efficiently,
users should carefully consider what they want to do and spend the time to
plan the workflow before building it.
1.6 File Formats
1.6.1 FASTA
The FASTA format is used to store nucleotide or amino acid sequences. A
FASTA file begins with a description line followed by the sequence itself. For
instance:
>gi|186681228|ref|YP_001864424.1| phycoerythrobilin
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
The description line begins with the greater-than symbol “>” and the
sequence uses one-letter codes. The sequences are case insensitive, and
gaps in the sequence are represented by a single hyphen regardless of their
length. In nucleotide sequences, the bases are represented by the usual codes:
A, C, G, T, and U. Unknown nucleotides are represented by the letter N.
The format accepts other codes for purines, pyrimidines, etc. In sequences
of amino acids, unknown residues are represented by the letter X, end of
translation by an asterisk (*), and each of the amino acids is represented by
its standard single-letter abbreviation. Single-letter abbreviations were defined
by one of the pioneers of bioinformatics, Dr Margaret O. Dayhoff. In order
to make the one-letter codes as easy to remember as possible, Dayhoff chose
them according to the following logic:
The amino acids cysteine, histidine, isoleucine, methionine, serine, and valine
take the first letter of their names, since no other amino acid starts
with the same letter.
In cases where two or more amino acids start with the same letter, the letter is
used for the most abundant one. Thus alanine, glycine, leucine, proline,
and threonine are coded by their initial.
After the assignments above, there were not many letters left, so
Dayhoff assigned K to lysine, since K is close to L in the alphabet. For
the pairs aspartate/asparagine and glutamate/glutamine, Dayhoff assigned
the letters closer to the beginning of the alphabet to the smaller
molecules, aspartate (D) and asparagine (N), and those closer to the end
to the bigger ones, glutamate (E) and glutamine (Q).
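As a practical aside, the sketch below shows a minimal way of reading a FASTA file into memory in Python. The file name is hypothetical, and real projects typically rely on an established parser such as Biopython's SeqIO rather than hand-rolled code.

# A minimal FASTA parser sketch.
def read_fasta(path):
    """Return a dict mapping each description line (without '>') to its sequence."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}

# Example usage, assuming the record shown above was saved in "example.fasta":
# seqs = read_fasta("example.fasta")
# for name, seq in seqs.items():
#     print(name, len(seq))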
1.6.2 FASTQ
The FASTQ format is used to report sequencing results and normally uses
four lines per sequence, for example:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
The first line begins with a “@” character and is followed by a sequence
identifier and an optional description.
The final line encodes the quality values for the sequence in Line 2, and
must contain the same number of symbols as letters in the sequence.
Each symbol in the last line encodes the quality of the corresponding
position in the sequence; the symbols follow the ASCII character order, from the
lowest quality (!) to the highest (~).
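The quality characters can be converted to Phred scores by subtracting a fixed offset from their ASCII codes. The sketch below assumes the common Sanger/Illumina 1.8+ encoding (offset 33); older encodings use a different offset.

# Decoding a FASTQ quality line (Phred+33 encoding assumed).
def phred_scores(quality_line, offset=33):
    """Convert each quality character into a Phred quality score."""
    return [ord(ch) - offset for ch in quality_line]

def error_probability(q):
    """A Phred score Q corresponds to an error probability of 10**(-Q/10)."""
    return 10 ** (-q / 10)

qual = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred_scores(qual)
print(scores[:5])                      # the lowest quality character '!' maps to 0
print(error_probability(scores[-1]))   # higher scores mean smaller error probabilities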
1.7 Outlook
While this is an introduction to bioinformatics, it is important to have some
perspective regarding how professional bioinformaticians work. Professional
bioinformaticians need to ensure that their pipelines and code work repro-
ducibly over time and across datasets. A pipeline needs to work both on last
year's datasets and on those produced in the years to come.
CHAPTER 2
Bioinformatics with DNA
Due to these sizes, classical DNA sequencing methods were not fit for use
in sequencing full length chromosomal DNA. After several attempts, whole
genome shotgun sequencing became the de facto standard method for full
genome sequencing. This method works as follows.
1. DNA is isolated and randomly broken into small pieces of equal length.
Since the sample contains more than one copy of the genome, each
sequence appears in several fragments. The number of repetitions is
known as coverage. For example, in a sample that contains 10 genomes,
the coverage of the sequencing is 10X. The pieces must be smaller than
about 1000 bps.
largest known genome at a whopping 670 Gbp. These numbers are now disputed.
The limitations of this approach are two-fold. First, if the coverage is low, there
might be gaps in the reads and the whole genome cannot be assembled. Second,
even if the coverage is high, long repeats may prevent the assembly of the
sequences into individual chromosomes. There are, however, situations where
sequencing and assembling whole genomes is either not needed or not possible.
Whole genome sequencing is often not needed when a (set of) reference
genome(s) is available for the organism of interest. In such a case, one can use
a technique that cuts the genome you want to sequence into smaller pieces,
followed by hybridization of those pieces to a corresponding set of pieces from
the reference genome. The pieces from the reference genome that do not fully
hybridize with the genome you want to sequence identify regions of the latter
where variation against the reference exists. Thus, you isolate those pieces
from the new genome, sequence only them, and replace the corresponding
sequences in the reference genome.
In addition, there are approaches to genome sequencing that do not re-
quire whole genome assembly. For example, when one is interested in identi-
fying the parts of the genome that can be expressed, then one can isolate the
full mRNA complement of the cell and sequence that complement. In general
this will not identify the parts of the chromosomes that are not expressed,
and one ends up with a library of ESTs (expressed sequence tags). Another
major example of a situation where whole genome assembly might be impossible
is when sequencing a metagenome. Metagenomes are sequenced
by isolating the DNA or mRNA from environmental samples, followed by
sequencing of that material. In most cases this means that one cannot accurately
attribute a given gene to a specific organism, although this is not always
the case. Using the statistical properties of DNA/mRNA (see below, section 2.2)
one can often identify the genus of the organism to which the gene belongs.
Manual annotation is the only foolproof way to annotate genomes. However, this is
impractical and cannot be done for entire genomes. Hence, computational methods
are required in order to make sense out of genome sequences.
There are two general approaches to predict where a gene is in a newly
sequenced genome: homology and ab initio approaches.
Substitution matrices
Margaret Dayhoff was the first person to calculate these matrices. She did
so in the following way. She collected and aligned sequences of proteins with
similar function (orthologs). Then, she analyzed the alignments and calculated
the probabilities of a given residue being replaced by another in those sequences.
These probabilities were then transformed into log-likelihood scores
and used to create programs that align sequences more quickly. BLAST and
HMMER use different types of substitution matrices.
BLAST takes a bulk approach, based on overall residue substitution ma-
trices. These matrices are similar to those calculated by Dayhoff. An example
is shown in slide 39 of Class 2.1 powerpoint. It is easy to see that residues that
align with themselves have high log likelihoods (look at the diagonal elements
of the matrix), while residues that align with other residues that have very
different properties have negative log likelihoods for replacing each other
during evolution. How are these matrices used to estimate if an alignment
between two sequences is of high quality? An example can be seen in slide 40
of the same powerpoint. If you have the
alignment, you use the matrix to calculate the score of that alignment in the
following way:
3. Sum all scores for each position in the alignment. The total score
reflects the overall similarity of the two sequences. The higher the
global score, the better the alignment.
Now that the use of these matrices to estimate the quality of an alignment
is hopefully clear, let us look at how we can calculate these matrices.
Start with a multiple alignment. Calculate the frequency of each residue
type in that alignment (represented by q_i). Calculate the frequency p_ij of
substitution of residue type i by residue type j. The log-odds score is given by
log(p_ij / (q_i q_j)). The type of substitution matrix that one calculates depends on
the multiple alignment one uses. For example, PAM (point accepted mutation)
matrices are calculated from alignments of closely related proteins and are a
reasonable model for that type of sequence. BLOSUM N matrices are calculated
from blocks of aligned sequences that are at most N% identical. In addition, while
PAM matrices calculate substitution frequencies over the entire alignment,
BLOSUM matrices do so over ungapped blocks of the alignment. Other types of
substitution matrices exist and the choice of the appropriate one is case specific.
The main limitation of using these matrices to score alignments is that they
assume a constant rate of evolution over the entire range of the sequences and
disregard any long-range effects of residues on the conservation of the sequences.
Nevertheless, they work quite well in aligning sequences that are longer than 100
residues and have a sequence identity higher than 30%.
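The sketch below illustrates how such a matrix is used to score an alignment position by position. The matrix values and the gap penalty are illustrative placeholders, not entries from a real PAM or BLOSUM matrix; in practice one would load, for example, BLOSUM62 through Biopython's Bio.Align.substitution_matrices module.

# Scoring an alignment with a (toy) substitution matrix.
toy_matrix = {
    ("A", "A"): 4, ("L", "L"): 4, ("K", "K"): 5, ("D", "D"): 6,
    ("A", "L"): -1, ("A", "K"): -1, ("A", "D"): -2,
    ("L", "K"): -2, ("L", "D"): -3, ("K", "D"): -1,
}

def pair_score(a, b):
    """Substitution matrices are symmetric, so look the pair up in either order."""
    return toy_matrix.get((a, b), toy_matrix.get((b, a), 0))

def alignment_score(seq1, seq2, gap_penalty=-5):
    """Sum the per-position scores; gapped positions ('-') get a fixed penalty."""
    assert len(seq1) == len(seq2), "aligned sequences must have the same length"
    total = 0
    for a, b in zip(seq1, seq2):
        total += gap_penalty if "-" in (a, b) else pair_score(a, b)
    return total

print(alignment_score("ALKD", "ALKD"))   # identical residues give a high score
print(alignment_score("ALKD", "AL-D"))   # a gap lowers the total score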
BLAST
The most widely used program to perform alignment of new genome se-
quences to sequences of genes that are present in databases is BLAST. This
program is very fast and is able to compare a given sequence to a database
of sequences in three stages.
1. The first stage breaks the sequences into short words and scores them:
(a) Divide all sequences in the database into small n-letter words (n = 6
by default).
(b) Divide the query sequence into small n-letter words (n = 6 by default).
(c) Set a minimum score Tmin below which an alignment between
two n-letter words is discarded.
(d) Compare all words from the query sequence to all words from
the database sequences. Discard those database words that score
below Tmin.
2. The second stage helps BLAST discard proteins from the database to
which the query sequence should not be aligned. It does so in the
following way:
(a) Collect all database n-letter words that scored above Tmin in stage
1.
(b) Identify proteins that do not contain any of these words.
(c) Discard those proteins.
3. The third and final stage of the alignment simply takes the words that
match between the query sequence and the database and extends the
alignment, using the substitution matrix of choice. If extending the
alignment decreases the score below a certain limit, the subject sequence
is discarded. Alignment extension is done locally, as global
optimization would be too slow. Once the alignments are finished,
they are ranked and ordered by BLAST.
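The sketch below illustrates the word-based seeding idea behind the first two stages, using a tiny made-up “database” and exact word matches only; real BLAST also keeps inexact words whose substitution-matrix score against a query word is above Tmin, a step omitted here for brevity.

# Word-based seeding, in the spirit of BLAST's first two stages.
def words(seq, n=3):
    """All overlapping n-letter words of a sequence."""
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

database = {
    "protA": "MLKDEAGHRI",
    "protB": "TTTWWYYPQS",
    "protC": "AAGHRILMNE",
}
query = "DEAGHR"

query_words = words(query)
candidates = [name for name, seq in database.items() if words(seq) & query_words]
print(candidates)   # only sequences sharing a word with the query survive the filter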
There are several metrics that BLAST can use to rank the similarity
between the subject sequence and the query sequence. The most commonly used
metrics are the total score and the e-value of the alignment. The total score is calculated
as described above, by adding the scores of the individual positions. This
total score can be transformed into other metrics, such as the bit score, that
we won’t discuss here. The e-value measures the number of times you would
expect to find simply by chance an alignment between the query sequence
and a sequence from the database with the score you have. The e-value is
therefore a measure of statistical significance: the lower it is, the less likely
it is that the alignment arose by chance.
Signal sensor methods look into the sequence to identify known signals
that are associated with genes. For example, a very simple signal method might
go through the entire DNA sequence, identify all ATG start codons
followed by TGA stop codons at a distance larger than a
given number of codons, and call those genes. More sophisticated methods
would also look for Shine-Dalgarno sequences somewhere close to the ATG, as
these represent preferential ribosome binding sites. Promoter and regulatory
binding sites can also be used to improve the gene model, as can many other
signals.
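A minimal sketch of such a signal sensor is shown below: it scans one strand of a DNA sequence, one reading frame at a time, for an ATG followed in-frame by a stop codon at least a minimum number of codons away. The example sequence and the length cutoff are made up for illustration; real gene finders also use Shine-Dalgarno motifs, codon usage statistics, and both strands.

# A naive ORF-based gene caller.
STOP = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=30):
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
        start = None
        for idx, codon in enumerate(codons):
            if codon == "ATG" and start is None:
                start = idx                       # remember the first in-frame start
            elif codon in STOP and start is not None:
                if idx - start >= min_codons:     # long enough to be called a gene
                    orfs.append((frame + 3 * start, frame + 3 * (idx + 1)))
                start = None
    return orfs   # list of (start, end) coordinates on the given strand

example = "CCATGAAA" + "GCT" * 40 + "TGACC"
print(find_orfs(example))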
Finally, let us add some considerations about predicting genes that do
not code for proteins. RNA genes may be a throwback to the RNA world,
which is likely to be the original way in which transmission of genetic information
evolved. However, it might also be that RNA genes are more
recent and have re-evolved after the DNA→RNA→protein world appeared.
Whatever the truth about the evolution of RNA genes, the fact
is that they exist and perform regulatory, structural, and catalytic functions.
Predicting non-coding RNA genes can be done using methods that are equivalent
to the ones described for finding protein-coding genes. In addition, there
is a set of methods specific to RNA, which rely on the assumption
that ncRNA genes code for RNA sequences that are thermodynamically stable.
Hence, one can systematically analyze intergenic regions, predict their
folds, and calculate the thermodynamic stability of those folds. More stable RNA
structures are then flagged as possible ncRNA genes. It should be noted
that this class of methods has been called into question by a study showing
that the thermodynamic stability of ncRNAs and that of random RNA sequences
are not significantly different.
In fact, lack of statistical significance is a problem in predicting ncRNA
genes. Because they are smaller than protein-coding genes, the statistical
signal that distinguishes random RNA sequences from ncRNA sequences is
weaker, and these genes are harder to detect.
3C data. In the other plot the x-axis represents the same, but the y-axis
represents the sensitivity of each chromosomal site to the specific DNAse(s)
[restriction enzymes] used to digest the DNA. 5C and HiC data pertain to
larger genomic regions and therefore can be better represented using heat
maps, as shown in slide 65 of Class 2.1 powerpoint.
CHAPTER 3
Bioinformatics with RNA
3.1 Transcriptomics
3.1.1 Macro/micro arrays
The development of DNA macro and microarrays opened the door to measur-
ing changes in gene expression simultaneously for all the genes in a genome.
This enabled us to understand how the entire gene expression complement of
an organism changes in response to any type of challenge. RNA-Seq is now
phasing out macro- and microarrays. Still, these technologies remain
widely used, especially in clinical contexts, and work as follows. First, a set
of probes is synthesized, one for each gene of interest. Each probe must
be long enough so that it hybridizes specifically with a single gene within the
genome. The probes are then attached at specific locations (spots) to a physical
support (a slab of material), forming an array. Once the array is ready,
mRNA is extracted from the cells of interest and amplified using labeled nucleotides.
In radioactive microarrays, one can hybridize the amplified RNA
directly and measure the radioactivity at each spot. By having a radioactive
standard one should then be able to calculate absolute amounts of gene
expression. In fluorescent microarrays, one labels the amplified mRNA of
two different conditions with fluorescent probes that emit at different wavelengths
and hybridizes the two sets of amplified mRNA with the same array.
Spots that hybridize with genes that are equally expressed in both condi-
tions will emit a similar amount of fluorescence on both wavelengths. Spots
that hybridize with genes that are preferentially expressed in one of the conditions
will emit stronger fluorescence at the wavelength associated with that
condition.
There are several aspects one needs to be aware of when analyzing microarray
data. First and foremost, these data are often very noisy and have low
quantitative reproducibility. This means that even radioactive microarrays
can only be used to calculate absolute changes in very special cases. Second,
and focusing on fluorescent microarrays, there is not a single way to treat
the raw data that is universally accepted and works 100% of the time. Many
protocols exist and they all have advantages and disadvantages. All of them
consider several aspects of treating the data:
1. Background noise. RNA is sticky and binds nonspecifically to the phys-
ical material of the microarray. This creates a background of fluores-
cence that needs to be subtracted from the overall intensity one mea-
sures. There are several ways in which one can deal with this problem.
For example, one can use internal controls with spots where there are
no probes and measure the fluorescence of each wavelength in those
spots. Through the use of a statistical model for the distribution of
this background noise one can then subtract it from all over the mi-
croarray.
2. Are the two fluorescent dyes equal in their behavior? In other words, for
the same amount of labeling, do they emit the same amount of fluorescence?
If not, this should also be controlled for. For example, fluorescence
should be normalized at each wavelength and normalized fluorescence
should be compared.
3. Once all background noise has been removed and all intensities nor-
malized, how do we know that a given gene has differential expression
between two conditions being compared? Again, statistics helps. For
example, one could assume that not all genes should, could, or need
to change their expression between alternative conditions. If this is so,
then one can collect all the changes across the genome and create a
statistical model of those changes. With this statistical model (distri-
bution) in hand , one can then identify all the changes that are less
likely than a given (low) probability of being observed by chance. Say
you want to identify the changes that are less than 5% likely of being
observed by chance. Then, you go to the distribution and pick those
changes outside of the 2.5% percentile on each side of the tails. You
can also do this in a non parametric way, by ranking the changes in
gene expression and picking up the smallest and highest 2.5%.
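The sketch below illustrates the non-parametric variant of this idea on simulated expression values: rank the log2 fold changes of all genes and flag those that fall in the lowest and highest 2.5%. All numbers are random stand-ins for normalized intensities.

# Non-parametric selection of "differentially expressed" genes by ranking fold changes.
import math
import random

random.seed(1)
genes = [f"gene{i}" for i in range(1000)]
cond_a = {g: random.lognormvariate(5, 1) for g in genes}
cond_b = {g: random.lognormvariate(5, 1) for g in genes}

fold_changes = {g: math.log2(cond_b[g] / cond_a[g]) for g in genes}
ranked = sorted(genes, key=lambda g: fold_changes[g])

cutoff = int(0.025 * len(genes))            # 2.5% of genes in each tail
down = ranked[:cutoff]                      # most strongly "down-regulated"
up = ranked[-cutoff:]                       # most strongly "up-regulated"
print(len(down), len(up), down[:3], up[:3])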
aligning to them. The opposite will happen with lowly expressed genes. The
combination of aligning and counting results in high-quality and reproducible
measurements of gene expression, but requires a number of corrections and
normalizations. One of the many alternative approaches is to normalize
the counts (number of reads aligning to a gene) by the length of the
gene and by the total number of mapped reads. Depending on the order of the
normalizations and the details of the method chosen, gene expression will be
obtained as Fragments Per Kilobase of transcript per Million mapped reads
(FPKM) or as Transcripts Per Million (TPM).
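For concreteness, the sketch below computes TPM for a toy set of three genes with made-up counts and lengths. Note that, by construction, TPM values sum to one million in each sample, which is what makes them comparable across samples.

# TPM (Transcripts Per Million) for a toy example.
gene_lengths_kb = {"geneA": 2.0, "geneB": 0.5, "geneC": 1.0}   # gene lengths in kilobases
read_counts = {"geneA": 400, "geneB": 300, "geneC": 300}       # reads mapped to each gene

rates = {g: read_counts[g] / gene_lengths_kb[g] for g in read_counts}  # length-normalized
scaling = sum(rates.values()) / 1e6
tpm = {g: rate / scaling for g, rate in rates.items()}

print(tpm)                     # per-gene expression in transcripts per million
print(sum(tpm.values()))       # TPM values always sum to one million per sample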
Although one can find whole genome/transcriptome gene expression data
in many places, GEO (Gene Expression Omnibus) collects most of these data
that are publicly available. In fact, reporting experiments that generate
such data in most professional journals requires that the data be deposited in
GEO.
Treatment of gene expression data can also be done using an almost
infinite number of programs, many of them proprietary. GEO itself provides
some functionality to analyze the data it contains. However, if you want to
analyze data on your own, the standard is Bioconductor, which is an R-based
software that is free and accessible to everyone.
3.2 ncRNA
Lets consider two large classes of ncRNAs. The first class are regulatory
ncRNAs that bind other RNAs (coding or non coding) and regulate the way
in which those RNAs perform their function. The second class of RNAs have
direct structural and catalytic functions and we will call them structural and
catalytic RNAs.
1. The native structure of RNA molecules is the one with the minimum
energy.
2. The free energy of each base pairing is independent of all other base
pairs.
3. Settle on a score for each type of base pairing situation between residues
in positions i and j. An example of how the 4 possible situations might
be scored is given in slide 33. Typical values for how much a given base
pairing stabilizes the structure are given in slide 34.
4. Use the scores from step 3 to fill the upper triangular matrix. The first four
diagonals will be zero. We can then start with the next diagonal and
score the possible pairings. Then, we can calculate the score of each
pairing in each row and column.
5. Once we have the matrix fully scored, we can start at the highest score
and trace the score back until we reach the middle diagonals. This gives us
the most stable configuration, as shown in slide 37.
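The sketch below implements this dynamic-programming idea in a simplified, Nussinov-style form: the pair scores are illustrative (not the values from the slides), a minimum hairpin-loop size stands in for the zeroed diagonals, and the traceback recovers the highest-scoring set of base pairs.

# Nussinov-style RNA secondary structure prediction (simplified sketch).
PAIR_SCORE = {("G", "C"): 3, ("C", "G"): 3, ("A", "U"): 2,
              ("U", "A"): 2, ("G", "U"): 1, ("U", "G"): 1}
MIN_LOOP = 3   # minimum number of unpaired bases inside a hairpin loop

def fold(seq):
    """Fill the upper triangular matrix with the best pairing score per subsequence."""
    n = len(seq)
    N = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            best = max(N[i + 1][j], N[i][j - 1])
            pair = PAIR_SCORE.get((seq[i], seq[j]))
            if pair is not None:
                best = max(best, N[i + 1][j - 1] + pair)
            for k in range(i + 1, j):                     # bifurcations
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best
    return N

def traceback(N, seq, i, j, pairs):
    """Recover one optimal set of base pairs from the filled matrix."""
    if j - i <= MIN_LOOP:
        return
    if N[i][j] == N[i + 1][j]:
        traceback(N, seq, i + 1, j, pairs)
    elif N[i][j] == N[i][j - 1]:
        traceback(N, seq, i, j - 1, pairs)
    elif (seq[i], seq[j]) in PAIR_SCORE and \
            N[i][j] == N[i + 1][j - 1] + PAIR_SCORE[(seq[i], seq[j])]:
        pairs.append((i, j))
        traceback(N, seq, i + 1, j - 1, pairs)
    else:
        for k in range(i + 1, j):
            if N[i][j] == N[i][k] + N[k + 1][j]:
                traceback(N, seq, i, k, pairs)
                traceback(N, seq, k + 1, j, pairs)
                return

rna = "GGGAAAUCC"
matrix = fold(rna)
pairs = []
traceback(matrix, rna, 0, len(rna) - 1, pairs)
print(matrix[0][len(rna) - 1], pairs)   # best total score and the paired positions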
After seeing how genome, transcriptome, and other RNA analyses require
bioinformatics, and after reviewing some general methods, we shall move on to
proteins. Until just a few years ago, finding protein-coding genes and characterizing
those genes and the proteins they encode was probably the main
field of application for bioinformatics.
CHAPTER 4
Bioinformatics with Proteins
protein sequences (on the order of 10^8). Hence, finding the structure of the
protein that you are interested in is literally a thousand times less likely than
finding its function via sequence comparison to a well characterized ortholog.
It is because of this that methods to predict protein structures from their
primary sequences were developed. The most accurate approaches to predict
protein structure involve what is known as homology modelling. Homology
modelling servers work in the following way:
1. The sequence of your protein of interest (the query) is compared against
databases of proteins of known structure.
2. The best-scoring homologs of known structure are selected as templates
(the subject).
3. The query sequence is aligned to the template sequence(s).
4. Based on the alignment, the amino acids of your query sequence are
threaded and distributed spatially in the same place as the residues to
which they are aligned from the subject.
5. This initial model is then optimized using force fields and energy minimization,
and the side chains of the amino acids are rearranged to eliminate
clashes. In the end you obtain a model of the structure your protein is
likely to have.
So, what can we do when no templates are available for homology modeling?
In such a case, ab initio methods can be used. There are many such
methods. However, the one that seems to be most accurate so far is the
one developed by David Baker and his group, ROSETTA. This method does
not rely solely on first principles; it uses small-scale homology to predict protein
structure in a way similar to BLAST:
1. Check whether the full query sequence has homologs of known structure;
if so, homology modelling can be used instead.
2. If not, divide the query sequence into short subsequences (by default 6
letters, although this can be changed).
3. Look for homology to those sequences. Once found, identify all possible
short structural elements that are associated to those small sequences.
4. Start assembling the model as if you were synthesizing the protein in the
ribosome. The first six amino acids come first. Then, add the second
six, followed by the third six, and so on. This procedure automatically eliminates
all possible conformations in which assembling a new fragment causes
clashes with preexisting amino acids.
5. Once you finish assembling the sequences, you minimize the energy and
rank the models according to that energy.
There are servers that allow you to automatically create models of protein
structures using the various types of approaches available out there. The Protein
Model Portal collects many of these under the same umbrella. SWISS-MODEL
is probably the best homology modelling server out there. ROBETTA
implements the ROSETTA algorithm, but it is very slow. If you need to use
ROSETTA intensively, it is better to download the software, install it, and run it locally.
4.3 Proteomics
Even though organisms mount a long-term, slower response to
changes in environmental conditions by changing gene expression, those changes
do not necessarily propagate to the subsequent levels of cellular execution.
For example, sometimes a change in environmental conditions causes changes
in the stability of the RNA, and an observed change in gene expression is only
mounted to maintain RNA levels and protein activity constant. To fully understand
how the adaptive responses propagate, one must also analyze the
protein complement of the cell, or its proteome. This can be done in several
ways. Classically one can study small sets of proteins and their abundances
using traditional biochemical approaches.
However, as was the case with RNA, there are methods that permit quan-
titative proteome wide studies. In contrast with the genome wide gene ex-
pression changes, where new technology had to be developed, proteome abun-
dance studies rely mostly on the development of previously existing technolo-
gies. Originally, proteome wide abundance studies relied on a combination
of electrophoresis and spectroscopy. The protein fraction of the cell was
solubilized and extracted, then ran through a 2D gel electrophoresis where
proteins are separated by charge in one dimension and mass in the other.
Protein spots could then then identified, for example via protein staining
methods. Each spot was isolated and the proteins extracted. A cocktail of
specific proteases could then be applied to these proteins, for the fragments
to be subject to spectroscopic/spectrometric analysis. In mass spectrometry,
for instance, the fragments are again separated by charge and size, resulting
in a spectrum per sample. This method is known as Peptide mass finger-
printing. Another method that is gaining users is tandem, where the peptide
fragments are generated via collisions on the fly. Modern methods replace
the electrophoresis step with alternative separation techniques.
Knowing the sequences of the proteins and the composition of the protease
cocktail, a theoretical spectrum can be calculated for each protein so it can
be compared with those of each spot. Thus, it is in principle possible to identify
which proteins are in each spot and in what amount. This is not as easy as
it sounds, though. The spots often contain more than one protein, making the
spectra difficult to analyze. In practice, mathematical methods are needed to
analyze and compare spectra. These methods often rely on comparing the
mass peaks of the two spectra and identifying a number of peaks that can be
associated with specific proteins above a certain level of statistical significance.
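The sketch below illustrates the first step of this comparison: an in silico trypsin-like digestion of a protein sequence (cut after K or R, but not before P) and the calculation of approximate monoisotopic peptide masses. The protein sequence is just an example, and a real pipeline would also handle missed cleavages and modifications before matching the masses against the measured spectrum.

# In silico digestion and theoretical peptide masses (sketch).
# Approximate monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide (terminal H and OH)

def tryptic_peptides(protein):
    """Cut after K or R unless the next residue is P (simplified trypsin rule)."""
    peptides, current = [], ""
    for i, residue in enumerate(protein):
        current += residue
        next_res = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_res != "P":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[res] for res in peptide) + WATER

protein = "MKWVTFISLLLLFSSAYSRGVFRR"   # example sequence for illustration only
for pep in tryptic_peptides(protein):
    print(f"{pep:>20s}  {peptide_mass(pep):10.4f}")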
Proteomics methods can also achieve absolute quantification of protein
abundances. Relative proteomics methods compare two samples to determine
the relative abundance of each protein between them. This is done by
labeling the two samples with alternative isotopes of some of the atoms used to
synthesize proteins and calculating the isotope ratio for each protein. Calculating
absolute protein numbers from these methods requires the availability of an
absolute standard for one of the samples. The field of proteomics is still
growing at a fast pace and new methods keep being developed, including
different label-free approaches.
Proteome abundance measurements are not as sensitive as measurements
of changes in gene expression. The former tend not to detect proteins that
are present in low abundance (fewer than about 50 copies per cell). In addition,
they may not always be able to resolve which proteins are identified in a given
sample. As is the case with changes in gene expression, there are also other
aspects of the proteome that can be analyzed. For example, one can use
techniques similar to those used in the nC methods to identify genomic binding
sites for transcription factors, for example using ChIP-chip experiments.
Another example is the use of labeled ATP to measure changes in the
phosphorylation patterns of known proteins. There is an abundance of software
that allows people to identify proteins in MS experiments. Many of the available
programs are proprietary and come with the machines that are used for the
experiments. However, there is also a lot of open-source general-purpose
software that can be used and adapted for each specific purpose. Examples are
Greylag, Inspect, and MassWiz, which can either be downloaded and installed
locally or run through an online dedicated server.
CHAPTER 5
Bioinformatics with Networks
At this point we have an idea about many of the things we can do and
learn with bioinformatics about a gene or protein, based solely on static
sequence information. This is very nice and allows us to obtain a lot of
information about what those genes do or how they evolve (by comparing
sequences of orthologous genes over many organisms). One can also identify
the limiting conditions under which organisms can survive. This can be done simply
by analyzing the genome of an organism, identifying which functions are
present and which are absent, and correlating those functions with what the
environment needs to provide to the organism for it to survive. For example,
if the pathway for the biosynthesis of lysine is absent from the genome, the
organism needs the environment to provide that amino acid for survival.
However, we often need another type of information that is more dynamic
and quantitative. For example, if we want to know how a microbe reacts
to an antibiotic in real time, sequence information provides very little help.
In this situation, one needs to measure the dynamic response of the various
components that permit a microbe to react to the drug. Such measurements
are the bread and butter of molecular and cellular biology. For more than
half a century we have been measuring how specific components of a cell
respond to specific challenges. For example, Monod and Jacob won the Nobel
prize by analyzing the response of the lac operon to various inducers and
repressors of gene expression, in order to reconstruct the regulation of the operon.
The limitation of these classical approaches is that one was only able to
follow the dynamics of a small set of molecular components of the cell at
the same time. Since 1995, however, technological developments have shifted
the way we can measure the molecular components of the cell. The
omics revolution has allowed us to (in principle) measure dynamic changes
in all possible components of the cell at the various levels. This generates
a situation where humans need computers and computational methods to
integrate, analyze, and extract information from the data.
you have and asks for another metabolomics analysis. After several tries he
finds something that cures you, but he still has no idea what you had. Suppose
that the following year you fall sick again and get a metabolomics analysis of
your blood done. If the spectrum of that analysis is sufficiently similar to that
of your previous illness, your doctor might automatically give you the same
medicine that worked before.
In addition to the ability to determine the metabolites present in a biolog-
ical sample, we are also able to determine, in principle, how these metabolites
are synthesized. Fluxomics techniques allow us to do so. These techniques
rely on labeling initial metabolites with rare isotopes and following how these
rare isotopes distribute among the various metabolites that are synthesized
from the labeled precursor. For example, if we start with labeled glucose
and, at small intervals, take samples and measure how the labeled atoms of
glucose are distributed through the glycolytic pathway using NMR, we can
then obtain the kinetics of the individual processes in that pathway.
Homology transfer
First, there is the most traditional method one can think of: Homology trans-
fer. There are many molecular pathways that have already been characterized
and for which the proteins, RNAs, and/or metabolites that participate in the
pathway are known. Imagine one can, through sequence comparison (or any
other method), identify the sequences of genes/proteins in a new genome
that are orthologous to those that are known to participate in that pathway
in other organisms. In this case one can almost automatically attribute that
gene or RNA in the new genome to the corresponding place in the pathway
that appears to be conserved from other species.
Data mining
slower than iHOP. To address the issue of speed, Biblio-MetReS stores the
gene/protein co-occurrence information regarding any document it finds and
analyzes once. Thus, if a document is found in a new search but was ana-
lyzed before, Biblio-MetReS will retrieve the processed information regard-
ing that document from its central database, rather than analyzing it again.
Finally, Biblio-MetReS allows users to include lists of biological processes
and/or pathways with which the genes and proteins might co-occur, providing
additional information regarding how the various genes may be functionally
interacting.
Evolutionary methods
This semi-automated method for network reconstruction focuses on analyz-
ing evolutionary information regarding genes and proteins from organisms.
The rationale for using this information in network reconstruction is as follows:
proteins (genes) that work together are likely to be under evolutionary
pressures that give them similar evolutionary patterns. Again, turning
this argument on its head, one might state that proteins/genes that have
similar evolutionary patterns are more likely than average to be functionally
interacting. There are several ways in which coevolution can be analyzed.
First, there are phylogenetic profiling techniques. Imagine that one
has the fully sequenced and annotated genome of several organisms. We
can then build a matrix where each row is an organism and each column
is a protein/gene. By analyzing which genes are simultaneously present or
absent in the same subset of organisms, one can identify genes that are likely
to participate in the same processes. More complicated logical analysis can
also be performed using such a matrix. For example, if a gene is always
absent when another gene is present and vice versa, this might indicate
that both genes perform the same function in alternative organisms.
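The sketch below shows the core of this idea on a made-up presence/absence matrix: gene pairs with identical profiles are flagged as candidate functional partners, and pairs with complementary profiles as possible alternative solutions to the same function.

# Phylogenetic profiling on a toy presence/absence matrix.
from itertools import combinations

# Each profile lists presence (1) or absence (0) of a gene across six organisms.
profiles = {
    "geneA": [1, 1, 0, 1, 0, 1],
    "geneB": [1, 1, 0, 1, 0, 1],   # same profile as geneA -> candidate partner
    "geneC": [0, 0, 1, 0, 1, 0],   # complementary profile -> possible functional analog
    "geneD": [1, 0, 1, 1, 1, 0],
}

for g1, g2 in combinations(profiles, 2):
    p1, p2 = profiles[g1], profiles[g2]
    if p1 == p2:
        print(f"{g1} and {g2} co-occur in the same organisms (possible same pathway)")
    elif all(a != b for a, b in zip(p1, p2)):
        print(f"{g1} and {g2} never co-occur (possible alternative solutions to one function)")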
Second, one can also do conservation of gene neighborhood analysis
(termed synteny in eukaryotic genomes). In this case one analyzes the regions
of the genome that are close to a given gene and identifies the other genes that
are present. If those genes are conserved in the proximity of orthologs of the
original gene in other organisms, this might be an indication that genome
evolution has kept these genes together for some reason and that they are
cooperating to perform some function. This method is much
more accurate for prokaryotes than for eukaryotes. In fact, operons are the
main example that sustains this method. Typically an operon contains sets
of genes that belong to the same pathway.
Third, one can identify gene fusion events by comparing various
genomes. For example, E. coli employs two separate genes in the biosynthesis of
tryptophan that, in B. subtilis, are merged together. In such cases, gene fusion
events indicate both functional and physical interaction between the
products of the genes.
Fourth, collecting sequences for the orthologs of several genes/proteins in
a number of organisms and building multiple alignments for each gene enables
the construction of phylogenetic trees. These trees can be compared to
one another. The genes that give rise to similar trees have similar
evolutionary patterns and are thus candidates for functional interaction.
The fifth set of methods is based on the analysis of omics data. The principles
that underlie them are similar: if a given set of genes/proteins/metabolites/RNAs
shows similar patterns of regulation under different conditions, then the molecules
in this set are likely to be working together and to participate in the processes
that regulate the adaptive responses to those conditions.
The sixth set of methods is based on large scale protein interaction
data. There are several techniques that pertain to this set of methods. First,
there have been experimental measurements of physical interactions between
every possible pair of proteins in a genome for several well-studied model
organisms. This was done using, for example, two-hybrid system approaches,
where if two proteins interact they emit a signal that can be detected. If
the pair of proteins does not interact, no signal is detected. Physical in-
teraction is taken as an indicator of functional cooperation. Having these
large interaction datasets for several organisms allows us to transfer those
interactions to other organisms by assuming that orthologs of the interact-
ing proteins might also be interacting. Second, there are computer methods
that permit predicting if two proteins might interact and how. These meth-
ods are of two types. Sequence docking relies on having the sequences for
orthologs of the two proteins in several organisms, which permits creating
multiple alignments for each of the proteins. By comparing how the residues
are conserved in each of the two alignments one can sometimes predict if
specific positions in the two proteins physically interact. This is done by
identifying compensatory mutations between the two proteins. For example,
imagine two residues, one from each protein, that participate in the physical
interaction between the proteins, and that one of the residues has mutated in
some of the organisms (for example Asp → Gln). In the
same organisms, the interacting residue from the other protein should have
mutated in order to maintain the interaction (for example from Gln → Asp, in
order to maintain the electrostatic complementarity between residues). The
other type of computer docking methods are termed in silico docking. These
methods require the experimental or modeled structure of a pair of proteins
and, using physical-chemical, electrostatic, thermodynamic, and spatial considerations
(and sometimes biological information), identify the most likely
way in which the two structures interact. These methods are very computationally
intensive and require a lot of power. So far, they don't scale up well to whole
genome/proteome level analyses. To my knowledge there is only one server that
combines most of these methods in a way that is user friendly. That server
is STRING, and it lacks protein docking methods. Although the server, its
database, and its integration are far from perfect, it is probably as good as it
can be with the current level of biological, scientific, and technical knowledge
that is available.
As a final note, be advised that gene regulatory information (for example,
which TFs might regulate gene expression or which RNAs might regulate
mRNA use by the ribosomes) is still not included in STRING. In fact, integration
and prediction of circuits at the gene and RNA regulatory levels
with circuits at the protein level is still incipient at best. This is also true
for automated identification of regulatory interactions of proteins by small
metabolites. Nevertheless, there are already databases and work regarding
such regulatory circuits that might in the near future facilitate automated
reconstruction of circuits at this level.
CHAPTER 6
Systems Biology
This chapter focuses on presenting and discussing the methods and tech-
niques that you might need to address the second practical task that you
will have to do.
your network. However, when you start interrogating the network, the limi-
tations of this representation become apparent. Let us start by asking what
the interactions between nodes mean. Just by looking at the figure in slide 6
we can not really answer this question. Let us try another question and try
to figure out which nodes (molecules) are important regulatory points in the
dynamic responses of the system. Again, we can not answer this question
just from the representation we have here. What about identifying all nodes
that are fundamental for the response of the network, can we know that from
analyzing the graph? The easy answer is again no. However, there is a more
accurate and complicated answer, which is “partially”. There is a branch of
mathematics called Graph theory that analyzes the connectivity of graphs
and the centrality of the various nodes. It turns out that biological circuits
have certain properties that enable using a set of four types of properties
of the graph to predict (with up to 70% accuracy) which nodes might be
fundamental in the functioning of the network. We will not go into details
about this here; let us just keep the simple answer: no.
Overall, what we can say from this is that these node-and-edge representations
tend to be ambiguous and are not very helpful in letting us predict the
physiological behavior of our circuits of interest. Even more detailed graphs,
such as the ones that can be displayed by STRING, are not much better in
this respect, as you cannot really infer what is going on in the interaction,
only what type of interaction it is.
How can we overcome this limitation and use reconstructed circuits to
analyze and predict the physiological behavior of the system? If ambiguity
is the problem, let us define a representation that is unambiguous. There are
several ways to represent molecular circuits in such a way. For example, the Systems
Biology Graphical Notation (SBGN) provides such a representation, where
each type of biological reaction and event is represented by a specific symbol.
However, for our purposes this is unnecessarily complicated. There are dozens
of symbols and they are continuously being updated. In this course we will
opt for using the notation that chemists have been using for more than 100
years, with slight adaptations. This representation will be flexible enough to
represent all processes we will be interested in and simple enough that we
do not have to memorize tens to hundreds of symbols. The notation is very
simple. Whenever there is a material flow between two pools of material,
these two pools of material (or molecules) will be united by a full arrow,
with the arrowhead pointing in the direction of the material flow. If the
flow is influenced by another molecule or molecules that are neither used nor
produced in the process, a dashed arrow will unite each of those molecules
to the material flow arrow. The arrowhead will point in the direction of the
material flow arrow. Dashed arrows will never point to other dashed arrows
or to species in the circuit, only to material flux arrows. If the dashed arrow
represents an activation of the flux, it is associated with a plus sign. If the
dashed arrow represents an inhibition of the flux, it is associated with a minus
sign.
In a circuit representation there are two types of molecules (or variables).
Molecules whose amounts change over time are called dependent variables.
Molecules whose amounts do not change over time are called independent
variables. This is a definition that we will use for the creation of mathematical
models. For example, in slide 19 A and B change over time, as material flows
from A to B. These two are dependent variables. On the other hand, C does
not change over time, as no arrow brings material to or draws material from
the pool of C. C is an independent variable.
In our network representation we also need to include stoichiometric infor-
mation. For example, if two molecules of A are needed to form one molecule
of B, the number 2 should appear before A. If more than one species as-
sociates to form another species or complex, then the associating species
should be included, together with their stoichiometry. For example if three
molecules of species D associate to 2 molecules of species A to form species or
complex B, the representation is as shown in slide 19. If the reactions are
reversible, then an arrow pointing from the products to the reactants should
also be included. In such cases, it is important to clearly identify of the
modulation arrows influence the forward reaction, the reverse reaction, or
both.
Also, let us define a notation for naming the molecules in a circuit. Al-
though this might seem like overkill, it can serve two purposes. First, in
large models where the names of the species are similar (e.g., G6P or G1P)
it is easy to make a mistake in writing the reactions. Having a systematic
notation that removes names and puts the focus on variables decreases the
probability of making such mistakes. Second, if and/or when you ever have
to implement methods to create and analyze these models, such methods will
internally convert each molecule into a consecutive set of variables with the
same tag and an increasing number. What we will do in this course is use X as
the stand-in for the variable name and number the various consecutive variables
in increasing order.
When creating a conceptual model for a circuit we must also consider
that often we do not require models for the entire cell, only for the part of
the cell that we are interested in. Often, in such cases, material flows into
and/or out of your system. This can be represented by source reactions,
where a full arrow pointing to a species indicates that material comes into the
system from outside, and sink reactions, where a full arrow pointing away from
a species indicates that material leaves the system. Often,
source reactions can also include substrates that are considered to be
independent variables in your circuit. For example, if a protein (mRNA) is
synthesized, the source reaction comes from the cellular pool of amino acids
(nucleotides). The metabolic levels of amino acids (or nucleotides) in the cell
are approximately constant, irrespective of protein (RNA) synthesis. Hence,
we can say that, even though they are being used, amino acids (nucleotides)
do not change over time for our purposes, and they are independent variables
in our system.
Another aspect that one should consider is that of cellular compartments.
If the system we want to study is distributed throughout several cell com-
partments (e.g. cytoplasm and nucleus), it is likely that we should represent
these compartments and consider them later to build a model.
Now that we have an unambiguous representation for biological circuits,
let us check whether such a representation is enough to allow for analysis and
prediction of the physiological behavior of a concrete system. Let us consider
a simple three-step biosynthetic pathway that is a representative abstraction,
for example, of the biosynthesis of some amino acids. As represented in slide
28, a source X0 is used to produce a metabolite X1, which in turn produces
X2. X2 is further metabolized to create X3, which is the final product of the
pathway. The consecutive reactions are catalyzed by enzymes E1, E2, E3,
and E4. E4 represents the cellular demand for the product of the pathway. It
is very common that the final product of the pathway, X3, inhibits the first
reaction of the pathway. This type of inhibition is termed overall feedback.
Now that we have the pathway representation, let us see if we can predict
how the pathway might work dynamically. Imagine that at time t0 you
increase X0. Using linear logic we would think that this would lead to an
increase in X1, which in turn would lead to an increase in X2, followed by
an increase in X3. If X3 increases then the inhibition of the first reaction
would become stronger and one might predict that it would be followed by a
decrease in the concentration of X1, then X2, then X3. This would again lift
the inhibition, allowing more material to come into the pathway, and leading
to subsequent increases in the pools of X1, X2, and X3. The cycle would
then repeat. This means that linear logic would predict that the dynamical
behavior of the pathway is oscillatory, with cyclic increases and decreases in
the amounts of the metabolites.
When we actually observe how the pathway behaves, this is what we see. If the parameter that codes for the overall feedback strength is low, the pathway will not oscillate. It will reach a dynamic equilibrium known as
steady state where the concentrations remain constant. This lack of change is
due to the fact that what goes into the system is perfectly balanced by what
comes out of it, and not because the system has stopped. If the parameter
that codes for the overall feedback has intermediate strength, then we do
observe oscillations. If the parameter that codes for the overall feedback has
high values, then the system becomes unstable and there is neither a steady
state nor an oscillation. Such a system would not allow for survival of an
organism.
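To make the first two regimes concrete, here is a minimal simulation sketch in Python. It uses a Goodwin-type repression model as a stand-in for the three-step pathway with overall feedback; the kinetic forms, the parameter values, and the use of the Hill coefficient n as a proxy for feedback strength are illustrative assumptions, not the model behind the slides, and this particular stand-in does not reproduce the third, unstable regime.

# A minimal sketch of the steady-state versus oscillatory regimes, using a
# Goodwin-type model as an illustrative stand-in for the three-step pathway
# with overall feedback.  All kinetic forms and parameter values are
# assumptions chosen only to show the two regimes.
import numpy as np
from scipy.integrate import solve_ivp

def pathway(t, x, n):
    """dX/dt for a three-variable loop in which X3 inhibits the first step.

    n is the Hill coefficient of the feedback: here it plays the role of the
    'feedback strength' parameter discussed in the text.
    """
    x1, x2, x3 = x
    v_in = 1.0 / (1.0 + x3**n)    # inhibited first reaction (supply from X0)
    return [v_in - 0.5 * x1,      # X1: production minus consumption
            0.5 * x1 - 0.5 * x2,  # X2
            0.5 * x2 - 0.5 * x3]  # X3: its consumption is the demand (E4)

t_eval = np.linspace(0, 200, 2000)
for n in (2, 24):
    sol = solve_ivp(pathway, (0, 200), [0.1, 0.1, 0.1], args=(n,), t_eval=t_eval)
    x3_late = sol.y[2, -500:]              # tail of the simulation (t > 150)
    spread = x3_late.max() - x3_late.min()
    regime = "sustained oscillations" if spread > 1e-3 else "steady state"
    print(f"feedback cooperativity n={n:2d}: {regime} (X3 spread = {spread:.4f})")

With weak feedback (n = 2) the concentrations settle to a steady state; with strongly cooperative feedback (n = 24) the same circuit shows sustained oscillations, in line with the qualitative description above.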
What this example illustrates is that having an unambiguous representa-
tion of a biological circuit might be necessary but it is not sufficient to enable
analysis and/or prediction of the physiological behavior of that circuit. Why
might this be? Why can we not in general predict the dynamic behavior of a
biological system using simple logic? The reason is simple: Biological reac-
tions have non-linear dynamic dependencies on the variables that influence
them. Because of this, linear logic will fail when the system is operating in a non-linear regime. The way to solve this problem is to use the conceptual representations to create non-linear mathematical models of the systems of interest. The first of these models was created and published back
in 1943 by Britton Chance.
The mathematical basis for writing such models in a general way is Taylor's theorem: any sufficiently smooth flux function can be written as an infinite series around a reference operating point (denoted by the subscript 0; mixed derivative terms are omitted here for simplicity):

\[
f(X_1, \dots, X_n) = f_0(X_{1,0}, \dots, X_{n,0})
+ \sum_i \left.\frac{\partial f}{\partial X_i}\right|_0 (X_i - X_{i,0})
+ \sum_i \frac{1}{2} \left.\frac{\partial^2 f}{\partial X_i^2}\right|_0 (X_i - X_{i,0})^2
+ \cdots
+ \sum_i \frac{1}{n!} \left.\frac{\partial^n f}{\partial X_i^n}\right|_0 (X_i - X_{i,0})^n + \cdots
\tag{6.1}
\]
What this means is that we can now write pretty much any function that
is relevant to represent the dependency of a biological flux on its influencing
variables. However, it is still of little help, given that in general the series is
infinite and we are not good at dealing with infinities.
How can we solve this problem? Well, by approximating the infinite
series with something that is manageable and yet has a range of validity
where predictions and analysis can still be done with sufficient accuracy.
The easiest way to do this is by truncating the Taylor series at the first-order term:
\[
f(X_1, \dots, X_n) \approx f_0(X_{1,0}, \dots, X_{n,0})
+ \sum_i \left.\frac{\partial f}{\partial X_i}\right|_0 (X_i - X_{i,0})
\]

This linear approximation is only accurate very close to the operating point, and biological flux functions are strongly non-linear. The trick is therefore to do the truncation not in Cartesian space but in logarithmic space. Define

\[
F = \log(f), \qquad Y_i = \log(X_i).
\]

Truncating the Taylor series of F as a function of the Y_i at the first-order term gives

\[
F(Y_1, \dots, Y_n) \approx A + \sum_i g_i Y_i .
\]

Here,
\[
A = F_0(Y_{1,0}, \dots, Y_{n,0}) - \sum_i \left.\frac{\partial F}{\partial Y_i}\right|_0 Y_{i,0}
\qquad \text{and} \qquad
g_i = \left.\frac{\partial F}{\partial Y_i}\right|_0 .
\]
Both of these quantities are constants. To return to Cartesian space from logarithmic space we have to apply the inverse of the logarithmic transformation, which is the exponential transformation:

\[
f(X_1, \dots, X_n) \approx \exp\!\Big(A + \sum_i g_i Y_i\Big)
= \exp(A) \prod_i \exp(g_i Y_i)
= \alpha \prod_i X_i^{\,g_i},
\]

where \alpha = \exp(A) is the rate constant of the flux and the g_i are its kinetic orders.
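As a quick numerical sanity check of this approximation (a minimal sketch with made-up numbers, not part of the original derivation), we can take a Michaelis-Menten rate, compute its kinetic order g = d log v / d log S at an operating point S0, and compare the resulting power law alpha * S^g with the exact rate around that point:

# Minimal sketch: power-law approximation of a Michaelis-Menten rate around
# an operating point S0.  Vmax, Km and S0 are made-up illustrative numbers.
import numpy as np

Vmax, Km = 1.0, 2.0
S0 = Km                      # operate at half saturation, where v = Vmax/2

def v_mm(S):
    """Exact Michaelis-Menten rate."""
    return Vmax * S / (Km + S)

# Kinetic order g = d log(v) / d log(S) at S0; for Michaelis-Menten kinetics
# this equals Km / (Km + S0), which is 0.5 at S0 = Km.
g = Km / (Km + S0)
alpha = v_mm(S0) / S0**g     # rate constant so the power law matches v at S0

def v_pl(S):
    """Power-law approximation v = alpha * S**g."""
    return alpha * S**g

for S in [0.5 * S0, S0, 2.0 * S0]:
    rel_err = abs(v_pl(S) - v_mm(S)) / v_mm(S)
    print(f"S = {S:4.1f}:  exact = {v_mm(S):.3f}  power law = {v_pl(S):.3f}  "
          f"relative error = {100 * rel_err:.1f}%")

For Michaelis-Menten kinetics the kinetic order works out to Km/(Km + S0), which is exactly 0.5 when the reaction operates at half saturation; this matches the 0.5 value used in the rule of thumb discussed next.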
The question now is how to give numbers to these parameters. It has been shown that, under physiological conditions, many reactions function with the variables close to the value they have when the flux is at half its maximum possible value. If this is the case, the kinetic order can be taken as 0.5 or -0.5, depending on whether the variable activates or inhibits the flux. Furthermore, if a variable of the flux function works as a catalyst of the process, it is often reasonable to assume that its kinetic order is close to 1. Likewise, in sink reactions without any modifiers and for which we don't know the mechanism, a typical simplifying assumption is that the kinetic order of the substrate is 1. However, if
additional information is available regarding the processes and fluxes, that
information should be used to better parameterize the models. So now we
have a non-linear formalism that can be used even when we don’t know what
the shape of the dependencies between the flux function and its variables is.
Furthermore, this formalism is regular and always looks the same. This is
important because it facilitates automating the derivation of mathematical
models from conceptually unambiguous representations of biological circuits
and because it permits developing analytical methods that take advantage
of the regularities in the functions and are computationally more efficient. It should be noted that, if the shape of the flux dependencies is known, then other spaces can be used to approximate the flux function. There are many
options, including saturating cooperative spaces, which perfectly represent the
cases from slide 35. However, in this course we will not go into those other
spaces. You can find details about those approximations in the paper on
campus virtual. As a final note, I would like to add that biologists often try to do the reverse of what we described in this section: taking a mathematical modeling paper, they try to reconstruct the network of interactions and actions of the system. This is not always easy, but it is good practice for better understanding what the math actually means. Still, this inverse problem often has more than one solution. Depending on the mathematical formalism used for the equations, a variable can be either a substrate or a modulator. Solving this inverse problem requires that one looks further than
the mathematical formalism. As stated above, each dependent variable has an equation describing its dynamic behavior in the system of ordinary differential equations that represents the circuit of interest. As a rule of thumb, plus and minus signs separate different terms in an ordinary differential equation. Terms preceded by a plus represent a process that contributes to the production of the dependent variable, while terms preceded by a minus represent a process that contributes to the consumption of the dependent variable. As another rule of thumb, if we don't know whether a variable X is a substrate or a modifier of a process that appears in the ordinary differential equation of another variable Y, one looks at the differential equation that describes X itself: if the same term also appears there, preceded by a minus sign, then the process consumes X and X is a substrate; if the term does not appear in the equation for X, then X influences the process without being consumed and is a modifier.
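As a purely hypothetical worked example of these two rules of thumb (the equations below are illustrative and are not taken from the course slides), consider

\[
\frac{dX_2}{dt} = v_1(X_1) - v_2(X_2, X_3), \qquad
\frac{dX_3}{dt} = v_3(X_2) - v_4(X_3).
\]

In the equation for X2, the term v1 carries a plus sign and therefore produces X2, while v2 carries a minus sign and therefore consumes X2. The variable X3 appears as an argument of v2 in the equation for X2, but v2 does not appear with a minus sign in the equation for X3; X3 is therefore not consumed by that process and acts as a modifier of v2, whereas X2, whose own equation loses material through v2, is a substrate of it.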
6.3 Examples of mathematical models in biology
A classic example is a whole-cell kinetic model of red blood cell metabolism. With such a model in hand, its authors were then able to test several things about our understanding of how red blood cells work. By analyzing the effect of changing parameters on the behavior of the model, they were able to identify which parts of metabolism were not really well understood. Parameters whose changes lead to an important loss of functionality in the metabolism of the red blood cell identify parts of the model for which we do not have enough information. We also know that red blood cells are very robust and can survive for up to 120 days. The authors used this ability to survive to identify possible regulatory interactions that were previously unknown, by systematically introducing candidate regulatory interactions and identifying which of them brought the survival of the model red blood cell closest to that of the real red blood cell. Some time later, one of the predicted regulatory interactions was actually verified experimentally by other groups.
As you can see there are several approaches you can use to create large
scale models of cells and organisms. However, I don’t recommend that you
do this. Whole cell/Whole organism models are difficult to create and try-
ing to make one of these is not a productive way to start modeling. My
suggestion is that you focus your questions on a pathway, circuit, or process that you identified in Task 1 and ask your questions and create your models
about these circuits. Even when you do this, there are steps you need to
take before you start modeling. You need to decide what you are going to
include in your system. For example, if you want to model the biosynthesis of methionine, you can obtain a meta map of metabolism that includes all reactions involved in this biological process in at least one organism. However, not all of these reactions are present in all organisms. You need to select the valid reactions for your organism. Once you have done that, you also need to decide the level of detail that you want to include in the reactions. Are you going to be mechanistically detailed, or is using approximations enough? There is no universally valid answer to this, and one needs to decide on a case-by-case basis. You should also realize that creating smaller models of subsections
and processes of cells and organisms is a valid way to understand how these
processes work. There is a fair amount of modularity in cells. This means that
studying these modules will give us valuable information about how they work
and regulate the global cellular metabolism. In general, you should simplify as
much as you can, but no more than that. What “more than that” means is, again, case specific. Often, it takes several iterations before you finally decide what to include in the model. Another aspect you should consider has to do with what your questions are. If you simplify your model in such a way that it no longer allows you to answer the question(s) you are asking, then your model is oversimplified. Up until this point, I have been talking about the questions you need to ask.
1. Use the known circuits for the pathways that synthesize carotenoids to
create power law models of the circuits. These models were different
for each maize line and depended on the genes that got inserted in the
specific line.
Once we had done this, we had two things. First, we had the most likely
profiles for protein activity, which is something that experimentally could
not be measured. Second we had functioning mathematical models for the
various lines. With these models in hand we could ask two additional questions:
1. Which parts of the pathway can we further engineer in order to manipulate vitamin production?
due to the fact that we could not get a good fit to the data when using
only the known structure of the biosynthetic pathways. This meant that we
took advantage of the power-law model and did the same as Ni & Savageau (slides 74-76 of the Class 3 powerpoint) and identified regulatory interactions that would improve the fitting. Of all tested interactions, we found that inhibition of v3 (slide 103 of the Class 3 powerpoint) by Lut would significantly improve the
fitting. This has not been tested yet, but later on we found that this regulation
exists in other plants, so the prediction makes sense.
This example is fairly easy for you to grasp because there are numbers in
it. However, you can also ask semi-quantitative questions about pathways for which you have few or no numbers. Consider the circuit shown in slide 68 of the Class 3 powerpoint. This is an abstract representation that can represent any
linear biosynthetic pathway in a cell. Typically, in such pathways, the final
product exerts a negative feedback on the first reaction (overall feedback). X4
represents the cellular demand for the product X3, whatever that product is.
Using a power law approximation, we can write the system of differential
equations for the circuit.
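A sketch of what such a system could look like is given below; the specific symbols (rate constants alpha and beta_i, kinetic orders g and h) are assumptions made here for illustration, and the circuit in slide 68 may use a slightly different notation.

\[
\begin{aligned}
\frac{dX_1}{dt} &= \alpha\, X_0^{\,g_{10}} X_3^{\,g_{13}} - \beta_1 X_1^{\,h_1},\\
\frac{dX_2}{dt} &= \beta_1 X_1^{\,h_1} - \beta_2 X_2^{\,h_2},\\
\frac{dX_3}{dt} &= \beta_2 X_2^{\,h_2} - \beta_3 X_3^{\,h_3}.
\end{aligned}
\]

Here X0 is the independent supply, the kinetic order g13 is negative because X3 inhibits the first reaction (the overall feedback), and the last consumption term, beta3 X3^h3, plays the role of X4, the cellular demand for the product.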
Just by having these equations, we can analyze the long-term, or homeostatic, behavior of the system. This is done by equating the right-hand sides of the equations to zero and solving the resulting algebraic equations. This is known as calculating the steady state of the system, and you can get information about which parameters exert a larger control over your system simply by calculating the derivatives of the solutions of the algebraic equations with respect to the parameters of interest. This type of analysis allows you to understand how the system behaves over the long term. If you are looking
at faster responses then you most likely need numbers for the parameters
and numerical simulations. However, even in such cases you can use nor-
malization techniques that decrease the number of parameters you have to
estimate.
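As a concrete sketch of this steady-state calculation, applied to the power-law system written above (the symbols are the same illustrative assumptions; this is not code from the course), one can exploit the fact that the steady-state conditions of a power-law model become linear after a logarithmic transformation, solve them symbolically, and then differentiate the solution with respect to a quantity of interest:

# Minimal sketch: symbolic steady state of the power-law pathway sketched
# above, and the logarithmic gain of the end product X3 with respect to the
# independent supply X0.  All symbols are illustrative assumptions.
import sympy as sp

# Logarithms of the concentrations: y_i = log(X_i); y0 is the independent input.
y0, y1, y2, y3 = sp.symbols('y0 y1 y2 y3')

# Kinetic orders and rate constants (g13 < 0 would encode the overall feedback).
g10, g13, h1, h2, h3 = sp.symbols('g10 g13 h1 h2 h3')
alpha, b1, b2, b3 = sp.symbols('alpha beta1 beta2 beta3', positive=True)

# At steady state every dX_i/dt = 0, i.e. production = consumption for each X_i.
# Taking logarithms of each balance turns the power laws into linear equations.
steady_state = [
    sp.Eq(sp.log(alpha) + g10*y0 + g13*y3, sp.log(b1) + h1*y1),  # X1 balance
    sp.Eq(sp.log(b1) + h1*y1,              sp.log(b2) + h2*y2),  # X2 balance
    sp.Eq(sp.log(b2) + h2*y2,              sp.log(b3) + h3*y3),  # X3 balance
]

sol = sp.solve(steady_state, [y1, y2, y3], dict=True)[0]

# How much control does the supply have over the end product?
# d log(X3) / d log(X0) at steady state:
gain = sp.simplify(sp.diff(sol[y3], y0))
print(gain)   # -> g10/(h3 - g13)

Because g13 is negative for an inhibitory feedback, the gain g10/(h3 - g13) is smaller than it would be without feedback, which is one way of seeing how the overall feedback buffers the end product against changes in supply.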
Irrespective of the model you have in hand you should never forget that
your model is only a model. It needs to be validated by contrasting its results with what is known about the system. If the model fails in this comparison, then you need to go back to the drawing board and improve it. However, even if the model is validated, you should never forget that a model is never valid over all possible conditions. If you push it hard enough, it will eventually fail.