
Introduction to Bioinformatics

Rui Alves, Alberto Marin-Sanguino

September 9, 2021
Contents

1 Introduction and resources
1.1 Why Bioinformatics?
1.2 A short historical overview
1.3 What is Bioinformatics and how to use it effectively?
1.4 Classifications and Ontologies
1.5 Databases and servers
1.6 File formats
1.6.1 FASTA
1.6.2 FASTQ
1.7 Outlook

2 Bioinformatics with DNA
2.1 Genome sequencing
2.2 Genome annotation
2.2.1 Homology approaches to detecting genes
2.2.2 Ab initio approaches to detecting genes
2.2.3 Final notes on genome annotation
2.3 The dynamics of the genome: changes in genome structure

3 Bioinformatics with RNA
3.1 Transcriptomics
3.1.1 Macro/micro arrays
3.1.2 RNA Seq
3.2 ncRNA
3.2.1 Regulatory ncRNA target prediction
3.2.2 RNA structural prediction
3.2.3 Ab initio comparative structural prediction of RNA molecules
3.2.4 Minimum energy structural prediction of RNA molecules
3.2.5 Structural inference prediction of RNA molecules
3.2.6 Final notes on the prediction of RNA structures

4 Bioinformatics with Proteins
4.1 From gene sequence to protein sequence
4.2 Extracting information about a protein from its sequence
4.2.1 Predicting physical chemical characteristics of a protein
4.2.2 Homology and ab initio protein structure modeling
4.3 Proteomics

5 Bioinformatics with Networks
5.1 The dynamics of the metabolome
5.2 Inference of biological networks

6 Systems Biology
6.1 From Network Biology to Physiology and Systems Biology
6.1.1 From omics data to circuits of interacting molecules
6.1.2 Unambiguous graphical representation of biological circuits
6.2 Mathematical modeling of biological systems
6.2.1 Mathematical representations and formalisms
6.3 Examples of mathematical models in biology
6.4 Noise in Biology
CHAPTER 1

Introduction and resources

1.1 Why Bioinformatics?

1.2 A short historical overview


The Biological Sciences have come a long way in the last century. Since the
1900s, biologists have made great progress in all aspects of biology, both at
the macroscopic and microscopic level. Life was split into its unit components,
and microbes, animals and plants were studied in captivity. These organisms
and their performance were studied and characterized in isolation, with the
hope that characterization could be used to extrapolate how the organisms
would interact with one another.
With the accumulation of technological breakthroughs, it was possible to
apply the same reductionist principles to the study of the molecular determi-
nants of organic life. In the 1950s the first genes and proteins were sequenced.
The amount of molecular data that became available for thousands of organ-
isms over the subsequent decades made it impossible for human beings to,
singlehandedly or in teams, organize, manage, store, analyze, and mine in-
formation from that data. This problem was compounded a thousand fold in
the late 1990s, with the onset of the omics age, when data from simultaneous
experimental determinations of the entire set of genes, proteins, or metabolites
of a cell became available in large amounts, leading to the problem of Big Data.
At some point during this data accumulation, Bioinformatics became essential
because the only possibility for researchers not to get lost in the data
and to actually be able to extract information from it was to use computers
and develop specific informatics tools and methods that could be applied to
molecular biology data.
A tremendous amount of human and clinical data has accumulated in
the last decades. This created an additional regulatory problem. The use
of these data falls under strict regulations because of confidentiality issues,
and its bioinformatics analysis often has to deal with strong computational
overheads that hinder generating knowledge from the data.

1.3 What is Bioinformatics and how to use it effectively?
The development of computational methods and tools for their application
to the study of biology is what defines bioinformatics. However, there are
certain requirements about the way in which we generate and deal with the
data. To understand these requirements, we must first think about how
humans answer questions about natural phenomena and then imagine how
a computer could try to imitate such process. As humans, we approach
problems by first gathering data using our five senses – sight, smell, audition,
touch and taste. Our brains, fuzzy machines as they are, compare these
stimuli with the sum of all our memories through a process that is not fully
understood and produce information in the form of thoughts, e.g. "it smells
like something is burning in the kitchen". Somehow, the brain has processed
and analyzed the information acquired by all our senses and drawn a relevant
conclusion.
The way a computer proceeds is somehow similar and, at the same time,
very different. The five senses of humans are easily replaced by input data
files in numerous formats: images for vision, mp4 files for audition,
chromatograms for smell and taste, force sensors for touch... and many formats
with types of information that go beyond our senses. In this respect, computers
get inputs of data just like humans do, only on a much broader spectrum.
Just as computers are more versatile than humans when it comes to the
types of input data that they can receive, they can also store, access and
treat data in a more reliable way. The storage of data is done in databases,
which can be accessed via programs/computers called servers.
However, when it comes to treating the input data and turning them into
information, computers are much more limited than the human brain. Our
neuronal connections grow and shrink by themselves, learning and adapting
to new problems. Although some methods try to emulate this process in
computers, the most effective computational data analysis is still done by
telling a computer exactly what it has to do with the data in a step by
step manner. This step by step procedure, or list of elementary instructions
that we give a computer to solve a problem is called an Algorithm. Unlike
humans, who often develop their problem solving algorithms on the fly using
fuzzy steps, computers need to be told all the elementary steps in sequence
of how they should treat a given set of data. It is not always easy for the
untrained person to identify what those elementary steps might be, because
most people are not aware of what elementary operations a computer can
achieve. In general, a human-readable description of a computer algorithm
relies on a multi-scale graining, where some steps are elementary while others
are not.
In summary, the workflow in bioinformatics starts with organizing the
data using biological classifications and ontologies. The organized data
is then formatted according to a certain structure and stored in databases.
Finally, efficient and accurate algorithms are used to mine the relevant
information from the stored data.

1.4 Classifications and Ontologies


A classification is a hierarchical organization of categories in which we can
fit all instances of a specific type of data. We can think of the relationship
between levels in the hierarchy as an “is a” relationship. For instance: a fly
is a(n) insect, an insect is a(n) arthropod, an arthropod is a(n) animal.
Any species we can come up with will fit in one of the lower categories, and
each of these categories will belong to a higher one, so: E. coli is a(n)
enterobacterium, which is a Gram-negative, which is a bacterium. Classifying our data provides
a useful structure and lets the computer “pretend” to have some degree of
understanding. Let's imagine that we have a list of species: M. musculus, R.
norvegicus, E. coli, C. lupus, S. cerevisiae. For the computer, this is just a list of
words – or strings in computerspeak – that can be stored or returned to the
user. If we organize these species as we did above, we can ask the computer
to do clever things such as returning all the species in the list that happen to
be animals or deleting all bacteria from the list. Besides the immediate ben-
efit of enabling a more efficient storage and processing of the data, defining
classifications has another benefit. If we are able to develop classifications or
ontologies that reflect the nature of the data, this implies that we understand
them. This, in turn, implies that we can use our understanding to start to
engineer the systems that generate the data. Although it may seem strange,
over the long run, the second reason may also have far-reaching practical
implications.
An ontology extends the concept of classification by complementing the
“is a” relationship with other possibilities, so any classification is also an
ontology. For instance, we could take the species relationships defined above
and extend them with an “is part of” relationship. Now we can say that a head is
part of an insect and do the same for the thorax, the abdomen and six legs; our
computer will then immediately know that an ant has six legs, and so does a fly.
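As a toy illustration of how a computer can exploit these two kinds of relationships, the sketch below stores a small "is a" hierarchy and a few "is part of" facts as plain Python dictionaries and answers simple queries against the species list discussed above. The entries and helper names are illustrative assumptions, not part of any standard ontology library.

# A minimal sketch of querying "is a" and "is part of" relationships.
# The terms and helper names are illustrative, not a real ontology.
IS_A = {
    "M. musculus": "mammal", "R. norvegicus": "mammal", "C. lupus": "mammal",
    "mammal": "animal", "fly": "insect", "ant": "insect",
    "insect": "arthropod", "arthropod": "animal",
    "E. coli": "enterobacteria", "enterobacteria": "Gram-negative",
    "Gram-negative": "bacteria", "S. cerevisiae": "fungus", "fungus": "eukaryote",
}
PART_OF = {"head": "insect", "thorax": "insect", "abdomen": "insect", "six legs": "insect"}

def is_a(term, category):
    """Walk up the 'is a' hierarchy and check membership in a category."""
    while term in IS_A:
        term = IS_A[term]
        if term == category:
            return True
    return False

species = ["M. musculus", "R. norvegicus", "E. coli", "C. lupus", "S. cerevisiae"]
print([s for s in species if is_a(s, "animal")])        # keep only the animals
print([s for s in species if not is_a(s, "bacteria")])  # delete all bacteria
# Combining both relations: an ant "is a(n)" insect and "six legs" are part of
# insect, so the computer can infer that an ant has six legs.
print(is_a("ant", "insect") and PART_OF["six legs"] == "insect")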
For the purposes of this course we will use “classification” and “ontology”
interchangeably, unless we are explicitly talking about their differences.
As an example of an ontology and its usefulness for biological research we
have the gene ontology (GO). GO is a controlled vocabulary with 3 categories:
- Cellular localization (membrane, cytoplasm, etc.)
- Molecular function (kinase, oxidoreductase, etc.)
- Biological process (DNA repair, cell cycle, etc.)
Each of the categories contains various entities and there are the above
mentioned two types of relationships between them. A specific instance of
data may either be an entity (cell membrane is a membrane) or a part of
another entity (cell membrane is part of the cell).
For example, consider the gene BRCA1 from humans. This gene is asso-
ciated to chromosomes, localized in the nucleus and involved in chromosome
replication. If an ortholog of this gene is found in another animal, then one
can infer that that ortholog has a high probability of being localized in the
same compartments or involved in similar biological functions. This enables
an initial and automated annotation of the new gene that should obviously
be curated later. The GO is a good start to organize information regarding
the function of genes in an organism. However, it is far from complete and
if more specific functional information is required, GO is not the appropri-
ate classification to use. If one is interested in specific aspects of biological
systems, then it might be more appropriate to use classifications that are
developed for those aspects.
If you are interested in metabolism, a more appropriate functional clas-
sification to use might be the EC classification developed specifically for
enzymes, the EC numbers. There are currently seven top-level classes of enzymes
(the translocases, EC 7, were added in 2018), each with their
own subclasses, and so on and so forth. The classification is hierarchical and
has four levels, so every enzyme is identified by four numbers separated by
points, one number per level. This results in signatures similar to the num-
bers assigned to the books in a library using the Dewey Decimal Classification
system. The best repository for enzyme information is BRENDA.
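Because EC identifiers are hierarchical, a computer can work with them in a library-catalogue fashion. The short sketch below parses one such identifier; the top-level class names are the standard EC classes and the example, EC 2.7.1.1 (hexokinase), is only for illustration.

# A small sketch of exploiting the Dewey-like structure of EC numbers.
EC_CLASSES = {
    1: "oxidoreductases", 2: "transferases", 3: "hydrolases",
    4: "lyases", 5: "isomerases", 6: "ligases", 7: "translocases",
}

def parse_ec(ec_number):
    """Split an EC number into its four hierarchical levels."""
    levels = [int(x) for x in ec_number.split(".")]
    return {"class": EC_CLASSES[levels[0]],
            "levels": levels}  # class, subclass, sub-subclass, serial number

print(parse_ec("2.7.1.1"))
# Grouping enzymes by the first one, two or three numbers groups them by
# increasingly specific activity.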
Similar types of classification exist for transcription factors (TF). However,
unlike the EC numbers, which are a de facto standard for classifying enzyme
activities, no single TF classification is accepted as standard, unique, or
“complete” (note that EC numbers are also not complete, in the sense that new
activities can be inserted and old activities deleted and/or reclassified).
There are classifications that are specific for prokaryotes,
others for eukaryotes, and a few that mix prokaryotic and eukaryotic TF.
Molecular signal transduction or Transport functionalities have also been
organized in ways similar to transcription factors and lack a universally ac-
cepted standard. However, they all share the same type of organization. A type
of Dewey system is used to create a system of classes, sub-classes, sub-sub-
classes, and so on and so forth, that allows researchers to fit the functionalities
they are interested in into a pre-existing structure. Furthermore, this a la
Dewey approach to classifying functionality makes it easier to automate the
revision of the classes and of the class memberships.
Another type of classification that is also used is organism centric. The
primary example of this is the Human Protein Reference database. This
classification organizes protein types in the human proteome. Any protein
that does not have a functional homolog in humans is ignored by this classi-
fication. In contrast, GO classifications for specific organisms are simply an
application of the GO to the protein functions in a specific genome, without
ignoring proteins that are not found in that genome.
In summary, developing clear biological ontologies and classifications is
what allows us to organize a certain type of data in systematic ways that are
always the same. This is very important for bioinformatics because comput-
ers depend on regular data structures for efficient performance and analysis.
In addition, having such data structures permits developing optimized algo-
rithms to analyze the data. If a classification or ontology is well defined,
then one can develop very efficient algorithms that always take the same
number of instructions to execute the tasks of data organization, storage and
retrieval. Classifications and ontologies are the first step to convert data into
information.

1.5 Databases and servers


When our data are classified, we store them in a database. But what is a
database? In practical terms and for most applications biologists run into, a
database is a collection of data, organized according to a systematic classifi-
cation that is either explicit or implicit. The most popular type of database
is the relational database, where all data are organized as a set of interrelated
tables. Slide 50 shows a two-table relational database, containing information
about pictures of cells in one table and, in the other table, which cells within
a picture are alive or dead and where they are located in space. Since relational databases all
share the same structure, all the operations that we can perform on our data
can be described using the same language. The standard language used to
handle relational databases is called SQL – sometimes pronounced “sequel”
– and it can be used with all relational databases (MySQL, PostgreSQL, MS
Access, ORACLE, ...). However, the steady accumulation of biological data
has reached a point where these database technologies are becoming less and
less adequate, as they have long management times when accessing BIG DATA
(a concept that will become more prominent in biology and biomedicine in
upcoming years; see http://en.wikipedia.org/wiki/Big_data).
Big data is an all encompassing designation for collections of data that
become too large to be effectively managed or mined using traditional (SQL)
database technology. Google and other internet giants were the first to come
up against this problem. Therefore, they were also the first to develop ways to
deal with it. Mostly, they changed the way in which databases are organized
and mined, using NoSQL technology. Examples include Hadoop and MongoDB.
Often the information is stored in XML files, where the various concepts and
entities are tagged with marks that the new algorithms can find. Large data
files are split and distributed across various computing cores, thus allowing
for more efficient data management and mining. However, Hadoop methods are
very inefficient for data sets smaller than several dozen GB, when compared to
SQL methods.
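To make the two-table example above concrete, the sketch below builds a small relational database in memory and queries it with SQL through Python's standard sqlite3 module. The table and column names are assumptions made for illustration; the exact schema shown on slide 50 may differ.

# A minimal sketch of a two-table relational database (pictures of cells, and
# the cells within each picture). Table and column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.executescript("""
    CREATE TABLE pictures (picture_id INTEGER PRIMARY KEY, filename TEXT, date TEXT);
    CREATE TABLE cells (
        cell_id INTEGER PRIMARY KEY,
        picture_id INTEGER REFERENCES pictures(picture_id),
        alive INTEGER,      -- 1 = alive, 0 = dead
        x REAL, y REAL      -- position of the cell within the picture
    );
""")
con.execute("INSERT INTO pictures VALUES (1, 'plate1.png', '2021-09-09')")
con.executemany("INSERT INTO cells VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 1, 10.5, 22.0), (2, 1, 0, 48.3, 7.9)])

# SQL lets us ask structured questions, e.g. how many live cells per picture:
for row in con.execute("""
        SELECT p.filename, COUNT(*) FROM cells c
        JOIN pictures p ON p.picture_id = c.picture_id
        WHERE c.alive = 1 GROUP BY p.picture_id"""):
    print(row)   # ('plate1.png', 1)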
Having the data and the technology to manage it, store it, and mine it is
only an initial step, when it comes to the non-expert end user. Most of these
would not know how to mine the data for information if the database is not
coupled to a user-friendly webserver (or downloadable desktop application).
This is the case of the NCBI database warehouses that are connected to a
server that permits easy searching of the data by restricting the options for
data mining that users have. It is also the case of the EBI, which accesses
databases from various sources and integrates their information on the fly.
Currently, an approach like the EBI’s is fraught with dangers for sev-
eral reasons. First, the data contained in concurrent databases regarding
the same subjects might be inconsistent. For example, the latest version
of the human proteome contained in the NCBI database might not be the
same as that contained in UNIPROT. Even if they are the same, some of
the proteins may be identified using alternative nomenclatures that are not
mutually consistent.
An effort to surmount these problems is the semantic web. In very simplified
terms, the semantic web aims to annotate the information in web
pages to include the meaning of the data elements in machine readable form.
When you open a web page in your browser, for example using SAFARI or
CHROME, you see a human readable display of content and information.
However, for the browser to know how to display the content of the page,
that content is tagged. The way these tags work is very straightforward.
By indicating breaks, paragraphs, and display styles, they allow your browser
to format and show you content in a way that is, hopefully, pleasant and
easy for humans to read. The semantic web takes this concept a step fur-
ther, introducing additional tags to the content of web pages. These tags
relay semantic information regarding how the various parts of the web page
are related amongst themselves, permitting the application of simple rules to
automatically analyze web content. Taking again an example where before
you only had a sentence (say “The cat is ugly”), now you can tag the various
elements of that sentence and automatically infer how they are related to
each other: <pronoun> The <name> cat <verb> is <adjective> ugly.
This example has to do with language analysis. However, it can translate
directly to biological relationships, through the use of tags that are related to
biological content, function, and relationships. This emphasizes the impor-
tance of universal classifications and ontologies in the semantic web context.
The picture discussed above might be considered by some as too simplis-
tic, as is almost always the case in introductory discussions about a subject.
Why is that? Because it assumes rigid rules that always apply and makes
no concessions to uncertainty and error in creating and storing information.
However, there are now artificial intelligence and machine learning meth-
ods that can be applied to infer fuzzy relationships between entities in a
way that attempts to mimic human inference. These methods can also be
used to recognize entities described and/or tagged with a certain amount
of error associated to them. In broad terms, these methods work in the
following way. First, they require having a “golden set” of data containing
a large amount of data, where entities and relationships between them are
fully annotated. This set is then usually divided into a training set and
a test set. The training set is used by the methods to infer statistical and/or
morphological characteristics of the entities and their relationships. These
characteristics (for example chemical names have a usage frequency of “(”
or “-” that is very different from that observed in common words and they
have terminations that are quite specific, as is the case of methyl- or propyl-)
are transformed into mathematical constructs. These constructs are used to
calculate parameters that permit the method to identify other entities with
the same morphology and/or frequency of usage of various types of marks
and signs. Once these parameters are calculated, they create a mathematical/statistical
model of how the entities and relationships might be. This
model can then be used to analyze other datasets in the hope of inferring
entities and relationships in those new datasets. However, before one can do
so, the model must be tested to see if it really learned how to identify new en-
tities and/or relationships, rather than just having memorized the entities in
the training set. This can be done by applying the model to identify the en-
tities/relationships in the test set that was derived from the golden standard
data set. If the method is able to correctly identify the entities/relationships
in this set, then it can be used in other data sets. If not, then we know that
the method memorized the information in the training set but did not learn
the rules for identifying new entities/relationships.
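The sketch below is a toy illustration of this workflow: a small "golden set" of strings labelled as chemical names or common words, a simple character-frequency feature like the one mentioned above, a rule learned on the training half, and an evaluation on the held-out test half. All the data, thresholds and names are invented for illustration and do not correspond to any real entity-recognition method.

# A toy train/test sketch for recognizing chemical-name-like strings from the
# frequency of characters such as "(", "-" and digits. Data are invented.
def features(word):
    special = sum(ch.isdigit() or ch in "()-," for ch in word)
    return special / len(word)          # fraction of "chemical-looking" characters

golden_set = [
    ("2-methylpropan-1-ol", 1), ("2,3-bisphosphoglycerate", 1),
    ("1,3-bisphosphoglycerate", 1), ("1,2-dioleoyl-sn-glycerol", 1),
    ("kitchen", 0), ("burning", 0), ("memories", 0), ("computer", 0),
]
train, test = golden_set[::2], golden_set[1::2]   # split the golden set in two

# "Training": average feature value per class, i.e. a one-parameter model.
centroid = {label: sum(features(w) for w, l in train if l == label) /
                   sum(1 for _, l in train if l == label) for label in (0, 1)}
threshold = (centroid[0] + centroid[1]) / 2

# "Testing": check whether the learned rule generalizes to unseen examples,
# rather than having simply memorized the training set.
correct = sum((features(w) > threshold) == bool(label) for w, label in test)
print(f"test accuracy: {correct}/{len(test)}")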
Taken together, the development of a mature semantic web, the applica-
tion of biological classifications and ontologies, and the creation of accurate
and fast machine learning/artificial intelligence methods can open in the fu-
ture a new era for biologists. The conjunction of the three might enable an
automated, accurate, and fast information extraction and data integration
from the large amounts of datasets that are accumulating in molecular and
cellular biology.
In addition, artificial intelligence methods can now be trained to orga-
nize, integrate, and analyze data. These methods could, in principle become
a layer of data integration and analysis that would tremendously facilitate
our handling of biological and clinical data to extract information from them.
Many databases of molecular information exist and are freely available on-
line. While there were, for a long time, problems with data integration between
databases and servers, web servers such as NCBI, EBI, or UNIPROT provide
raw data stored in perpetually updating databases, as well as webservers that
integrate individual tools to mine the information contained in that data.
These tools can perform functions that go from comparing sequences to ana-
lyzing gene expression, predicting protein structure or function, and creating
mathematical or statistical models. Currently, many of these databases are
linked and data integration and homogenization is much less problematic
than it was a few years ago.
Many of these tools are also available for individual download and us-
age outside of the web server. More advanced users can create pipelines or
workflows where these tools are used in succession to consecutively process
the output of one another and create multidimensional analysis of complex
biological datasets. Such pipelines/workflows can be created using platforms
such as Galaxy or Taverna. These platforms allow users to plug in pro-
grams that do the individual steps of the workflow in lego-like fashion and
pipeline the outputs between those programs. In order to do so efficiently,
users should carefully consider what they want to do and spend the time to
develop an efficient workflow algorithm before implementing it in their
favorite platform.

1.6 File formats


Since much of bioinformatics is about finding, storing and sharing information,
file formats play a very important role. As a sample of the innumerable
file formats used in bioinformatics, we will present two of the most
popular: the FASTA and the FASTQ formats.

1.6.1 FASTA
The FASTA format is used to store nucleotide or amino acid sequences. Any
FASTA file begins with a description line followed by the sequence itself. For
instance:
>gi|186681228|ref|YP_001864424.1| phycoerythrobilin
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
The description line begins with the greater than symbol “>” and the
sequence uses one letter codes. The sequences are case insensitive, with
gaps in the sequence being represented by a single hyphen regardless of their
length. In nucleotide sequences, the bases are represented by the usual codes:
A, C, T, U and G. Unknown nucleotides are represented with the letter N.
The format accepts other codes for purines, pyrimidines, etc. In sequences
of amino acids, unknown residues are represented by the letter X, the end of
translation by an asterisk (*), and each of the amino acids is represented by
its standard single-letter abbreviation. Single-letter abbreviations were defined
by one of the pioneers of bioinformatics: Dr Margaret O. Dayhoff. In order
to make the one letter codes as easy to remember as possible, Dayhoff chose
them according to the following logic:
- Amino acids cysteine, histidine, isoleucine, methionine, serine and valine
take the first letter of their names, since there are no other amino acids
starting with the same letter.
- In cases where two or more amino acids start with the same letter, the letter
is used for the most abundant one. Thus alanine, glycine, leucine, proline
and threonine are coded by their initial.
- For the less abundant amino acids, some could be associated with a letter
due to a phonetic affinity in their names. This is the case of aRginine (R),
PHenylalanine (F), tYrosine (Y) and tWyptophan (W).
- After the assignments above, there were not many letters left, so Dayhoff
assigned K to lysine since K is close to L in the alphabet. For the pairs
aspartate/asparagine and glutamate/glutamine, Dayhoff assigned the letters
closer to the beginning of the alphabet to the smaller molecules: aspartate (D)
/ asparagiNe (N), and those closer to the end to the bigger ones: glutamate (E)
/ glutamine (Q).
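As a practical complement to the description above, the sketch below is a minimal FASTA reader: description lines start with ">", everything else is sequence. It keeps whole sequences in memory, does no validation, and "example.fasta" is only a placeholder file name.

# A minimal sketch of a FASTA parser for the format described above.
def read_fasta(path):
    """Return a dict mapping each description line (without '>') to its sequence."""
    records, header, chunks = {}, None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    records[header] = "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line.upper())   # sequences are case insensitive
        if header is not None:
            records[header] = "".join(chunks)
    return records

if __name__ == "__main__":
    for name, seq in read_fasta("example.fasta").items():
        print(name, len(seq), seq[:30] + "...")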

1.6.2 FASTQ
The FASTQ format is used to report sequencing results and normally uses
four lines per sequence, for example:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

- The first line begins with a “@” character and is followed by a sequence
identifier and an optional description.
- The second line contains the raw sequence.
- The third line begins with a “+” character and is optionally followed by the
same sequence identifier (and any description) again.
- The final line encodes the quality values for the sequence in Line 2, and
must contain the same number of symbols as letters in the sequence.

Each symbol in the last line encodes the quality of the corresponding position
in the sequence; the symbols follow an arbitrary order from the lowest quality
(!) to the highest (~).
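The sketch below decodes the quality line of the record shown above. It assumes the common Sanger/Illumina 1.8+ convention, in which the Phred quality of a base is ord(symbol) - 33, so "!" is quality 0 and "~" is quality 93; other (older) encodings use different offsets.

# A small sketch that decodes FASTQ quality symbols into Phred scores,
# assuming the Sanger/Illumina 1.8+ offset of 33.
def phred_scores(quality_line, offset=33):
    return [ord(symbol) - offset for symbol in quality_line]

sequence = "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT"
quality  = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"

scores = phred_scores(quality)
assert len(scores) == len(sequence)   # one quality symbol per base
# A Phred score Q corresponds to an error probability of 10**(-Q/10).
error_probabilities = [10 ** (-q / 10) for q in scores]
print(scores[:5], error_probabilities[:5])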

1.7 Outlook
While this is an introduction to bioinformatics, it is important to have some
perspective regarding how professional bioinformaticians work. Professional
bioinformaticians need to ensure that their pipelines and code work repro-
ducibly over time and across datasets. A pipeline needs to work both in last
year's and in today's version of Linux/Windows/MacOS, and across operating
systems. To ensure this, repositories of pipelines are created that can be
downloaded and run on different computers. For example, using programs
like Docker, which work in various operating systems, you can install virtual
machines on your computer that ensure the pipeline works as it did originally.
This is done by creating a sort of “mirror image” of the computer where the
software was originally run and installing it on your own computer.
Having these pipelines also allows different people and teams to modify
the pipeline and apply it to their local problems. While learning how to
code is fundamental, reusing code decreases the overhead of learning how
to program, as you can learn by updating and adjusting already functional
examples.
Many bioinformatics communities exist around the world. The bioinformatics
infrastructure in the EU is organized around ELIXIR, an ESFRI (European
Strategy Forum on Research Infrastructures) project. On ELIXIR's web page
you will find repositories and resources that greatly facilitate bioinformatics
practice and training. In Spain, ELIXIR is represented by INB (Instituto
Nacional de Bioinformatica), which hosts many useful resources. INB also
coordinates TRANSBioNET, a community that provides services for clinical
applications of Bioinformatics. IRBLleida is a part of this community.
CHAPTER 2

Bioinformatics with DNA

2.1 Genome sequencing


The word genome comes from German (Genom) and was coined in 1920 by
the botanist Hans Winkler. The term combines the words gene and chromo-
some and referred to the totality of the genes in an organism, thus fitting
with collective nouns such as biome or rhizome. Since then, the termination
-ome has been taken to mean the complete set of a certain class. The tech-
niques or disciplines that study different elements of the cell collectively have
received the name -omics, like genomics, proteomics and transcriptomics.
Now we understand the genome as the sequence of DNA contained within
a cell that contains all necessary information to synthesize and assemble
every protein or functional RNA that an organism can produce. All living
organisms have a genome that is unique to them. The genome contains all
the information needed to create a new organism and therefore having the
genome sequence will in principle allow us to understand how that organism
might work.
The size of a genome is usually measured by the number of bases it
contains. Genome sizes can vary widely. The smallest known genomes belong
to some retroviruses. These genomes contain around 3.5 kilo base pairs (3500
bases), but they can be as small as 1.759 kbp (Porcine circovirus type 1).
The largest known genomes tend to be those of plants, at tens to hundreds
of Gbp (note, however, that the Polychaos dubium amoeba has been reported to
have the largest known genome, at a whopping 670 Gbp; these numbers are now
disputed).

Figure 2.1: Typical genome sizes

Due to these sizes, classical DNA sequencing methods were not fit for use
in sequencing full length chromosomal DNA. After several attempts, whole
genome shotgun sequencing became the de facto standard method for full
genome sequencing. This method works as follows.

1. DNA is isolated and randomly broken into small pieces of equal length.
Since the sample contains more than one copy of the genome, each
sequence appears in several fragments. The number of repetitions is
known as coverage. For example, in a sample that contains 10 genomes,
the coverage of the sequencing is 10X. The pieces must be smaller than
about 1000 bps.

2. Each small piece is sequenced. The sequence of each fragment is called a
read. Most sequencing methods lose reliability as the sequence progresses,
which is why the fragments have to be small.

3. The pieces are then assembled in puzzle-like manner, by overlapping the
edges of the small sequences (a toy sketch of this overlap-based assembly is
given below).
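To illustrate the "puzzle" idea of step 3, the sketch below greedily merges the pair of reads with the largest suffix/prefix overlap until no overlaps remain. The reads are invented, error-free and from a single strand; real assemblers (overlap-layout-consensus or de Bruijn graph based) are far more sophisticated.

# A toy sketch of greedy, overlap-based read assembly. Not a real assembler.
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (at least min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        # Find the pair of reads with the largest overlap and merge them.
        best = max(((overlap(a, b), a, b) for a in reads for b in reads if a != b),
                   key=lambda t: t[0])
        size, a, b = best
        if size == 0:                        # no overlaps left: coverage gap
            break
        reads.remove(a); reads.remove(b)
        reads.append(a + b[size:])
    return reads

reads = ["ATGGCGTGCA", "GCGTGCAATG", "CAATGGTCTT"]
print(greedy_assemble(reads))   # -> ['ATGGCGTGCAATGGTCTT']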

The methods described above face difficulties in long repetitive regions of
the genome, since there is no telling how many times a sequence is repeated
just by looking at the fragments. Currently, newer methods produce what
are called “long reads”. These are continuous reads of up to 100Kb (but
on average of about 10kb). These methods may be especially important in
identifying long genomic repetitions and copy number variants (note that a
technology that is expected to continuously read a sequence until a signal
concludes the read is under development; this could change the game of genome
sequencing). In general, long-read methods produce lower-quality sequences
that are prone to error,
so they tend to be used in combination with the traditional approach. In-
dependently of the specifics of method, they all produce a high number of
short sequences that one then needs to perform quality control on before
assembling. The quality control of the short sequences and their assembly
into larger ones are thus a common problem to all these methods.
As stated above, sequencing tends to lose quality as it progresses so the
ends of a sequence tend to have lower quality than the inner elements. When
the drop in quality is bad enough, removing the bases from the ends of some
reads may be necessary; this process is called read trimming. Read trimming
software uses a combination of two methods: running-sum and window-based
methods.
Running-sum algorithms use the quality information about base calling in the
sequencing process. This information is given by the probability P that the
base was called incorrectly (the error probability). You then set a threshold
error probability, P_limit. The score for each base is given by P_limit − P.
For every base, the running sum of this value is calculated; if the sum drops
below zero, it is set to zero. The part of the sequence that is not trimmed is
the region between the first positive value of the running sum and the position
where the running sum reaches its maximum. Everything before and after this
region is trimmed. These methods can therefore flag bases to be trimmed
anywhere along the read.
Window-based methods are mostly applied to trim 3′-end bases, which often are
contaminants from the elongation process. They work like the running sum,
except that they consider the scores of n neighboring residues and calculate
the average quality within that window. If the quality is below a certain
threshold, the window is eliminated (see the sketch below).
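The sketch below implements both trimming strategies as just described, assuming per-base error probabilities are already available (e.g. converted from FASTQ Phred scores). The thresholds, window size and example values are illustrative assumptions.

# Sketches of running-sum and window-based read trimming.
def running_sum_trim(error_probs, p_limit=0.05):
    """Keep the region between the first positive running sum and its maximum."""
    running, start, best_pos, best_sum = 0.0, None, None, 0.0
    for i, p in enumerate(error_probs):
        running = max(0.0, running + (p_limit - p))   # clip the sum at zero
        if running > 0 and start is None:
            start = i                                  # first positive value
        if running > best_sum:
            best_sum, best_pos = running, i            # highest value of the sum
    if start is None or best_pos is None:
        return (0, 0)                                  # nothing worth keeping
    return (start, best_pos + 1)                       # half-open interval to keep

def window_trim(error_probs, window=4, p_limit=0.05):
    """Drop trailing windows whose average error probability is above the threshold."""
    end = len(error_probs)
    while end >= window:
        if sum(error_probs[end - window:end]) / window <= p_limit:
            break
        end -= window
    return end

errors = [0.001, 0.001, 0.01, 0.02, 0.02, 0.05, 0.2, 0.4, 0.6, 0.7]
print(running_sum_trim(errors))   # region of the read to keep
print(window_trim(errors))        # new 3' end after window trimming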
Once the reads are trimmed and clean, they have to be assembled. Com-
puters do so as if assembling a puzzle. The major problems in this assembly
are two-fold. First, if the coverage is low, there might be gaps in the reads
and the whole genome can not be assembled. Second, even if the coverage
is high, long repeats may prevent the assembly of the sequences into indi-
vidual chromosomes. There are however situations where sequencing and
assembling whole genomes is either not needed or not possible.
Whole genome sequencing is often not needed when a (set of) reference
genome(s) is available for the organism of interest. In such a case, one can use
a technique that cuts the genome you want to sequence into smaller pieces,
followed by hybridization of those pieces to a corresponding set of pieces from
the reference genome. The pieces from the reference genome that do not fully
hybridize with the genome you want to sequence identify regions of the latter
where variation against the reference exists. Thus, you isolate those pieces
from the new genome, sequence only them, and replace the corresponding
sequences in the reference genome.
In addition, there are approaches to genome sequencing that do not re-
quire whole genome assembly. For example, when one is interested in identi-
fying the parts of the genome that can be expressed, then one can isolate the
full mRNA complement of the cell and sequence that complement. In general
this will not identify the parts of the chromosomes that are not expressed,
and one ends up with a library of ESTs (expressed sequence tags). Another
major example of a situation where whole genome assembly might be im-
possible is when sequencing a metagenome (the genome of all the organisms of a
given environment). Metagenomes are sequenced
by isolating the DNA or mRNA from environmental samples followed by se-
quencing of that DNA. In most cases this means that one can not accurately
attribute a given gene to a specific organism, although this is not always
true. Using the statistical properties of DNA/mRNA (see below, section 2.2
) one can often identify the genus of the organism to which the gene belongs.

2.2 Genome annotation


Once we have genome sequences, what can we do with them? The sequence
in and of itself is useless if we don’t know how to interpret it. There are
several aspects and levels of the sequence that we need to interpret. It is
likely that we are still unaware about some of these levels. Let us start
with the most obvious one: Where are the genes? What is their structure?
Genes code for proteins and structural and catalytic RNAs that perform
most of the functions required for cell survival and reproduction. In addition,
can we identify regulatory elements and alternative splicing? Experimental
annotation is the only foolproof way to annotate genomes. However, this is
impractical and cannot be done. Hence, computational methods are required
in order to make sense out of genome sequences.
There are two general approaches to predict where a gene is in a newly
sequenced genome: homology and ab initio approaches.

2.2.1 Homology approaches to detecting genes


Homology approaches rely on the assumption that genes (proteins) with sim-
ilar function will have similar sequences. Hence, they compare known gene
sequences from databases against the assembled genome and identify where a
similar sequence might occur in the genome. This type of pairwise sequence
comparison can be done through several tools. The two most widely used
are BLAST and HMMER, and they rely on different models that describe
how gene sequences evolve. These models are coded using substitution ma-
trices. Substitution matrices estimate the log-likelihood score that a given
DNA base (or protein amino acid) is replaced by an alternative base (or amino
acid). Homology detection of genes in a new genome is more effective if
gene/genome sequences of closely related organisms are available for
comparison. On average, the farther apart two organisms are on the
evolutionary scale, the less precise the gene sequences from one of the
organisms will be in predicting genes for the other organism.

Substitution matrices

Margaret Dayhoff was the first person to calculate these matrices. She did
so in the following way. She collected and aligned sequences of proteins with
similar function (orthologs). Then, she analyzed the alignments and calcu-
lated the probabilities of a given residue being replaced by another in those se-
quences. These probabilities were then transformed into log-likelihood scores
and used to create programs that more quickly align sequences. BLAST and
HMMER use different types of substitution matrices.
BLAST takes a bulk approach, based on overall residue substitution ma-
trices. These matrices are similar to those calculated by Dayhoff. An example
is shown in slide 39 of Class 2.1 powerpoint. It is easy to see that residues that
align with themselves have high log likelihoods (look at the diagonal elements
of the matrix), while residues that align with other residues that have very
different properties have negative log likelihoods for replacing each other
during evolution. How are these matrices used to estimate if an alignment
between two sequences is of high quality? An example can be seen in slide 40
of the same powerpoint. If you have the alignment, you use the matrix to
calculate the score of that alignment in the following way (a minimal scoring
sketch in code is given after the list):

1. Look at each position in the alignment. Attribute the relevant score from
the matrix to that position. For example, if C is aligned with C, this
position is worth 9 points. If C is aligned with K, this position is worth -3.

2. It is important to attribute a score to indels (gaps) in the alignment.
Typically, creating a gap is penalized with a high negative score. Extending
that gap is also penalized, although the gap extension penalty is not as high
as the gap opening penalty. Values for gap opening and extension are often
determined empirically.

3. Sum the scores of all positions in the alignment. The total score reflects
the overall similarity of the two sequences. The higher the global score, the
better the alignment.
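The sketch below scores one pairwise alignment following the three steps above, with a substitution matrix and affine gap penalties. Only a few matrix entries are included (values consistent with BLOSUM62, e.g. C/C = 9 and C/K = -3, as in the slide example); the gap penalties and sequences are illustrative choices, not universal constants.

# A minimal sketch of scoring an alignment with a substitution matrix and gaps.
SCORES = {("C", "C"): 9, ("C", "K"): -3, ("K", "K"): 5,
          ("A", "A"): 4, ("G", "G"): 6, ("W", "W"): 11}

def pair_score(a, b):
    return SCORES.get((a, b), SCORES.get((b, a), 0))

def alignment_score(seq1, seq2, gap_open=-10, gap_extend=-1):
    assert len(seq1) == len(seq2), "aligned sequences must have equal length"
    total, in_gap = 0, False
    for a, b in zip(seq1, seq2):
        if a == "-" or b == "-":
            total += gap_extend if in_gap else gap_open   # opening costs more
            in_gap = True
        else:
            total += pair_score(a, b)
            in_gap = False
    return total

print(alignment_score("ACKGGW", "ACK--W"))   # 4 + 9 + 5 - 10 - 1 + 11 = 18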

Now that the use of these matrices to estimate the quality of an alignment is
hopefully clear, let us look at how we can calculate these matrices. Start with
a multiple alignment. Calculate the frequency of each residue type in that
alignment (represented by q_i). Calculate the frequency of mutations p_ij (of
residue type i into residue type j). The log-odds ratio is given by
log(p_ij / (q_i q_j)). The type of substitution matrix that one calculates
depends on the multiple alignment one uses. For example, PAM (Point Accepted Mutation)
matrices align proteins that are closely related and are a reasonable model
for this type of sequences. BLOSUM N matrices align sequences of proteins
that are at most N% similar. In addition, while PAM matrices calculate sub-
stitution frequencies over the entire alignment, BLOSUM matrices do so over
ungapped blocks of the alignment. Other types of substitution matrices exist
and the choice of the appropriate one is case specific. The main limitations of
using these matrices to score alignments are that they assume a constant rate
of evolution over the entire range of the sequences and disregard any long
range effects of residues on the conservation of the sequences. Nevertheless,
they work quite well in aligning sequences that are larger than 100 residues
and have a sequence identity higher than 30%.
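The sketch below makes the log-odds calculation concrete: count residue frequencies q_i, count aligned-pair frequencies p_ij over an ungapped block, and take log(p_ij / (q_i q_j)). The tiny alignment is invented and far too small to give meaningful values; it only illustrates the computation.

# A toy sketch of computing log-odds scores from an ungapped alignment block.
from collections import Counter
from itertools import combinations
from math import log2

block = ["ACKW", "ACKW", "ACRW", "SCKW"]          # four aligned sequences, no gaps

residues = Counter(ch for seq in block for ch in seq)
q = {r: n / sum(residues.values()) for r, n in residues.items()}

pairs = Counter()
for column in zip(*block):                         # walk the alignment column by column
    for a, b in combinations(column, 2):           # every pair of sequences in the column
        pairs[tuple(sorted((a, b)))] += 1
p = {pair: n / sum(pairs.values()) for pair, n in pairs.items()}

def log_odds(a, b):
    expected = q[a] * q[b] * (1 if a == b else 2)  # factor 2: (a,b) and (b,a)
    return log2(p[tuple(sorted((a, b)))] / expected)

print(round(log_odds("C", "C"), 2), round(log_odds("K", "R"), 2))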

BLAST
The most widely used program to perform alignment of new genome se-
quences to sequences of genes that are present in databases is BLAST. This
program is very fast and is able to compare a given sequence to a database
containing tens to hundreds of millions of other sequences on a timescale of
seconds to minutes. This efficiency relies on a heuristic 3-stage algorithm to
perform the comparison (a simplified sketch of the first two stages follows the
list below).
1. The first stage helps the program decide which sequences should be
compared. BLAST does this in the following way:

(a) Divide all sequences in the database into small n(=6 by default)
letter words.
(b) Divide the query sequence into small n(=6 by default) letter words.
(c) Set a minimum score Tmin below which an alignment between
two n-letter words is discarded.
(d) Compare all words from the query sequence to all words from
the database sequences. Discard those database words that score
below Tmin .

2. The second stage helps BLAST discard proteins from the database to
which the query sequence should not be aligned. It does so in the
following way:

(a) Collect all database n-letter words that scored above Tmin in stage
1.
(b) Identify proteins that do not contain any of these words.
(c) Discard those proteins.

3. The third and final stage of the alignment simply takes the words that
match between the query sequence and the database and extends the
alignment, using the substitution matrix of choice. If extending the
alignment decreases the score below a certain limit, this subject se-
quence is discarded. Alignment extension is done locally, as global
optimization would be too slow. Once the alignments are finished,
they are ranked and ordered by BLAST.
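The sketch below is a much-simplified version of the first two stages just described: break the query and the database sequences into n-letter words, then discard database entries that share no high-scoring word with the query. For brevity it keeps only exact word matches, whereas real BLAST scores word pairs with a substitution matrix against the threshold T_min; the sequences and names are illustrative.

# A simplified sketch of BLAST-style word seeding and database filtering.
def words(seq, n=6):
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}

def candidate_subjects(query, database, n=6):
    """Return only the database entries worth extending alignments against."""
    query_words = words(query, n)
    return {name: seq for name, seq in database.items()
            if query_words & words(seq, n)}            # at least one shared word

database = {
    "protA": "MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGK",
    "protB": "MKLVVLGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGE",
    "protC": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDE",
}
query = "GAGGVGKSALTIQ"
hits = candidate_subjects(query, database)
print(sorted(hits))   # ['protB', 'protC'] share 6-letter words with the query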
There are several metrics that BLAST can use to rank the similarity
between the subject sequence and the query sequence. The most used metrics are
the total score or the e-value of the alignment. The total score is calculated
as described above, by adding the scores of the individual positions. This
total score can be transformed into other metrics, such as the bit score, that
we won’t discuss here. The e-value measures the number of times you would
expect to find simply by chance an alignment between the query sequence
and a sequence from the database with the score you have. The e-value is given
by E = K · m · n · e^(−λS), where m is the length of the query sequence, n is
the length of the subject sequence, S is the score of the alignment, and K and
λ are database-specific parameters. The lower the e-value, the less likely it
is that your sequence is identified by chance. Typically, e-values between
10^(−10) and 10^(−30) identify homologous genes, while e-values below 10^(−30)
identify orthologous genes. However, strictly speaking, these boundaries depend
on the database of sequences one is using.
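A small numerical illustration of the formula above follows. The parameter values (K = 0.13, λ = 0.32) are illustrative assumptions; real values are reported by BLAST for each database and scoring system.

# Evaluating E = K * m * n * exp(-lambda * S) for a few alignment scores.
from math import exp

def expected_hits(score, query_len, subject_len, K=0.13, lam=0.32):
    return K * query_len * subject_len * exp(-lam * score)

# The same score is less significant against a longer subject (or larger
# database), and more significant as the score S grows.
for S in (30, 60, 90):
    print(S, f"{expected_hits(S, query_len=300, subject_len=1_000_000):.2e}")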

HMMer and Hidden Markov Models of sequence evolution


HMMER is like BLAST in the sense that it also uses a similar heuristic ap-
proach to align a sequence to those of a database. However, instead of using
a simple substitution matrix such as the BLOSUM or PAM matrices, it uses
a hidden Markov model (HMM) profile to perform the alignment/recognition of
similar sequences. These HMM profiles are also statistical models of how
sequences evolve. However, they look at the specific positions of the alignment
and calculate probabilities and log-scores that are position specific. This
permits a more accurate alignment than BLAST. The caveat is that, for the
position-specific HMM models to be accurate, they need to be individually
created for each cluster of orthologous or homologous genes, as opposed to the
substitution matrices, which are more generally applicable.

2.2.2 Ab initio approaches to detecting genes


Another type of approach to detecting genes in new genomes is to use ab
initio methods. These methods can be further subdivided into signal sensor
and content sensor methods.

Signal sensor methods look into the sequence to identify known signals
that are associated to genes. For example, a very simple signal method might
go through the entire DNA sequence and identify all ATG start codons,
followed by TGA stop codons at a certain distance that is larger than a
given number of residues and call that a gene. More sophisticated methods
would also look for Shine-Dalgarno sequences somewhere close to the ATG, as
these represent preferential ribosome binding sites. Promoter and regulatory
binding sites can also be used to improve the gene model, as can many other
signals.
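The sketch below implements the very simple signal sensor just described: scan a DNA strand for an ATG start codon followed, in the same reading frame, by a stop codon at least a minimum number of codons away. Real gene finders add many more signals (Shine-Dalgarno motifs, promoters, etc.); the example sequence and the length cutoff are invented.

# A sketch of a naive ORF-based signal sensor on one strand and three frames.
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=10):
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        for end in range(start + 3, len(dna) - 2, 3):        # stay in frame
            if dna[end:end + 3] in STOP_CODONS:
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end + 3))             # candidate "gene"
                break
    return orfs

dna = "CCATGGCTGCTAAAGGTGCTGCTGCTGGTGCTAAAGCTGCTGGTAAATAACC"
print(find_orfs(dna, min_codons=5))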

Content sensor methods rely on the different statistical properties of coding
and non-coding sequences. For example, the GC content of coding and non-coding
sequences in a genome is different. Hence, by calculating the GC content of
known genes from the organism and identifying regions of the genome that have a
similar content, we identify likely candidates for
genes. Another clear example of a content sensor is the use of synonymous
codon utilization to determine whether a predicted gene is likely to be a real
gene or not. Synonymous codons are not used with the same frequency in
gene coding sequences and different organisms have different preferences for
which synonymous codons are most frequently used.
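The sketch below illustrates the simplest content sensor mentioned above: compare the GC content of sliding windows along a genome with the average GC content of known genes. The window size, tolerance and sequences are illustrative assumptions, not values from any real annotation pipeline.

# A sketch of a GC-content "content sensor" with a sliding window.
def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_like_genes(genome, gene_gc, window=60, step=30, tolerance=0.05):
    """Return windows whose GC content is within `tolerance` of the gene average."""
    candidates = []
    for start in range(0, len(genome) - window + 1, step):
        region = genome[start:start + window]
        if abs(gc_content(region) - gene_gc) <= tolerance:
            candidates.append((start, start + window, round(gc_content(region), 2)))
    return candidates

known_genes = ["ATGGCGCCGGTTACGGCGCTGTGA", "ATGCCGGCGCTGACCGGCAAATGA"]
gene_gc = sum(gc_content(g) for g in known_genes) / len(known_genes)

genome = ("ATATATATTTAATATATATTATATATTAAATATTA"                    # AT-rich, non-coding-like
          "ATGGCGCTGGCCGGTACCGGCGCGCTGGCCACCGGCGTGCTGGCGGGTTGA"    # GC-rich, gene-like
          "TTATATATATTAAATATTATATATATTTATATATA")
print(round(gene_gc, 2), gc_like_genes(genome, gene_gc))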

2.2.3 Final notes on genome annotation


It should be stressed that none of the methods described above, in and of
themselves, is sufficient to accurately annotate a genome. New pipelines that
combine all methods in statistically accurate ways to predict gene borders
and regulatory regions are continuously being developed. These pipelines are
often organism specific. For example, in eukaryotes, such a pipeline could
use a HMM of gene structures to predict where genes are and what they look
like, as illustrated in slide 58.
Genome annotation typically goes through two stages. A first stage takes
the complete sequence of a chromosome and automatically combines the var-
ious methods discussed above (and others) in various ways and, by applying
rigorous statistical models of what genes look like, identifies probable genes
in the sequence of the genome. Once this automated stage is over, a stage of
human curation is used to clean up and complete the annotation. The first
stage of the process can be done using various online servers or pipelines.
Some of these can also be installed locally and run in house. An example
of a server that automatically annotates bacterial and archaeal genomes is
RAST.
We should also note that eukaryotic genomes are more difficult to anno-
tate than prokaryotic ones. There are several reasons for this. Here are a
few:

1. Eukaryotic genomes are typically much sparser in gene density than
prokaryotic genomes. In other words, a much larger fraction of eukaryotic
genomes does not code for proteins. This makes it more difficult to accurately
predict gene borders.

2. The structure of promoters and regulatory elements is more complex in
eukaryotes. Promoters and regulatory elements can be found at larger distances
from the start of coding sequences, and this makes it more complicated to
annotate eukaryotic genes.
3. Eukaryotic gene structures are on average more complicated than prokaryotic
gene structures, because there are often introns that need to be
processed and removed from the mRNA before it can be sent to the
cytoplasm and translated into protein by the ribosomes. Predicting in-
trons and exons can be done in the same way as predicting genes, either
by homology or by ab initio methods. However, because the sequences
are smaller, the statistical power of the various available methods is
less than that for longer genes. We further remark that most of these
more problematic characteristics of eukaryotic genomes are also found
at a much lower frequency in prokaryotic genomes.

Finally, let's add just some considerations about predicting genes that do
not code for proteins. RNA genes may be a throwback to the RNA world,
which is likely to be the original way in which transmission of genetic infor-
mation has evolved. However, it might also be that RNA genes are more
recent and have re-evolved after the DNA→RNA→Protein world has ap-
peared. Whatever the truth about the evolution of RNA genes is, the fact
is that they exist and perform regulatory, structural and catalytic functions.
Predicting non-coding RNA genes can be done using methods that are equiv-
alent to the ones described for finding protein coding genes. In addition, there
is a set of methods that is specific for RNA, which rely on the assumption
that ncRNA genes code for RNA sequences that are thermodynamically sta-
ble. Hence, one can systematically analyze intergenic regions, predict their
folds and calculate the thermodynamic stability of those folds. Stabler RNA
structures are then flagged as possible ncRNA genes. It should be noted that
this class of methods has been called into question, when a study of the
thermodynamic stability of random RNA and ncRNA showed that their stabilities
were not significantly different.
In fact, lack of statistical significance is a problem in predicting ncRNA
genes. Because they are smaller than protein coding genes, the statistical
signal that distinguishes random RNA sequences from ncRNA sequences is
smaller and these genes are harder to detect.

2.3 The dynamics of the genome: changes in genome structure
Response of organisms to changing environmental conditions occurs at var-
ious levels. The most sustained changes occur at the gene expression level.
These changes occur on a time scale between minutes and hours and allow
the organism to down-regulate the production of proteins that it does not
need in the new environmental conditions, while upregulating the production of
those it needs. These changes correlate with changes in the 3D contact struc-
ture of the genome. To change the patterns of gene expression chromosomes
open or close different parts of the DNA double helix, allowing for different
proteins to bind the DNA. This changes the relative position of the vari-
ous parts of the chromosome with respect to each other, as the chromosome
moves in the nucleus to accommodate the changes to DNA accessibility. nC-
methods have been developed and applied over the last decade to measure
the dynamics of these 3D structural changes in chromosomes. nC-methods
allow chromosome conformation capture, where n can be replaced by 3 (3C), 4C,
5C, or Hi-C. What do these methods do?
Classically, cells are treated with formaldehyde in order to create the
chemical conditions that lead to crosslinking between DNA and proteins.
Then, the DNA is exposed to restriction enzymes, and sites that are not
protected by crosslinkage are digested. Crosslinked fragments are then ligated
to form hybrid DNA molecules and isolated. Crosslinkage is then reversed
and DNA is analyzed. Classical 3C experiments analyze the fragments one
at a time, using PCR and site specific primers. This limits the number of
possible interactions they are able to detect. 4C methods differ from 3C
methods in that they then use inverse PCR to generate the complementary
DNAs and then amplify and hybridize them against whole genome sequence
arrays that permit identifying the exact linear places where the sequences
originate in the genome. Typically, there is one microarray per locus you
want to run against the genome. 5C methods extend this and hybridize a
set of loci against another set of loci, followed by sequencing. HiC methods
are the unbiased extension of the nC Methods. Chromatin is digested with
a (set of) restriction enzyme(s). The ends of the fragments are filled in with
nucleotides, and one of these is biotinylated. Ligation between the two ends
is performed and the DNA is purified and sheared. Biotinylated junctions
are isolated with streptavidin beads and identified by paired-end sequencing.
Taken together this data can be used to identify looping in the chromosomes,
characterize spatial genome compartments, identify topologically associated
domains (TADs) and even create 3D models of the chromatin.
Initial analysis of this data can be done using methods that are similar
to those described above for identifying sequences. However, once that part
is over, the interactions can be represented in different ways. 3C interaction
data can be represented as shown in slide 64. The reference sequence in that
plot is the gene promoter. Then, one represents on the x-axis the chromosome
distance from that promoter and on the y-axis the frequency with which each
region is found to be interacting with the original region. 4C data is usually
represented using two plots. One is the same as that for
3C data. In the other plot the x-axis represents the same, but the y-axis
represents the sensitivity of each chromosomal site to the specific DNAse(s)
[restriction enzymes] used to digest the DNA. 5C and HiC data pertain to
larger genomic regions and therefore can be better represented using heat
maps, as shown in slide 65 of Class 2.1 powerpoint.
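As a rough illustration of how this kind of paired-end interaction data can be turned into the matrices behind such heat maps, the short Python sketch below bins contact coordinates from a single chromosome into a symmetric contact matrix. The input format, bin size and toy coordinates are invented for illustration and do not describe any particular nC pipeline.

import numpy as np

def contact_matrix(pairs, chrom_length, bin_size=100_000):
    """Bin paired-end contacts from one chromosome into a symmetric
    contact matrix of interaction counts."""
    n_bins = chrom_length // bin_size + 1
    matrix = np.zeros((n_bins, n_bins), dtype=int)
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i, j] += 1
        if i != j:
            matrix[j, i] += 1   # keep the matrix symmetric
    return matrix

# Toy example: three ligation junctions on a 1 Mb chromosome
pairs = [(120_000, 840_000), (130_000, 850_000), (500_000, 510_000)]
print(contact_matrix(pairs, chrom_length=1_000_000))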
CHAPTER 3

Bioinformatics with RNA

3.1 Transcriptomics
3.1.1 Macro/micro arrays
The development of DNA macro and microarrays opened the door to measur-
ing changes in gene expression simultaneously for all the genes in a genome.
This enabled us to understand how the entire gene expression complement of
an organism changes in response to any type of challenge. RNA-Seq is now phasing out macro and microarrays. Still, the latter technology remains widely used, especially in clinical contexts, and works as follows. First, a set of RNA probes is synthesized, one for each gene of interest. Each probe must be long enough so that it hybridizes specifically with a single gene within the genome. The probes are then attached at specific locations (spots) to a physical support (a slab of material), forming an array. Once the array is ready, mRNA is extracted from the cells of interest and amplified using labeled nu-
cleotides. In radioactive microarrays, one can hybridize the amplified RNA
directly and measure the radioactivity at each spot. By having a radioac-
tive standard one should then be able to calculate absolute amounts of gene
expression. In fluorescent microarrays, one labels the amplified mRNA of two different conditions with fluorescent dyes that emit at different wavelengths and hybridizes the two sets of amplified mRNA with the same array.
Spots that hybridize with genes that are equally expressed in both condi-
tions will emit a similar amount of fluorescence at both wavelengths. Spots that hybridize with genes that are preferentially expressed in one of the conditions will emit stronger fluorescence at the wavelength associated with that condition.
There are several aspects one needs to be aware of when analyzing microarray data. First and foremost, these data are often very noisy and have low quantitative reproducibility. This means that even radioactive microarrays can only be used to calculate absolute changes in very special cases. Second, and focusing on fluorescent microarrays, there is no single way of treating the raw data that is universally accepted and works 100% of the time. Many protocols exist and they all have advantages and disadvantages. All of them consider several aspects of treating the data:
1. Background noise. RNA is sticky and binds nonspecifically to the phys-
ical material of the microarray. This creates a background of fluores-
cence that needs to be subtracted from the overall intensity one mea-
sures. There are several ways in which one can deal with this problem.
For example, one can use internal controls with spots where there are
no probes and measure the fluorescence of each wavelength in those
spots. Through the use of a statistical model for the distribution of this background noise, one can then subtract it across the entire microarray.

2. Are the two fluorescent dyes equal in their behavior? In other words, for the same amount of label, do they emit the same amount of fluorescence? If not, this should also be controlled for. For example, fluorescence should be normalized at each wavelength and the normalized fluorescences should be compared.

3. Once all background noise has been removed and all intensities normalized, how do we know that a given gene has differential expression between the two conditions being compared? Again, statistics helps. For example, one could assume that not all genes should, could, or need to change their expression between alternative conditions. If this is so, then one can collect all the changes across the genome and create a statistical model of those changes. With this statistical model (distribution) in hand, one can then identify all the changes that are less likely than a given (low) probability of being observed by chance. Say you want to identify the changes that are less than 5% likely to be observed by chance. Then, you go to the distribution and pick those changes outside of the 2.5% percentile on each side of the tails. You can also do this in a non-parametric way, by ranking the changes in gene expression and picking the smallest and highest 2.5%, as in the sketch below.
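A minimal Python sketch of the non-parametric version of this idea is shown below; the 2.5% tails and the toy log-ratios are arbitrary illustrative choices, not a recommendation for real data.

import numpy as np

def extreme_changes(log_ratios, tail=0.025):
    """Flag genes whose log ratios fall in the lowest or highest `tail`
    fraction of the genome-wide distribution of changes."""
    log_ratios = np.asarray(log_ratios)
    lower = np.quantile(log_ratios, tail)
    upper = np.quantile(log_ratios, 1 - tail)
    return np.where((log_ratios <= lower) | (log_ratios >= upper))[0]

# Toy data: most genes barely change, two genes change strongly
rng = np.random.default_rng(0)
ratios = np.concatenate([rng.normal(0.0, 0.3, 1000), [3.5, -4.0]])
print(extreme_changes(ratios))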

Microarrays are much less accurate in determining changes in gene expression than classical approaches such as quantitative PCR. However, they
allow for determination of simultaneous changes in gene expression for the
whole genome, which could not be done using previous methods.

3.1.2 RNA Seq


RNA seq has become the new standard for quantitative transcriptomics prac-
tically replacing microarrays in the last few years due to its high reproducibil-
ity and its steadily decreasing cost. This technology is based in the same next
generation sequencing techniques previously discussed for DNA sequencing.
Thus, the RNA isolated from the cells of interest has to be converted into
DNA using reverse transcription and subsequent amplification by PCR.
The experimental procedure starts with the isolation of RNA from a tissue or cell culture. Samples for total RNA studies can pass directly to the next phase, but if the goal is the quantification of gene expression or of the expression of small non-coding RNAs, the RNA samples are often treated to select the desired type. To quantify gene expression, eukaryotic samples can be enriched in mRNA by techniques that bind their poly-A tails. Alternatively, stable RNA can be degraded using diverse treatments. To select ribosomal RNA, which often accounts for 90% of the RNA found in a cell, specific hybridization of that type of RNA is used for extraction. Finally, small regulatory RNAs can be separated by size, since these RNAs are much smaller than ribosomal RNAs and mRNAs.
The resulting RNA has to be converted into DNA for sequencing, so it is fragmented, converted to cDNA using a reverse transcriptase and often amplified with PCR. The result is a DNA library that can now be sequenced to obtain reads, just like when sequencing a genome.
The rest of the work is exclusively computational and depends on the goal of the analysis.

de novo transcriptome Since the fragments are sequenced, this technique can be applied to genomes that have not been sequenced yet. The reads obtained in previous phases can then be assembled, just like a genome,
in order to obtain the sequences of each mRNA in the cell, in other words,
the transcriptome.

quantification of gene expression If the genome of the organism is known, the reads can be aligned to it. Genes that are highly expressed will result in many copies of the corresponding mRNA and, therefore, many reads
aligning to them. The opposite will happen with lowly expressed genes. The combination of aligning and counting results in high-quality and reproducible measurements of gene expression, but requires a number of corrections and normalizations. One of the many alternative approaches is to normalize the counts (number of reads aligning to a gene) by the length of the gene and by the total number of mapped reads. Depending on the order of the normalizations and the details of the method chosen, gene expression will be obtained as Fragments Per Kilobase of transcript per Million mapped reads (FPKM) or as Transcripts Per Million (TPM).
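The sketch below illustrates, with invented numbers, how the two units differ only in the order of the normalizations: TPM normalizes by gene length first and by library size second, while FPKM does the opposite.

def tpm(counts, lengths_kb):
    """Transcripts Per Million: length-normalize first, then scale so that
    the values for a sample sum to one million."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]
    scale = sum(rates) / 1e6
    return [r / scale for r in rates]

def fpkm(counts, lengths_kb):
    """Fragments Per Kilobase per Million mapped reads: normalize by the
    total number of mapped reads first, then by gene length."""
    per_million = sum(counts) / 1e6
    return [(c / per_million) / l for c, l in zip(counts, lengths_kb)]

counts = [100, 500, 50]        # reads aligned to three genes
lengths_kb = [2.0, 5.0, 0.5]   # gene lengths in kilobases
print(tpm(counts, lengths_kb))
print(fpkm(counts, lengths_kb))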
Although one can find whole genome/transcriptome gene expression data in many places, GEO (Gene Expression Omnibus) collects most of these data that are publicly available. In fact, most professional journals require that the data from experiments of this kind be deposited in GEO when the work is reported.
Treatment of gene expression data can also be done using an almost infinite number of programs, many of them proprietary. GEO itself provides some functionality to analyze the data it contains. However, if you want to analyze data on your own, the standard is Bioconductor, an R-based software suite that is free and accessible to everyone.

3.2 ncRNA
Let us consider two large classes of ncRNAs. The first class comprises regulatory ncRNAs that bind other RNAs (coding or non-coding) and regulate the way in which those RNAs perform their function. The second class of RNAs has direct structural and catalytic functions, and we will call them structural and catalytic RNAs.

3.2.1 Regulatory ncRNA target prediction


Typically, the function of this class of RNAs relies on their binding to their
targets. Thus, most bioinformatics applications concerning them aim at pre-
dicting their targets. The methods for target prediction take a list of regula-
tory genes and then run each of their sequences against all RNA sequences
that are known to be coded in the genome, looking for complementarity. This
can be done either using homology methods or using ab initio methods that also consider the structure of the regulatory ncRNA. Considering this structure might be an advantage because part of the ncRNA sequence might not be available for complementary binding to other RNAs. If this is known, then more precise predictions can be made.
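A deliberately simplified, sequence-only sketch of such a search is given below: it slides a short window over a hypothetical regulatory ncRNA and reports transcripts that contain a perfect match to the reverse complement of that window. Real tools also score imperfect pairing, thermodynamics and target accessibility; the sequences and the seed length used here are invented.

COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def find_targets(regulator, transcripts, seed_length=8):
    """Report transcripts containing a perfect match to the reverse
    complement of any window (seed) of the regulatory ncRNA."""
    hits = []
    for start in range(len(regulator) - seed_length + 1):
        seed = reverse_complement(regulator[start:start + seed_length])
        for name, sequence in transcripts.items():
            if seed in sequence:
                hits.append((name, start, seed))
    return hits

transcripts = {"geneA_mRNA": "AUGGCAUCCGGAUUUCAGGCAUAA",
               "geneB_mRNA": "AUGCCCGGGAAAUAA"}
ncrna = "CUGAAAUCCGGAU"   # hypothetical small regulatory RNA
print(find_targets(ncrna, transcripts))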

3.2.2 RNA structural prediction


RNA molecules fold and form (mostly) stable structures. Folding of RNA
occurs via base pair complementation, leading to the creation of structural
RNA features that can often be represented in two dimensions. These struc-
tures have direct implications for RNA function. For example, small hairpin loops in mRNA might regulate the speed at which an mRNA is translated by the ribosome into protein. They might also restrict the targets that ncRNAs can bind, by hiding complementary sequences within the hairpin loop. Thus, predicting the structure of RNA molecules is an important goal of RNA bioinformatics. There are three large classes of methods to predict RNA
structures: ab initio comparison, minimum energy, and structural inference
methods.

3.2.3 Ab initio comparative structural prediction of RNA molecules
Ab initio comparative methods rely on comparing the sequences of homologous RNAs. Covariation between different positions in the sequence might indicate that they are complementary and bind each other. The logic dictates that if two positions of a conserved sequence bind one another, a mutation in one will favor mutations in the other that restore the complementarity and preserve the binding. Simple methods to predict such binding are illustrated in slide 29 of Class 2.2 powerpoint. We align seven homologous sequences and then analyze how the residues in each position change when compared to those of other positions. For example, positions 5 and 27 in the alignment have a perfect correlation in terms of how residues change. Whenever one of the positions has a G, the other has a C. If one of the positions has an A, the other has a U. This is perfect complementarity, indicating that these two positions might be paired together in a hairpin loop. Doing this type of analysis and then using thermodynamic considerations about how C-G, U-A, and wobble base pairing (U-G) might stabilize the structure of the RNA will allow one to predict and rank possible structures for the RNA molecules.
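The Python sketch below shows the core of such a covariation screen on a made-up toy alignment: it flags pairs of alignment columns in which every sequence shows a Watson-Crick or wobble pair and at least two different base combinations occur. Real methods use alignments of many more sequences and statistical measures such as mutual information.

from itertools import combinations

PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def covarying_columns(alignment):
    """Return pairs of alignment columns in which every sequence shows a
    complementary (Watson-Crick or wobble) base combination."""
    n_cols = len(alignment[0])
    candidates = []
    for i, j in combinations(range(n_cols), 2):
        column_pairs = {(seq[i], seq[j]) for seq in alignment}
        if column_pairs <= PAIRS and len(column_pairs) > 1:
            # every sequence pairs column i with column j, and more than one
            # base combination occurs (covariation rather than mere identity)
            candidates.append((i, j))
    return candidates

alignment = ["GCCAAAGGC",
             "GUCAAAGAC",
             "ACCAAAGGU",
             "GGCAAAGCC"]
print(covarying_columns(alignment))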
3.2.4 Minimum energy structural prediction of RNA molecules
Minimum energy methods rely on two assumptions:

1. The native structure of RNA molecules is the one with the minimum
energy.

2. The free energy of each base pairing is independent of all other base
pairs.

By applying these two assumptions and using dynamic programming, one can then predict the structure of RNAs in the following way:

1. Start by creating a matrix from the sequence, where the rows and columns correspond to the residues, as shown in slide 32 of Class 2.2 powerpoint.

2. Assume there is a minimum number of positions that must exist between two bases if they are to be able to pair. This is a reasonable assumption, as the bend in the RNA backbone requires that bases are separated by at least 4 residues for the bending to allow base pairing. In the example shown in the slides we take this number to be 4. It is known that physical bending of the nucleic acid cannot put in contact bases that are closer than this within a single chain.

3. Settle on a score for each type of base pairing situation between residues
in positions i and j. An example of how the 4 possible situations might
be scored is given in slide 33. Typical values for how much a given base
pairing stabilizes the structure are given in slide 34.

4. Use the scores from step 3 to fill the upper triangular matrix. The first four diagonals will be zero. We can then start with the next diagonal and score the possible pairings. Then, we can calculate the score of each pairing in each row and column.

5. Once we have the matrix fully scored we can start at the highest score and trace back until we reach the middle diagonals. This gives us the most stable configuration, as shown in slide 37 (a minimal code sketch of this scheme follows below).
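A minimal Python sketch of this kind of dynamic programming is given below (essentially a bare-bones Nussinov scheme). It scores every allowed pair, including the G-U wobble, with one point instead of the thermodynamic values from the slides, enforces the minimum distance of 4 residues between paired bases, and returns only the filled matrix and the best score; the traceback that recovers the actual pairs is omitted. The test sequence is a toy hairpin.

def fold_score(seq, min_loop=4,
               pairs=frozenset({"AU", "UA", "GC", "CG", "GU", "UG"})):
    """Fill the upper-triangular matrix N[i][j] with the maximum number of
    base pairs that the subsequence i..j can form."""
    n = len(seq)
    N = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):           # the first diagonals stay zero
        for i in range(n - span):
            j = i + span
            best = N[i + 1][j]                    # residue i left unpaired
            for k in range(i + min_loop + 1, j + 1):   # or pair i with k
                if seq[i] + seq[k] in pairs:
                    left = N[i + 1][k - 1]
                    right = N[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + left + right)
            N[i][j] = best
    return N[0][n - 1], N

best, matrix = fold_score("GGGAAAACCC")
print(best)   # maximum number of base pairs for this toy hairpin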

3.2.5 Structural inference prediction of RNA molecules


These methods rely on homology assumptions. If one knows the structure
of a given RNA molecule (subject) and wants to predict the structure of
another RNA molecule (query) that is very similar to the first, then one can
use the alignment of the two sequences to predict where each base of the
query will be in space. This is done by aligning the two sequences and then
placing the residues of the query in the spatial location of the corresponding
residues from the subject. A final step of energy minimization can help in
improving the prediction.

3.2.6 Final notes on the prediction of RNA structures


The most accurate methods for RNA structure prediction combine two or more of the approaches discussed above in various ways. It is also noteworthy that the final structure of an RNA can contain more complex features, such as pseudoknots, which are extremely difficult to predict.
CHAPTER 4

Bioinformatics with Proteins

After seeing how genome, transcriptome and other RNA analyses require bioinformatics, and after reviewing some general methods, we shall move on to proteins. Until just a few years ago, finding protein coding genes and characterizing those genes and the proteins that they encoded was probably the main field of application for bioinformatics.

4.1 From gene sequence to protein sequence

So how do we go from gene to protein and beyond? Translating the coding sequence of a gene into that of a protein is straightforward but not free of pitfalls, because the genetic code is not as universal as we tend to think. There are at least 13 known variants of this code. For example, the TGA codon that is universally accepted as the termination codon codes for tryptophan in genes from the mitochondrial genome of yeasts. In addition, if this codon is found in the middle of an mRNA and that mRNA has a SECIS structural element close to its 3' end, the TGA codes for the 21st amino acid, selenocysteine. In other words, translating DNA/RNA into protein sequences is easy, but some attention should be paid to using the appropriate genetic code.
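The toy Python sketch below makes the point concrete with a deliberately tiny codon table (three RNA codons only; it is not a complete genetic code, and the sequence is invented). Swapping the table changes how UGA is read.

# A tiny, illustrative fragment of the standard genetic code
STANDARD = {"AUG": "M", "UGG": "W", "UGA": "*"}   # '*' marks a stop codon

# In the yeast mitochondrial code, UGA codes for tryptophan instead of stop
YEAST_MITO = dict(STANDARD, UGA="W")

def translate(mrna, code):
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = code[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = "AUGUGAUGG"
print(translate(mrna, STANDARD))     # stops at UGA
print(translate(mrna, YEAST_MITO))   # reads UGA as tryptophan (W)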


4.2 Extracting information about a protein from its sequence
Once we have the sequence of a protein there are several things that we
typically want to know about that protein. For example, we may want to
figure out its function in the cell, its structure or its localization.

4.2.1 Predicting physical chemical characteristics of a protein
There are several approaches that can help us in these tasks. Once we have our protein sequence we can go to a database and compare it to others. If we are lucky, our sequence will be very similar to a well characterized protein, in which case both proteins will, very likely, have the same function.
Unfortunately, we often find that our sequence is only similar to poorly annotated proteins of unknown function, maybe even hypothetical proteins. In a few cases we may also find that our sequence is unlike any in the database. Even in these conditions bioinformatics can help us to obtain information about our protein from a first-principles analysis of its sequence.
One can calculate the relative amino acid composition of the sequence, pre-
dict the protein’s isoelectric point from the pKas of the individual amino
acids, or calculate the molecular mass by adding the masses of all the residues.
Amino acid composition, as well as signal and content sensor methods, can then be used to predict where in the cell the protein might be localized. So even if bioinformatics cannot tell you what the protein does, it can help you isolate it for experimental analysis. There are many different servers that al-
low you to make predictions of this type about protein sequences. The most
organized collection of such servers can be found at the Expasy webpage, but
the EBI or NCBI also have a few of these functionalities.
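A back-of-the-envelope Python sketch of two of these first-principles calculations is shown below. The average residue masses are rounded approximations, one water molecule is added for the free termini, and the sequence is a made-up fragment; dedicated servers such as those collected at Expasy use more careful values and also estimate properties like the isoelectric point.

from collections import Counter

# Approximate average masses (in daltons) of peptide-bonded residues
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02   # added once for the N- and C-terminal groups

def composition(sequence):
    counts = Counter(sequence)
    return {aa: counts[aa] / len(sequence) for aa in counts}

def molecular_mass(sequence):
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER

sequence = "MKTAYIAKQR"   # hypothetical protein fragment
print(composition(sequence))
print(round(molecular_mass(sequence), 2))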

4.2.2 Homology and ab initio protein structure modeling
Another important aspect of proteins is their structure. This structure is
(mostly) coded in their sequence and it has a direct impact on what a pro-
tein can do and how. If we are lucky, the structure of our protein of interest
can be found at the protein databank, which is the portal/database/server
where all experimentally determined protein structures are collected, orga-
nized, and made available to the general public. However, there are far fewer experimentally determined protein structures (on the order of 10^5) than protein sequences (on the order of 10^8). Hence, finding the structure of the
protein that you are interested in is literally a thousand times less likely than
finding its function via sequence comparison to a well characterized ortholog.
It is because of this that methods to predict protein structures from their
primary sequences were developed. The most accurate approaches to predict
protein structure involve what is known as homology modelling. Homology
modelling servers work in the following way:

1. You submit your sequence to a server. That server has a database of sequences for which experimental structure determinations exist.

2. The server seeks to identify a set of subject sequences in its database that are similar to your query, using algorithms like BLAST.

3. These subject sequences are then aligned to your query. Typically, the automated alignments provided by the servers can be manually improved.

4. Based on the alignment, the amino acids of your query sequence are
threaded and distributed spatially in the same place as the residues to
which they are aligned from the subject.

5. This initial model is then optimized using force fields and energy minimization, and the sidechains of the amino acids are rearranged to eliminate clashes. In the end you obtain a model of the structure your protein is likely to have.

The accuracy of homology models can vary widely. This accuracy depends on various factors. First and foremost comes the quality of the alignment. The better the alignment and the higher the identity between the query sequence and the subject sequence(s), the better the prediction will be. Roughly speaking, if the identity is higher than 70% you will have a good model. If it is below 30%, your model will not be that good. Another factor
that is important in the prediction is the globularity of proteins. Globular
proteins tend to be more accurately predicted than non-globular proteins.
This is probably due to the fact that the vast majority of known structures
are globular.
This reasoning would also justify why homology modelling of membrane
proteins is normally not very accurate: There are very few such proteins for
which we know a structure. Given that we can only accurately model proteins that are similar to the ones we already know, this limits the prediction accuracy for membrane proteins.

So, what can we do when no templates are available for homology modeling? In such a case, ab initio methods can be used. There are many such methods. However, the one that seems to be most accurate so far is the one developed by David Baker and his group, ROSETTA. This method does not really rely on first principles; it uses small-scale homology to predict protein structure in a way similar to BLAST:

1. If the query sequence has sufficiently strong homologues, use homology modeling.

2. If not, divide the query sequence into short subsequences (by default 6
letters, although this can be changed).

3. Look for homology to those sequences. Once found, identify all possible
short structural elements that are associated to those small sequences.

4. Start assembling the model as if you were synthesizing the protein in the ribosome. The first six amino acids come first. Then, add the second six, followed by the third six, etc. This procedure automatically eliminates all possible conformations in which assembling a new fragment causes clashes with preexisting amino acids.

5. Once you finish assembling the sequences, you minimize the energy and
rank the models according to that energy.

There are servers that allow you to automatically create models of protein structures using the various types of approaches available. The Protein Model Portal collects many of these under the same umbrella. SWISS-MODEL is probably the best homology modelling server available. ROBETTA implements the ROSETTA algorithm, but it is very slow. If you need to use ROSETTA, just download the software, install it, and run it locally.

4.3 Proteomics
Although organisms mount a slower, longer-term response to changes in environmental conditions by changing gene expression, those changes do not necessarily propagate to the subsequent levels of cellular execution. For example, sometimes the change in environmental conditions causes changes in the stability of the RNA, and an observed change in gene expression is only mounted to maintain RNA levels and protein activity constant. To fully un-
derstand how the adaptive responses propagate one must also analyze the
protein complement of the cell, or its Proteome. This can be done in several
ways. Classically one can study small sets of proteins and their abundances
using traditional biochemical approaches.
However, as was the case with RNA, there are methods that permit quan-
titative proteome wide studies. In contrast with the genome wide gene ex-
pression changes, where new technology had to be developed, proteome abun-
dance studies rely mostly on the development of previously existing technolo-
gies. Originally, proteome wide abundance studies relied on a combination
of electrophoresis and spectroscopy. The protein fraction of the cell was
solubilized and extracted, then run through 2D gel electrophoresis, where proteins are separated by charge in one dimension and by mass in the other. Protein spots could then be identified, for example via protein staining methods. Each spot was isolated and the proteins extracted. A cocktail of
specific proteases could then be applied to these proteins, for the fragments
to be subject to spectroscopic/spectrometric analysis. In mass spectrometry,
for instance, the fragments are again separated by charge and size, resulting
in a spectrum per sample. This method is known as peptide mass fingerprinting. Another method that is gaining users is tandem mass spectrometry, where the peptide fragments are generated via collisions on the fly. Modern methods replace the electrophoresis step with alternative separation techniques.
Knowing the sequence of the proteins and the composition of the protease cocktail, a theoretical spectrum can be calculated for each protein so that it can be compared with those of each spot. Thus, it is in principle possible to identify which proteins are in each spot and in what amount. This is not as easy as it sounds, though. The spots often contain more than one protein, making the spectra difficult to analyze. In practice, mathematical methods are needed to analyze and compare spectra. These methods often rely on comparing the mass peaks of the two spectra and identifying a number of peaks that can be associated to specific proteins above a certain level of statistical significance.
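As a rough illustration of the first half of this idea, the sketch below produces the peptides expected from an in silico tryptic digestion (cutting after K or R, except before P, a common approximation of trypsin specificity) and counts how many observed peaks fall within a tolerance of a theoretical mass list. The protein sequence, masses and tolerance are invented; the peptide masses themselves could be computed as in the earlier composition sketch.

def tryptic_peptides(protein):
    """Cut after K or R, except when the next residue is P."""
    peptides, current = [], ""
    for i, residue in enumerate(protein):
        current += residue
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides

def matched_peaks(theoretical_masses, observed_masses, tolerance=0.5):
    """Count observed peaks (in Da) lying within `tolerance` of any
    theoretical peptide mass."""
    return sum(any(abs(obs - theo) <= tolerance for theo in theoretical_masses)
               for obs in observed_masses)

print(tryptic_peptides("MKTAYIAKQRPLMK"))
print(matched_peaks([500.3, 707.4, 733.4], [500.1, 650.0, 733.6]))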
Proteomics methods can achieve both relative and absolute quantification of protein abundances. Relative proteomics methods compare two samples to determine the relative abundance of each protein between them. This is done by labeling the two samples with alternative isotopes of some of the atoms used to synthesize proteins and calculating the ratio for each protein. Calculating absolute protein numbers from these methods requires the availability of an absolute standard for one of the samples. The field of proteomics is still growing at a fast pace and new methods keep being developed, including different label-free approaches.
Proteome abundance measurements are not as sensitive as measurements of changes in gene expression. The former tend not to detect proteins that are in low abundance (roughly fewer than 50 copies per cell). In addition, they may not always be able to resolve which proteins are identified in a given sample. As
is the case with changes in gene expression, there are also other aspects of the proteome that can be analyzed. For example, one can use techniques similar to those used in nC methods to identify genomic binding sites for transcription factors, for example using ChIP-chip experiments. Another example is the use of labeled ATP to measure changes in the phosphorylation patterns of known proteins. There is an abundance of software that allows people to identify proteins in MS experiments. Many of the available programs are proprietary and come with the machines that are used for the experiments. However, there are also many open source general programs that can be used and adapted for each specific purpose. Examples are Greylag, Inspect, or MassWiz, which can either be downloaded and installed locally or run through an online dedicated server.
CHAPTER 5

Bioinformatics with Networks

At this point we have an idea about many of the things we can do and
learn with bioinformatics about a gene or protein, based solely on static
sequence information. This is very nice and allows us to obtain a lot of
information about what those genes do or how they evolve (by comparing
sequences of orthologous genes over many organisms). One can also identify the limiting conditions under which organisms can survive. This can be done simply by analyzing the genome of an organism, identifying which functions are present and which are absent, and correlating those functions to what the environment needs to provide to the organism for it to survive. For example, if the pathway for the biosynthesis of lysine is absent from the genome, the organism needs the environment to provide that amino acid for survival.
However, we often need another type of information that is more dynamic
and quantitative. For example, if we want to know how a microbe reacts
to an antibiotic in real time, sequence information provides very little help.
In this situation, one needs to measure the dynamic response of the various
components that permit a microbe to react to the drug. Such measurements
are the bread and butter of molecular and cellular biology. For more than
half a century we have been measuring how specific components of a cell respond to specific challenges. For example, Monod and Jacob won the Nobel
prize by analyzing the response of the lac operon to various inducers and
repressors of gene expression, to reconstruct the regulation of the operon.
The limitation of these classical approaches is that one was only able to
follow the dynamics of a small set of molecular components of the cell at
the same time. Since 1995, however, technological developments have shifted the way we can measure the molecular components of the cell. The
omics revolution has allowed us to (in principle) measure dynamic changes
in all possible components of the cell at the various levels. This generates
a situation where humans need computers and computational methods to
integrate, analyze, and extract information from the data.

5.1 The dynamics of the metabolome


As is the case with gene expression changes, changes in protein abundance
and/or activity do not necessarily imply changes in the processes that are
executed by those proteins. For example, if temperature drops, additional
synthesis of some proteins might simply be used by the cell to maintain the
activity of those proteins at the same level as before the temperature drop.
As another example, if a protein in a biosynthetic pathway increases 2-fold, this does not necessarily mean that the flux through the pathway or the metabolite concentrations will increase similarly.
To understand the metabolic consequences of protein changes one must measure what happens at the metabolic level. Whole metabolome and fluxome measurements allow us to do this. In general terms, determining the
metabolites that are present in a biological sample can be done using technol-
ogy that is similar to that used to detect and identify proteins. After sample
preparation, mass-spectrometry or NMR techniques can be used to create
spectra for the samples. These spectra can then be compared with those for
pure compounds in databases, and thus identify the likely metabolites that
are contained in the spectrum.
In practical terms, identifying the metabolites in a biological sample can
be much more difficult than identifying proteins. There are several reasons
for this. One of the most important is that the number of metabolites in a
sample is often much larger than the number of proteins, creating situations
where attributing peaks in a spectrum to a given metabolite might be very
difficult. Another important reason is that the databases of metabolite spectra are relatively smaller than those for proteins: given the diversity of possible metabolites one might find (for example, a single plant may synthesize between 10,000 and 100,000 different metabolites), only a limited fraction of individual metabolites have reference spectra in databases.
Metabolite (as well as gene expression and proteome) spectra might also be used as fingerprints for various physiological situations without identifying the individual components of the spectra. For example, imagine one goes to a doctor while healthy. Nevertheless, the doctor asks for a metabolic analysis of one's blood. Several weeks later you fall sick. The doctor cannot really figure out what you have and asks for another metabolomics analysis. After several tries he finds something that cures you, but he still has no idea what you had. If, the following year, you fall sick again and get a metabolomics analysis of your blood done, and the spectrum of that analysis is sufficiently similar to that of your previous illness, your doctor might automatically give you the same medicine that worked before.
In addition to the ability to determine the metabolites present in a biolog-
ical sample, we are also able to determine, in principle, how these metabolites
are synthesized. Fluxomics techniques allow us to do so. These techniques
rely on marking initial metabolites with rare isotopes and following how these
rare isotopes distribute among the various metabolites that are synthesized
from the marked precursor. For example, if we start with marked glucose
and, at small intervals, take samples and measure how the marked atoms of
glucose are distributed through the glycolytic pathway using NMR, we can
then obtain the kinetics for the individual processes in that pathway.

5.2 Inference of biological networks


We saw in the previous examples how, in principle, one can now look at the
various levels of the molecular and cellular landscape and understand how
each of these levels works. However, we also saw the need for an integrated view if we are to make sense of what is going on in cells and organisms as a system. Measuring only changes at one level, be it gene expression, protein levels or metabolite abundances, does not tell us what is going on at the other levels. Moreover, these levels interact with one another and these interactions must also be incorporated in the analysis.
Even if we focus on a single protein and a small set of metabolites, we need biological context. Let us focus our attention on the protein Ste11 from Saccharomyces cerevisiae. To understand what this protein does, we need to consider the other proteins in its “social network”. By doing so, we will be able to see that it participates in mediating adaptive responses to both starvation and changes in osmotic pressure. If we are interested in the os-
motic response, then we should focus on the proteins, genes and metabolites
involved in that pathway. If we are able to integrate all the information
pertaining to the various molecular levels involved in this response, we will
actually have a shot at understanding who does what to whom and how!
Bioinformatics provides methods to (semi)automatically analyze the various omics datasets, infer functional interactions between the various molecules, and reconstruct the networks that regulate the various biological processes and responses.

Homology transfer

First, there is the most traditional method one can think of: Homology trans-
fer. There are many molecular pathways that have already been characterized
and for which the proteins, RNAs, and/or metabolites that participate in the
pathway are known. Imagine one can, through sequence comparison (or any
other method), identify the sequences of genes/proteins in a new genome
that are orthologous to those that are known to participate in that pathway
in other organisms. In this case one can almost automatically attribute that
gene or RNA in the new genome to the corresponding place in the pathway
that appears to be conserved from other species.

Data mining

This method is based on automatically mining information from published scientific documents. Such documents have been accumulating over more
than half a century now, and they are available online in many cases. These
are highly specialized texts where the information density is typically higher
than that of other non technical documents. In addition, if the names of
genes, proteins, metabolites or other chemicals appear in these documents, more often than not this is because those molecules participate in a biological response or process. If we turn this argument on its
head, we can say that if two or more chemicals/genes/proteins/metabolites
are found to co-occur in the same documents, this might indicate that the set
of co-occurring molecules is more likely than average to functionally interact
in a given process. This allows us to reconstruct in a semi-automated way
networks of genes/proteins/RNAs that are likely to be functionally interact-
ing.
There are several programs that allow such a reconstruction. One ex-
ample is iHOP. This program automatically analyzes co-occurrence of genes
in Medline abstracts. It preprocesses the abstracts, analyzing the genes in
each of them in advance and stores the co-occurring genes associated to each
abstract in a database. When the program receives a query, it only has to look up that pre-processed database, so the response is very fast. Another example of a program that analyzes co-occurrence of genes is Biblio-MetReS. There are several differences between this program and iHOP. First, Biblio-MetReS allows you to analyze full documents in many databases, not being limited to Medline abstracts. Nevertheless, you can analyze abstracts only, too, if you choose the Medline database as your corpus of analysis. Second, Biblio-MetReS
does on the fly analysis of the documents it finds. This ensures that you are
analyzing the most up to date corpus of documents, but it makes it much
slower than iHOP. To address the issue of speed, Biblio-MetReS stores the
gene/protein co-occurrence information regarding any document it finds and
analyzes once. Thus, if a document is found in a new search but was ana-
lyzed before, Biblio-MetReS will retrieve the processed information regard-
ing that document from its central database, rather than analyzing it again.
Finally, Biblio-MetReS allows users to include lists of biological processes and/or pathways with which the genes and proteins might co-occur, providing additional information regarding how the various genes may be functionally interacting.

Evolutionary methods
This semi-automated method for network reconstruction focuses on analyz-
ing evolutionary information regarding genes and proteins from organisms.
The rationale for using this information in network reconstruction is as fol-
lows: proteins (genes) that work together are likely to experience evolutionary pressures that give them similar evolutionary patterns. Again, turning this argument on its head, one might state that proteins/genes that have similar evolutionary patterns are more likely than average to be functionally interacting. There are several ways in which coevolution can be analyzed.
First, there are phylogenetic profiling techniques. Imagine that one
has the fully sequenced and annotated genome of several organisms. We
can then build a matrix where each row is an organism and each column
is a protein/gene. By analyzing which genes are simultaneously present or
absent in the same subset of organisms, one can identify genes that are likely
to participate in the same processes. More complicated logical analysis can
also be performed using such a matrix. For example, if a gene is always absent when another gene is present and vice versa, this might indicate that both genes perform the same function in alternative organisms.
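A bare-bones sketch of comparing such presence/absence profiles is shown below; the gene names and the five-genome matrix are invented, and real tools use more sophisticated similarity and significance measures.

def profile_similarity(profile_a, profile_b):
    """Fraction of genomes in which two genes are both present or both
    absent (a crude measure of phylogenetic profile similarity)."""
    matches = sum(a == b for a, b in zip(profile_a, profile_b))
    return matches / len(profile_a)

# Presence (1) / absence (0) of each gene across five genomes
profiles = {
    "geneA": [1, 1, 0, 1, 0],
    "geneB": [1, 1, 0, 1, 0],   # identical profile: candidate partner of geneA
    "geneC": [0, 0, 1, 0, 1],   # complementary profile: possible functional analog
}
print(profile_similarity(profiles["geneA"], profiles["geneB"]))
print(profile_similarity(profiles["geneA"], profiles["geneC"]))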
Second, one can also do conservation of gene neighborhood analysis (termed synteny in eukaryotic genomes). In this case one analyzes the regions of the genome that are close to a given gene and identifies the other genes that are present. If those genes are conserved in the proximity of orthologs of the original gene in other organisms, this might be an indication that genome evolution has kept these genes together for some reason and that they cooperate to perform some function. This method is much more accurate for prokaryotes than for eukaryotes. In fact, operons are the main example that sustains this method. Typically, an operon contains sets of genes that belong to the same pathway.
Third, one can identify gene fusion events by comparing various genomes. For example, E. coli employs two genes in the biosynthesis of tryptophan that, in B. subtilis, are merged together. In this case, gene fusion events indicate both functional and physical interaction between the products of the genes.
Fourth, collecting sequences for the orthologs of several genes/proteins in a number of organisms and building multiple alignments for each gene enables the construction of phylogenetic trees. These trees can be compared to one another. The genes that give rise to similar trees have similar evolutionary patterns and are thus candidates for functional interaction.
The fifth set of methods is based on the analysis of omics data. The principles that underlie them are similar. If a given set of genes/proteins/metabolites/RNAs has similar patterns of regulation under different conditions, then the molecules in this set are likely to be working together and to participate in the processes that regulate the adaptive responses to those conditions.
The sixth set of methods is based on large scale protein interaction
data. There are several techniques that pertain to this set of methods. First,
there have been experimental measurements of physical interactions between every possible pair of proteins in a genome for several well studied model organisms. This was done using, for example, Two-Hybrid system approaches,
where if two proteins interact they emit a signal that can be detected. If
the pair of proteins does not interact, no signal is detected. Physical in-
teraction is taken as an indicator of functional cooperation. Having these
large interaction datasets for several organisms allows us to transfer those
interactions to other organisms by assuming that orthologs of the interact-
ing proteins might also be interacting. Second, there are computer methods
that permit predicting if two proteins might interact and how. These meth-
ods are of two types. Sequence docking relies on having the sequences for
orthologs of the two proteins in several organisms, which permits creating
multiple alignments for each of the proteins. By comparing how the residues
are conserved in each of the two alignments one can sometimes predict if
specific positions in the two proteins physically interact. This is done by
identifying compensatory mutations between the two proteins. For example,
imagine two residues, one from each protein, that participate in the phys-
ical interaction between the proteins and that there are mutations in one of the residues in some of the organisms (for example, Asp → Gln). In the same organisms, the interacting residue from the other protein should have mutated in order to maintain the interaction (for example, from Gln → Asp, in order to maintain the electrostatic complementarity between residues). The
other type of computer docking methods are termed in silico docking. These
methods require the experimental or modeled structure of a pair of proteins
and, using physical-chemical, electrostatic, thermodynamic and spatial con-
siderations, (and sometimes biological information) identify the most likely
way in which the two structures interact. These methods are very computer intensive and require a lot of power. So far, they don't scale up well for whole genome/proteome level analysis. To my knowledge there is only one server that combines most of these methods in a way that is user friendly. That server
is STRING, and it lacks protein docking methods. Although the server and its database and integration are far from perfect, it is probably as good as it can be with the current level of biological, scientific and technical knowledge that is available.
As a final note, be advised that gene regulatory information (for example, which TFs might regulate gene expression or which RNAs might regulate mRNA use by the ribosomes) is still not included in STRING. In fact, integration and prediction of circuits at the gene and RNA regulatory levels
with circuits at the protein level is still incipient at best. This is also true
for automated identification of regulatory interactions of proteins by small
metabolites. Nevertheless, there are already databases and work regarding
such regulatory circuits that might in the near future facilitate automated
reconstruction of circuits at this level.
CHAPTER 6

Systems Biology

This chapter focuses on presenting and discussing the methods and tech-
niques that you might need to address the second practical task that you
will have to do.

6.1 From Network Biology to Physiology and Systems Biology
6.1.1 From omics data to circuits of interacting molecules
In the previous chapter we have seen how molecular networks and circuits
can be reconstructed using many different types of data and combining sev-
eral bioinformatics approaches. Imagine you use some or all of those meth-
ods to reconstruct your circuit of interest and now you have a prediction
of which molecular components might functionally or physically interact in
your circuit. Typically, what you want to do next is figure out the physiol-
ogy of action and regulation of the process in which the circuit of interacting
molecules participates.
Going from the network of interactions to the prediction/analysis of phys-
iological behavior is not trivial. In general, when you finish reconstructing a
molecular circuit in a (semi) automated way you get a graphical representa-
tion where the molecular components of interest are shown as nodes and the
interactions between components are shown as edges that unite interacting
nodes (see Slide 6). This is helpful, as it allows you to have some idea about


your network. However, when you start interrogating the network, the limitations of this representation become apparent. Let us start by asking what the interactions between nodes mean. Just by looking at the figure in slide 6 we cannot really answer this question. Let us try another question and try to figure out which nodes (molecules) are important regulatory points in the dynamic responses of the system. Again, we cannot answer this question just from the representation we have here. What about identifying all nodes
that are fundamental for the response of the network, can we know that from
analyzing the graph? The easy answer is again no. However, there is a more accurate and complicated answer, which is “partially”. There is a branch of mathematics called Graph theory that analyzes the connectivity of graphs
and the centrality of the various nodes. It turns out that biological circuits
have certain properties that enable using a set of four types of properties
of the graph to predict (with up to 70% accuracy) which nodes might be
fundamental in the functioning of the network. We will not go into details about this; let us just keep the simple answer: no.
Overall, what we can say from this is that these node and edge representations tend to be ambiguous and are not very helpful in letting us predict the physiological behavior of our circuits of interest. Even more detailed graphs, such as the ones that can be displayed by STRING, are not much better in this respect, as you cannot really infer what is going on in the interaction, only what type of interaction it is.
How can we overcome this limitation and use reconstructed circuits to
analyze and predict the physiological behavior of the system? If ambiguity
is the problem, let us define a representation that is unambiguous.

6.1.2 Unambiguous graphical representation of biological circuits
Let us look again at the node-edge representation of molecular circuits. The fact that two molecules (nodes) are connected by an edge may mean several things. If molecule A is connected to molecule B, this may mean, for example, that:

1. A and B directly cooperate to execute a biological function.

2. A performs a biological function and B inhibits that function or vice versa

3. A and B cooperate indirectly to perform some biological function.

What we need is a representation that is unambiguous and allows us to know exactly what is going on in the circuit. There can be many alternative
ways to represent molecular circuits in such a way. For example, the Systems
Biology Graphical Notation (SBGN) provides such a representation, where
each type of biological reaction and event is represented by a specific symbol.
However, for our purposes this is unnecessarily complicated. There are dozens
of symbols and they are continuously being updated. In this course we will
opt for using the notation that chemists have been using for more than 100
years, with slight adaptations. This representation will be flexible enough to
represent all processes we will be interested in and simple enough that we
do not have to memorize tens to hundreds of symbols. The notation is very
simple. Whenever there is a material flow between two pools of material,
these two pools of material (or molecules) will be united by a full arrow,
with the arrowhead pointing in the direction of the material flow. If the
flow is influenced by another molecule or molecules that are neither used nor produced in the process, a dashed arrow will unite each of those molecules to the material flow arrow. The arrowhead will point in the direction of the material flow arrow. Dashed arrows will never point to other dashed arrows or to species in the circuit, only to material flow arrows. If the dashed arrow represents an activation of the flux, it is associated with a plus sign. If the dashed arrow represents an inhibition of the flux, it is associated with a minus sign.
In a circuit representation there are two types of molecules (or variables).
Molecules whose amounts change over time are called dependent variables.
Molecules whose amounts do not change over time are called independent
variables. This is a definition that we will use for the creation of mathematical
models. For example, in slide 19 A and B change over time, as material flows
from A to B. These two are dependent variables. On the other hand, C does
not change over time, as no arrow brings material to or draws material from
the pool of C. C is an independent variable.
In our network representation we also need to include stoichiometric infor-
mation. For example, if two molecules of A are needed to form one molecule
of B, the number 2 should appear before A. If more than one species as-
sociates to form another species or complex, then the associating species
should be included, together with their stoichiometry. For example if three
molecules of species D associate with 2 molecules of species A to form species or complex B, the representation is as shown in slide 19. If the reactions are reversible, then an arrow pointing from the products to the reactants should also be included. In such cases, it is important to clearly identify whether the modulation arrows influence the forward reaction, the reverse reaction, or both.
Also, let us define a notation for naming the molecules in a circuit. Al-
though this might seem like overkill, it can serve two purposes. First, in
large models where the names of the species are similar (e. g. G6P or G1P)
it is easy to make a mistake in writing the reactions. Having a systematic
notation that removes names and puts the focus on variables decreases the
probability of making such mistakes. Second, if and/or when you ever have
to implement methods to create and analyze these models, such methods will
internally convert each molecule into a consecutive set of variables with the
same tag and increasing number. What we will do in this course is use X as
the stand in for the variable and number the various consecutive variables in
increasing order.
When creating a conceptual model for a circuit we must also consider
that often we do not require models for the entire cell, only for the part of
the cell that we are interested in. Often, in such cases material flows into
and/or out of your system. This can be represented by source reactions,
where a full arrow pointing to the species where material comes into the
system from nowhere, and sink reactions, where a full arrow pointing from
the species where material goes off of the system and into nowhere. Often,
source reactions can also include substrates that are considered as being
independent variables in your circuit. For example, if a protein (mRNA) is
synthesized, the source reaction comes from the cellular pool of amino acids
(nucleotides). The metabolic levels of amino acids (or nucleotides) in the cell
are approximately constant, irrespective of protein (RNA) synthesis. Hence, we can say that, even though they are being used, amino acids (nucleotides) do not change over time for our purposes, and they are independent variables in our system.
Another aspect that one should consider is that of cellular compartments.
If the system we want to study is distributed throughout several cell com-
partments (e.g. cytoplasm and nucleus), it is likely that we should represent
these compartments and consider them later to build a model.
Now that we have an unambiguous representation for biological circuits, let us check if such a representation is enough to allow for analysis and prediction of the physiological behavior of a concrete system. Let us consider a simple three-step biosynthetic pathway that is a representative abstraction, for example, for the biosynthesis of some amino acids. As represented in slide 28, a source X0 is used to produce a metabolite X1, which in turn produces X2. X2 is further metabolized to create X3, which is the final product of the pathway. The consecutive reactions are catalyzed by enzymes E1, E2, E3, and E4. E4 represents the cellular demand for the product of the pathway. It
is very common that the final product of the pathway, X3, inhibits the first
reaction of the pathway. This type of inhibition is termed overall feedback.
Now that we have the pathway representation, let us see if we can predict
how the pathway might work dynamically. Imagine that at time t0 you
increase X0. Using linear logic we would think that this would lead to an
increase in X1, which in turn would lead to an increase in X2, followed by
an increase in X3. If X3 increases then the inhibition of the first reaction
would become stronger and one might predict that it would be followed by a
decrease in the concentration of X1, then X2, then X3. This would again lift
the inhibition, allowing more material to come into the pathway, and leading
to subsequent increases in the pools of X1, X2, and X3. The cycle would
then repeat. This means that linear logic would predict that the dynamical
behavior of the pathway is oscillatory, with cyclic increases and decreases in
the amounts of the metabolites.
When we go in and actually observe how the pathway behaves, this is what we see. If the parameter that codes for the overall feedback strength is low, the pathway will not oscillate. It will reach a dynamic equilibrium known as
steady state where the concentrations remain constant. This lack of change is
due to the fact that what goes into the system is perfectly balanced by what
comes out of it, and not because the system has stopped. If the parameter
that codes for the overall feedback has intermediate strength, then we do
observe oscillations. If the parameter that codes for the overall feedback has
high values, then the system becomes unstable and there is neither a steady
state nor an oscillation. Such a system would not allow for survival of an
organism.
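A minimal numerical sketch of such a pathway is given below, assuming simple mass-action steps and a Hill-type inhibition of the first reaction by X3; the rate laws and all parameter values are arbitrary illustrative choices, and the code needs SciPy. With the weak feedback used here the simulation settles into a steady state; strengthening the feedback and making it more cooperative is the kind of change that can push such a system toward oscillations or instability.

from scipy.integrate import solve_ivp

def pathway(t, x, x0=1.0, k=2.0, n=4.0):
    """Three-step pathway X0 -> X1 -> X2 -> X3 -> demand, with X3
    inhibiting the first reaction (all rate laws and parameters are
    illustrative, not measured values)."""
    x1, x2, x3 = x
    v1 = k * x0 / (1.0 + x3 ** n)   # inflow, inhibited by the end product
    v2 = k * x1
    v3 = k * x2
    v4 = 1.0 * x3                   # cellular demand for X3
    return [v1 - v2, v2 - v3, v3 - v4]

solution = solve_ivp(pathway, (0.0, 50.0), [0.1, 0.1, 0.1])
print(solution.y[:, -1])   # metabolite concentrations at the end of the run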
What this example illustrates is that having an unambiguous representa-
tion of a biological circuit might be necessary but it is not sufficient to enable
analysis and/or prediction of the physiological behavior of that circuit. Why
might this be? Why can we not in general predict the dynamic behavior of a
biological system using simple logic? The reason is simple: Biological reac-
tions have non-linear dynamic dependencies on the variables that influence
them. Because of this, linear logic will fail when the system is operating near
a non-linear regime. The way to solve this problem is simply by using the
conceptual representations to create non linear mathematical models of the
systems of interest. The first of these models was created and published back
in 1943 by Britton Chance.

6.2 Mathematical modeling of biological systems
6.2.1 Mathematical representations and formalisms
How can we create mathematical models based on the conceptual represen-
tation of a system? First let us get some definitions out of the way. Each
dependent variable of the system must be represented by a differential equation of the form shown in slide 34. The left side of this equation represents
the time change of the variable A, while the right side of the equation in-
dicates that this change is determined by a function of the variables that
influence the fluxes going into or coming out of A. In this case, there a flux
producing A and another flux consuming A. The right side of the differential
equation can be abstractly represented by two functions. The function f1
depends explicitly only on D, while the function f2 depends on A and C.
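The slide itself is not reproduced in these notes, but from the description above the equation is presumably of the form

    \frac{dA}{dt} = f_1(D) - f_2(A, C).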
For this model to be useful we must now somehow determine the form
of these functions. What forms may these functions have? Many. However,
typically they will have one of three shape dependencies in biologically rel-
evant processes. In the simplest of cases a flux and its rate function may
depend linearly on the variables that influence the rate. In many cases the
fluxes depend on the variables in a saturating fashion. Most commonly this
saturation follows either a hyperbolic or a sigmoid shaped curve. If this is the
case, then the fluxes can be represented as shown in slide 35. However, more
often than not, we don’t know how the flux depends on the variables that
influence it. How can we create quantitative mathematical models in such
cases? The answer comes from mathematical approximation theory. There
is a whole body of mathematical theory that deals with this problem. One of
the most useful theorems of this body of knowledge is the Taylor theorem,
which states the following:
Any analytic function (roughly speaking, an infinitely differentiable function whose Taylor series converges to it) can be exactly represented by a polynomial series of its variables and derivatives. This series has the following shape:


f(X_1, \dots, X_n) = f_0(X_{1,0}, \dots, X_{n,0}) + \sum_i \left.\frac{df}{dX_i}\right|_0 (X_i - X_{i,0}) + \sum_i \frac{1}{2} \left.\frac{d^2 f}{dX_i^2}\right|_0 (X_i - X_{i,0})^2 + \dots + \sum_i \frac{1}{n!} \left.\frac{d^n f}{dX_i^n}\right|_0 (X_i - X_{i,0})^n + \dots \qquad (6.1)

What this means is that we can now write pretty much any function that
is relevant to represent the dependency of a biological flux on its influencing
variables. However, it is still of little help, given that in general the series is
infinite and we are not good at dealing with infinities.
How can we solve this problem? Well, by approximating the infinite
series with something that is manageable and yet has a range of validity
where predictions and analysis can still be done with sufficient accuracy.
The easiest way to do this is by truncating the Taylor series at the first-order term:
f(X_1, \dots, X_n) \approx f_0(X_{1,0}, \dots, X_{n,0}) + \sum_i \left.\frac{df}{dX_i}\right|_0 (X_i - X_{i,0})

However, if we approximate f with a first-order Taylor series as we just did, we obtain a linear representation, while we already saw that non-linearities
are very important in biological systems. How do we fix this? We use another
mathematical trick and change variables! The approximation we did above
was done in a Cartesian space, which is the way the brains of most of us
better understand and predict the world around us. However, let us now
switch to a logarithmic space. In this space we have that

F = \log(f), \qquad Y_i = \log(X_i)

We can now rewrite the truncated Taylor series as


F(Y_1, \dots, Y_n) \approx F_0(Y_{1,0}, \dots, Y_{n,0}) + \sum_i \left.\frac{dF}{dY_i}\right|_0 (Y_i - Y_{i,0}) = A + \sum_i g_i Y_i

Here, $A = F_0(Y_{1,0}, \dots, Y_{n,0}) - \sum_i \left.\frac{dF}{dY_i}\right|_0 Y_{i,0}$ and $g_i = \left.\frac{dF}{dY_i}\right|_0$. Both of these quantities are constants. To return to a Cartesian space from the logarithmic space we have to apply the inverse of the logarithmic transformation, which is the exponential transformation:

f(X_1, \dots, X_n) \approx \exp\Big(A + \sum_i g_i Y_i\Big) = e^A \prod_i \exp(g_i Y_i) = \alpha \prod_i X_i^{g_i}, \qquad \text{where } \alpha = e^A

Hence, f can be approximated as a power law of its variables. Slide 40 shows a concrete example of this approximation for a flux that depends on
two variables, A and C. You don’t need to remember all the derivation. What
you need to know is that, independently of the number of variables, the form
of the function is always the same: a pseudo rate constant, multiplied by
each of the variables to an exponent that is referred to as the kinetic or-
der. This form is convenient for many reasons. One has to do with the parameter values. The apparent rate constants (the α’s) are always positive. The kinetic orders are zero if a variable does not directly influence a flux, positive if increasing the value of the variable increases the flux, and negative if increasing the value of the variable decreases the flux. In addi-
tion, one can show that the kinetic order (in absolute value) is almost always
smaller than the number of binding sites for the molecules in the process
of interest. Furthermore, there are certain assumptions that one can use to
give numbers to these parameters. It has been shown that, under physio-
logical conditions, many reactions function with the variables close to the
value they have when the flux is at half its maximum possible value. If this is the case, the kinetic order can be taken as 0.5 or -0.5, depending on whether the variable activates or inhibits the flux. Furthermore, if a variable of the flux function works as a catalyst of the process, it is often reasonable to assume that its kinetic order is close to 1. Finally, in sink reactions without any modifiers and for which we don’t know the mechanism, a typical simplifying assumption is that the kinetic order of the substrate is 1. However, if
additional information is available regarding the processes and fluxes, that
information should be used to better parameterize the models. So now we
have a non-linear formalism that can be used even when we don’t know what
the shape of the dependencies between the flux function and its variables is.
Furthermore, this formalism is regular and always looks the same. This is
important because it facilitates automating the derivation of mathematical
models from conceptually unambiguous representations of biological circuits
and because it permits developing analytical methods that take advantage
of the regularities in the functions and are computationally more efficient. It
should be noted that, if you know the shape of the flux dependencies, then
you can use other spaces to approximate the flux function. There are many
options, including saturating cooperative spaces, which perfectly represent the
cases from slide 35. However, in this course we will not go into those other
spaces. You can find details about those approximations in the paper on
campus virtual. As a final note, I would like to add that biologists often try to do the reverse of what we described in this section: taking a mathematical modeling paper, they try to reconstruct the network of interactions and actions of the system. This is not always easy, but it is good practice for understanding better what the math actually means. Still, this inverse problem often has more than one solution. Depending on the mathematical formalism used for the equations, a variable can be either a substrate or a modulator. Solving this inverse problem requires that one looks further than the mathematical formalism. As stated above, each dependent variable has an equation describing its dynamic behavior in the system of ordinary differential equations that represents the circuit of interest. As a rule
of thumb, plus and minus separate different terms in an ordinary differential
equation. Terms preceded by a plus represent a process that contributes to
the production of the dependent variable, while terms preceded by a minus
represent a process that contributes to the consumption of the dependent
variable. As another rule of thumb, if we don’t know whether a variable X is
a substrate or a modifier of a process that appears in the ordinary differen-
tial equation of another variable Y, one looks at the differential equation that
represents the dynamic behavior of X. If the term also appears, preceded by a minus, X is a substrate; otherwise it is a modifier.
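As a toy illustration of these two rules of thumb, suppose the equations of the pathway from the beginning of this chapter contain the (hypothetical) terms

    \frac{dX_1}{dt} = v_0(X_0, X_3) - v_1(X_1), \qquad \frac{dX_3}{dt} = v_2(X_2) - v_3(X_3).

Here $v_0$ depends on $X_3$, but $-v_0$ does not appear in the equation for $X_3$; by the rule above, $X_3$ is therefore not consumed by that process and must be a modifier (in this pathway, the feedback inhibitor). In contrast, $-v_1$ does appear in the equation for $X_1$, so $X_1$ is a substrate of $v_1$.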

6.3 Examples of mathematical models in biology
You now have all the basic tools that are required for solving the problems you
are given in your practical tasks. At this stage I want to start discussing less
abstract and more practical considerations about the work you have to do
in Task 2. In Task 1 I gave a clear set of instructions to identify and obtain
information about a specific (set of) protein(s). Task 2 is much more open
ended. I am going to ask you to take the biological system you identified
in Task 1, ask a question about its physiology and design and implement
a mathematical model of that system that will allow you to answer that
question. In doing so, you will need to consider several aspects that you are
not used to thinking about. Here I discuss some of them. Let us start by
defining what your system might be. If you are ambitious, you might want to understand how the system you identified works to regulate cell survival as a whole. In this situation you need to consider the granularity of your approach. How much detail do you want to include in the model? Is it OK if you represent the various aspects of cellular metabolism in a sort of cartoonish way? This might be fine if you want ballpark results. If you want more
quantitative results then you probably need to consider a more detailed rep-
resentation of the cell. However, you should be aware that the more details
you include in your model, the more data you are required to have, find,
estimate, or guess. Currently there are two types of models that are used to
create whole cell models. One type is FBA (flux balance analysis) models.
FBA models are most effective and accurate for modeling the whole metabolism of
the cell. Their construction works in the following way. You take the whole
metabolic map of all organisms and, based on the genome of your organism
or habitat of interest, map the various annotated gene functions onto this
map. With the existing functions mapped, you find the metabolic map of
your organism. Subsequently, you identify what are the inputs that are re-
quired for this metabolic map to be able to sustain the growth and survival
of your organism (e.g., the organism needs glucose or some amino acid from
the medium). Having identified this you can now analyze how the fluxes are
distributed in the map using mathematical optimization techniques. Typi-
cally you assume that metabolism is optimized to speed up biomass growth,
which implies that you have to identify the fluxes that lead to the synthesis
of molecular components of the cell (amino acids, nucleotides, lipids, proteins, energy). Once this is done, you assume that fluxes going into the
metabolism are at a steady state (i.e. they are constant) and use mathe-
matical optimization techniques to identify how these fluxes are distributed
through the metabolism in order to maximize growth. This type of analysis
and models also permit identifying some of the genes whose mutations are
lethal in various organisms. This can be done in the following way: 1 – Once you have your model, you delete one gene and re-optimize the model in order to see if the organism can still grow. If not, this mutant is predicted to be lethal. 2 – Repeat this operation for every single gene in the genome (a toy version of this screen is sketched below). This type of analysis has a success rate upwards of 50%. However, FBA models have several limitations: 1 – These models are linear. They do not capture the nonlinearity of biological processes. If the lethality of a gene is a consequence of its non-linear behavior, these models might fail to capture that effect. 2 – These models deal with steady states. If the lethality of a gene is a consequence of its dynamic transient response, these models might fail to capture that effect. 3 – These models do not account for dynamic regulation. If the lethality of a gene is a consequence of its dynamic regulation, these models might fail to capture that effect. This reason partially overlaps with limitation 2. This type of whole cell modeling is the most widely used, in spite of its limitations.
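Before moving on to alternatives, here is a minimal sketch of this kind of FBA gene-deletion screen, on a made-up toy network with four reactions (an uptake step, two redundant conversion routes, and a biomass/demand reaction), using SciPy's linear programming routine. None of the reactions or numbers correspond to a real organism:

    import numpy as np
    from scipy.optimize import linprog

    # Toy stoichiometric matrix: metabolites A and B; reactions R0 (uptake -> A),
    # R1 (A -> B), R2 (B -> biomass) and R3 (an alternative A -> B route).
    # Steady state requires S @ v = 0.
    S = np.array([[1.0, -1.0,  0.0, -1.0],
                  [0.0,  1.0, -1.0,  1.0]])
    c = np.array([0.0, 0.0, -1.0, 0.0])      # maximize v2 (biomass) = minimize -v2
    default_bounds = [(0, 10), (0, 1000), (0, 1000), (0, 1000)]

    def max_growth(knockout=None):
        bounds = list(default_bounds)
        if knockout is not None:
            bounds[knockout] = (0.0, 0.0)    # "deleting" a gene forces its flux to zero
        res = linprog(c, A_eq=S, b_eq=np.zeros(S.shape[0]),
                      bounds=bounds, method="highs")
        return -res.fun if res.success else 0.0

    print("wild type growth:", max_growth())
    for i in range(S.shape[1]):
        growth = max_growth(knockout=i)
        label = "(predicted lethal)" if growth < 1e-9 else ""
        print(f"knockout of reaction {i}: growth = {growth}", label)

Because R1 and R3 are redundant routes from A to B, deleting either one is predicted to be non-lethal, while deleting the uptake or the biomass reaction is predicted to be lethal.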
More recently, an alternative solution to whole cell modeling has been proposed by Karr et al. for M. genitalium.
This solution consists of creating several types of models, each of which is
an adequate simplification for the cellular processes it is trying to model.
28 types of cellular processes are considered and whole cell submodels are
created for each type of process. Each submodel is then simulated during a
certain (small) amount of time. There are variables that participate in more
than one submodel. After this small amount of time these variables are taken
as output from the submodels where they are calculated and used as input in
the submodels where they regulate something. This is done repeatedly until
the end of the simulation is reached.
Once this model was developed, it was validated by comparing the results of the simulation with known experimental results. The authors analyzed different aspects of the physiology of M. genitalium, such as the cell cycle, energy production, and material fluxes in metabolism. Using this approach they were able to identify which genes are lethal more accurately than using FBA alone.
Having explained this, there is an older example I want to mention. That
example deals with the modeling of whole cell metabolism in human red
blood cells. In this case there was enough quantitative information to create
what was thought to be a complete quantitative model of the red blood cell,
using power laws. This is simpler than with other cells, given that there is
no gene expression or protein synthesis. With the model in hand they were
then able to test several things about our understanding of how red blood
cells work. By analyzing the effect of changing parameters on the behavior
of the model, they were able to identify which parts of metabolism were not
really well understood. Parameters whose changes lead to an important loss of functionality in the metabolism of the model red blood cell point to parts of the model for which we do not have enough information, given that we know real red blood cells are very robust and can survive for up to 120 days. In addition, they used this ability to survive to identify possible regulatory interactions that were unknown. They did so by systematically introducing candidate regulatory interactions and identifying which of them brought the survival of the model red blood cell closer to that of the real red blood cell. Some time later, one of the predicted regulatory interactions was actually experimentally verified by other groups.
As you can see there are several approaches you can use to create large
scale models of cells and organisms. However, I don’t recommend that you
do this. Whole cell/Whole organism models are difficult to create and try-
ing to make one of these is not a productive way to start modeling. My
suggestion is that you focus your questions on a pathway, circuit, or process that you identified in Task 1, ask your question, and create your models
about these circuits. Even when you do this, there are steps you need to
take before you start modeling. You need to decide what you are going to
include in your system. For example, if you want to model the biosynthesis of methionine, you can obtain a meta-map of metabolism that includes all reactions involved in this biological process in at least one organism. However, not all of these reactions are present in all organisms. You need to select the valid reactions for your organism. Once you have done that, you also need to decide the level of detail that you want to include in the reactions. Are you going to be mechanistically detailed, or is using approximations enough? There is no universally valid answer for this and one needs to decide on a case-by-case basis. You should also realize that creating smaller models of subsections
and processes of cells and organisms is a valid way to understand how these
processes work. There is a fair amount of modularity in cells. This means that
studying these modules will give us valuable information about how they work
and regulate the global cellular metabolism. In general, you should simplify as
much as you can, but no more than that. What “more than that” means is, again, case specific. Often, it takes several iterations before you finally decide what
you include in the model. Another aspect you should consider has to do with
what your questions are. If you simplify your model in such a way that it no longer allows you to answer the question(s) you are asking, then your model
is oversimplified. Up until this point I have been talking about the question
you need to ask
in an abstract fashion. Let us make it a little more concrete without going to examples yet. Basically, what you ask of your system is “How does my
system work under such and such circumstances?”
This question is in reality two questions. The first is “How does my system
work qualitatively?” In other words, how do the components of my system
assemble in order to execute and regulate the process I am interested in?
This question has to do with network reconstruction. You have done a little
bit of it in Task 1. However, in the current context, network reconstruction
needs to go a bit beyond that. Instead of finding functional and/or physical
interactions, what we are looking for now are causal relationships between
the various molecules of the system. In other words, which molecule does
what and how?
Now, I am going to give you a concrete example of a question of this type.
Let us start with the biosynthesis of FeS clusters. These clusters are crucial to
catalyze redox reactions, to create protein sensors for changes in the oxidative
conditions of the media, for DNA synthesis, and for many other things. It has
been known for several decades that if you purify proteins that should contain
FeS clusters and place them in solutions with Fe and S ions, the clusters would
assemble spontaneously. However, what people came to realize was that the
ion concentrations that were needed for such self-assembly to occur would be
too high and cells would die. Hence, a search for proteins that could form
a system for FeS cluster biogenesis ensued and this led to the discovery of
a set of proteins, conserved across the tree of life, that when deleted from
the genome led to defects in biological processes that depended on FeS clusters. A model organism for this study was Saccharomyces cerevisiae. Several proteins were identified in that organism that were involved in FeS
cluster biogenesis. However, even if most of these proteins have a known
molecular function, we did not know how they worked together. We only
knew that, when you knocked out one of the genes, cells accumulated Fe and
were depleted of FeS cluster-dependent protein activity.
With this information in mind what we did was create very many alter-
native models. Then, in each model we deleted the gene for one protein and
simulated to see if the system accumulated Fe and was depleted of FeS cluster
dependent protein activity. Several of the models showed no such depletion.
That allowed us to eliminate them as possibilities for how the network works
in the cell. For the ones we could not eliminate, so far all experiments are con-
sistent with our predictions. The details can be found in the references from
slide 92. Let us now go back to the question “How does my system work
under such or such circumstances?”. The second way in which you can ask
this question is how does my system work quantitatively, now that I have
its network reconstructed. What are the numbers (parameter values and
concentrations) that make my system work as it should work? Let us look at a concrete example of this. The system I will describe now is maize (panis). In sub-Saharan Africa, people use a type of white corn that produces very
little vitamin A. For several reasons that are not important, it is easier for
the populations to replace their staple corn with a GMO modified version
of that corn than with another natural corn with higher Vitamin A content.
Because of this, Paul Christou and his team used synthetic biology to intro-
duce the biosynthetic pathways for Vitamin A and other carotenoids in the
white corn. They created artificial lines with several versions of the path-
ways. They measured cellular changes in gene expression and metabolites
for each line, but were unable to measure protein abundances and activities
directly. That led to the following question: What protein activities are con-
sistent with the changes in gene expression and metabolic profiles that we
observe experimentally?
What we did to address this question was the following:

1. Use the known circuits for the pathways that synthesize carotenoids to
create power law models of the circuits. These models were different
for each maize line and depended on the genes that got inserted in the
specific line.

2. Use mathematical and statistical optimization and parameter fitting to calculate the parameter values that make the simulations as similar to the experimental results as possible (a toy sketch of such a fit is shown below).
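As an illustration of step 2, here is a minimal sketch of this kind of parameter fitting, assuming a single power-law flux v = α S^g and a handful of invented data points; the real work involved full models of each maize line and far more data:

    import numpy as np
    from scipy.optimize import curve_fit

    # invented "measurements" of a flux at different substrate levels
    substrate = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
    flux = np.array([0.7, 1.0, 1.5, 2.1, 3.0])

    def power_law(S, alpha, g):
        # power-law rate: a rate constant times the substrate raised to its kinetic order
        return alpha * S ** g

    params, _ = curve_fit(power_law, substrate, flux, p0=[1.0, 0.5])
    print("fitted rate constant and kinetic order:", params)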

Once we had done this, we had two things. First, we had the most likely
profiles for protein activity, which is something that experimentally could
not be measured. Second we had functioning mathematical models for the
various lines. With these models in hand we could ask two additional ques-
tions:

1. What parts of the pathway can we further engineer in order to
manipulate vitamin production?

2. How and by how much should we manipulate them?

To answer these questions we performed a sensitivity analysis of the parameters, which is a mathematical technique to identify which parameters (in this case genes) we should change to obtain a certain change in the dependent variables (in this case vitamin A).
These genes are basically those involved in the first steps of the pathway,
before and after the bifurcation. As a side note, we were able to further
identify a likely regulatory loop in the structure of the model. This was
due to the fact that we could not get a good fit to the data when using
only the known structure of the biosynthetic pathways. This meant that we
took advantage of the power law model and did the same as Ni & Savageau
(slides 74-76 of the Class 3 PowerPoint) and identified regulatory interactions that would improve the fitting. Of all tested interactions we found that inhibition of v3 in slide 103 of the Class 3 PowerPoint by Lut would significantly improve the
fitting. This has not been tested yet, but later on we found that this regulation
exists in other plants, so the prediction makes sense.

This example is fairly easy for you to grasp because there are numbers in
it. However you can also ask semi-quantitative questions about pathways for
which you have few or no numbers. Consider the circuit shown in Slide 68 of
Class 3 powerpoint. This is an abstract representation that can represent any
linear biosynthetic pathway in a cell. Typically, in such pathways, the final
product exerts a negative feedback on the first reaction (overall feedback). X4
represents the cellular demand for the product X3, whatever that product is.
Using a power law approximation, we can write the system of differential
equations for the circuit.
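The equations themselves are on the slides; one hedged guess at what such a power-law (GMA-type) system looks like, with symbols chosen purely for illustration, is

    \frac{dX_1}{dt} = \alpha_0\,X_0^{g_{00}} X_3^{g_{03}} - \alpha_1\,X_1^{g_{11}}, \qquad
    \frac{dX_2}{dt} = \alpha_1\,X_1^{g_{11}} - \alpha_2\,X_2^{g_{22}}, \qquad
    \frac{dX_3}{dt} = \alpha_2\,X_2^{g_{22}} - \alpha_3\,X_3^{g_{33}},

where $g_{03} < 0$ encodes the overall feedback inhibition by the end product and the last term stands for the demand step ($X_4$).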

Just by having these equations, we can analyze the long term or homeo-
static behavior of the system. This is done by equating the right-hand sides of the equations to zero and solving the resulting algebraic equations. This is known as calculating the steady state of the system, and you can get information about which parameters have a bigger control over your system simply
by calculating the derivative of the solutions of the algebraic equations with
respect to the parameters of interest. This type of analysis allows you to
understand how the system behaves on the long term. If you are looking
at faster responses then you most likely need numbers for the parameters
and numerical simulations. However, even in such cases you can use nor-
malization techniques that decrease the number of parameters you have to
estimate.
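A small sketch of this steady-state calculation for the power-law system written out above: because the fluxes are power laws, equating production and consumption makes the steady-state equations linear in the logarithms of the concentrations, so they can be solved with basic linear algebra. All parameter values below are made up:

    import numpy as np

    # made-up kinetic orders and rate constants for the pathway sketched above
    g00, g03 = 0.5, -0.5          # input flux orders (g03 < 0: overall feedback inhibition)
    h1 = h2 = h3 = 1.0            # kinetic orders of the consuming steps
    a0, a1, a2, a3 = 2.0, 1.0, 1.0, 1.0
    y0 = np.log(1.0)              # logarithm of the fixed input X0

    # production = consumption for each Xi, written in logarithmic coordinates y_i = ln X_i
    A = np.array([[-h1, 0.0, g03],
                  [ h1, -h2, 0.0],
                  [0.0,  h2, -h3]])
    b = np.array([-(np.log(a0) - np.log(a1) + g00 * y0),
                  -(np.log(a1) - np.log(a2)),
                  -(np.log(a2) - np.log(a3))])
    y = np.linalg.solve(A, b)
    print("steady-state concentrations X1, X2, X3:", np.exp(y))

    # logarithmic gain of X3 with respect to the input X0: a simple measure of
    # how much control this input has over the end product at steady state
    dy = np.linalg.solve(A, np.array([-g00, 0.0, 0.0]))
    print("d ln X3 / d ln X0 =", dy[2])

The same linear system can be reused to obtain sensitivities with respect to the other parameters.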

Irrespective of the model you have in hand you should never forget that
your model is only a model. It needs to be validated by contrasting its
results to those that are known about the system. If the model fails in this
comparison, then you need to go back to the drawing board and improve it.
However, if the model is validated, you should also never forget that a model
is never valid over all possible conditions. If you push it hard enough it will
also fail.
6.4 Noise in Biology

I will conclude this chapter by discussing a couple of issues that are likely not to be relevant in the modeling exercise, but that you should keep in mind both for modeling and for any other research you might engage in: biology is noisy. There are two ways in which I mean this.
First, whenever you take a sample from a culture, a tissue or an organism
you homogenize all the variability in the sample. Sometimes that variability
is important, sometimes it is not. For example, when you look at the image on the left side of slide 109 of the Class 3 PowerPoint, you find measurements of gene
expression levels in clonal mice. If you look at mice that are analyzed under
similar conditions, you will see that there is a reasonable amount of variability
in the levels of gene expression. This means that this variability must be taken
into account when planning for experiments or analyzing results. This
variability is also very important for the development of personalized or
precision medicine. It can also have consequences for how you create and
analyze models. If the variability is important then you should account for it
in your model. You can do this by creating statistical models and solving master equations, or by considering the parameter distributions in your deterministic model, solving that model many times, and analyzing the ensemble of results (a minimal version of this ensemble approach is sketched below).
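A minimal sketch of the ensemble idea, assuming a toy one-variable synthesis/degradation model and log-normally distributed parameters; the model, the distributions, and all numbers are invented for illustration:

    import numpy as np
    from scipy.integrate import solve_ivp

    rng = np.random.default_rng(0)

    def model(t, x, k_syn, k_deg):
        # toy model: constant synthesis and first-order degradation of one species
        return [k_syn - k_deg * x[0]]

    finals = []
    for _ in range(200):
        k_syn = rng.lognormal(mean=np.log(1.0), sigma=0.3)   # assumed cell-to-cell spread
        k_deg = rng.lognormal(mean=np.log(0.1), sigma=0.3)
        sol = solve_ivp(model, (0, 100), [0.0], args=(k_syn, k_deg))
        finals.append(sol.y[0, -1])

    print("ensemble mean of the final level:", np.mean(finals))
    print("ensemble standard deviation:     ", np.std(finals))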
Second, when creating and simulating a mathematical model, traditional
methods to solve ordinary differential equations assume that the numbers of molecules are quite large and that the amounts (concentrations) of the various molecules are
continuous. However, single cells have a number of molecules that can be
quite small. For example, only a small number of proteins has a copy number
per cell that is higher than 1000 copies. Below this number, the cells work on
a stochastic regime and continuous simulations are not an accurate represen-
tation of the cellular processes. Because of that, people developed methods
to perform stochastic integration of systems of differential equations. The
differences between typical continuous and stochastic integration algorithms
can be seen by comparing slides 96 and 97. The continuous solution of a system is done by calculating all rates simultaneously, using those rates to calculate by how much each variable will change, updating the variables, and continuing to calculate until the final time of the simulation is reached. The stochastic simulation requires more computational resources. It uses the rates of the individual reactions to generate a random number. This random number decides which reaction occurs in a given time step, and only one reaction is assumed to occur in each time step. When the random number selects a reaction, one occurrence of that reaction takes place, the algorithm updates the number of molecules involved in that reaction, and another random number is drawn. This goes on until the end of the simulation time.
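For completeness, here is a minimal sketch of the standard Gillespie direct method for the simplest possible case, a single species that is produced at a constant rate and degraded by a first-order reaction; the rate constants are made up:

    import numpy as np

    def gillespie_birth_death(k_prod=10.0, k_deg=0.1, n0=0, t_end=100.0, seed=0):
        rng = np.random.default_rng(seed)
        t, n = 0.0, n0
        times, counts = [t], [n]
        while t < t_end:
            a_prod = k_prod                          # propensity of the production reaction
            a_deg = k_deg * n                        # propensity of the degradation reaction
            a_total = a_prod + a_deg
            t += rng.exponential(1.0 / a_total)      # waiting time to the next reaction event
            if rng.random() < a_prod / a_total:      # pick which reaction fires
                n += 1
            else:
                n -= 1
            times.append(t)
            counts.append(n)
        return np.array(times), np.array(counts)

    times, counts = gillespie_birth_death()
    print("final molecule count:", counts[-1], "after", len(times) - 1, "reaction events")

Note that in the direct method the waiting time to the next event is itself drawn from an exponential distribution, rather than advancing by a fixed time step.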
A final aspect you should be aware of when modeling is that sometimes there are non-homogeneities in your system and that gradients could be
important. If that is the case, then your model should be done using some
form of partial differential equations, where space is considered. However, in
this course we will avoid such systems, and I just mentioned them to make
you aware of the situation.
