Bioinformatics PDF
A SEMINAR REPORT
Submitted by
NITHYA K. PILLAI
of
BACHELOR OF TECHNOLOGY
IN
SCHOOL OF ENGINEERING
KOCHI-682022
AUGUST 2008
DIVISION OF COMPUTER SCIENCE AND ENGINEERING
SCHOOL OF ENGINEERING
COCHIN UNIVERSITY OF SCIENCE & TECHNOLOGY,
COCHIN-682 022
ACKNOWLEDGEMENT
At the outset, I thank the Lord Almighty for the grace, strength and
hope to make my endeavor a success.
Last but not least, I thank all others, and especially my classmates
and my family members who in one way or another helped me in the
successful completion of this work.
NITHYA K.PILLAI
ABSTRACT
of life-threatening diseases. Gene chips will be able to screen heart attack
patients will go to a doctor’s clinic with lab-on-a-chip devices. The device
will inform the doctor in real time if the patient’s ailment will respond to a
drug based on his DNA. These will help doctors diagnose life-threatening
chip would also confirm the patient’s identity and even establish paternity.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
2. EVOLUTION OF BIOINFORMATICS
3. HUMAN ELECTRONICS
4. GENE EXPRESSION
5. CHIP ELECTRONICS
5.1 Biochips
5.2 Clinical chips
6. PRESENT GOALS OF BIOINFORMATICS
7. ALGORITHMS USED
7.1 Comparing sequences
7.2 Constructing evolutionary trees
7.3 Detecting patterns
7.4 Determining 3D structure
8. APPLICATIONS
8.1 Applications of internal chips
8.2 Applications of external chips
9. FUTURE DEVELOPMENTS
10. CONCLUSION
11. REFERENCES
LIST OF FIGURES
1. DNA
2. INVOLVEMENT OF COMPUTERS
3. GENE EXPRESSION
4. MICROARRAY
5. ACTIVA IMPLANT
6. COCHLEAR IMPLANT
7. CHIP IMPLANTED IN EYE
8. GENE CHIP
LIST OF TABLES
NO NAME
1. SOURCE OF DATA
BIOINFORMATICS
1. INTRODUCTION
2. EVOLUTION OF BIOINFORMATICS
DNA is the genetic material of an organism. It contains all the information needed
for the development and existence of the organism. The DNA molecule is formed of two
long polynucleotide chains that are spirally coiled around each other, forming a double
helix. Thus it has the form of a spirally twisted ladder. DNA is a molecule made from
sugar, phosphate and bases. The bases are guanine (G), cytosine (C), adenine (A) and
thymine (T). Adenine pairs only with thymine, and guanine pairs only with cytosine.
The various combinations of these bases make up DNA, for example AAGCT, CCAGT and
TACGGT. Because a sequence can be of any length, the number of possible combinations
of these bases is practically unlimited. A gene is a sequence of DNA that represents a
fundamental unit of heredity. The human genome consists of approximately 3 billion base
pairs and was estimated, at the time of writing, to contain approximately 30,000 genes.
Currently, scientists are trying to determine the entire DNA sequence of various
living organisms. DNA sequence analysis can identify genes, regulatory sequences and
other functional elements. Molecular biology, algorithms, and computing have helped in
sequencing larger portions of the genomes of several species. Sequencing is the
determination of the order of nucleotides in DNA, or of the order of amino acids in a
protein. Sequence analysis, which is at the core of bioinformatics, enables the
functions of genes to be identified.
mapping, that is, pinpointing the genomic location of all genes and markers; and DNA
sequencing, that is, reading the chemical "text" of all the genes and their intervening
sequences. DNA sequences are entered into large databases, where they can be compared
with known genes, including inter-species comparisons. The explosion of publicly
available genomic information resulting from the Human Genome Project has precipitated
the need for bioinformatics capabilities.
Determination of genome organization and gene
regulation will promote the understanding of how humans develop from single cells to
adults, why this process sometimes goes wrong, and what changes take place as
people age. Bioinformatics finds applications in medicine for recommending individually
tailored drugs based on an individual's genetic profile. It helps to identify a specific
genetic sequence that is responsible for a particular disease, its associated protein, and
that protein's function. New drugs can then be developed to treat the disease.
3. HUMAN ELECTRONICS
The nucleus is the most obvious organelle in the human cell. Within the nucleus
is the DNA responsible for providing the cell with its unique characteristics. The DNA is
essentially the same in every cell of the body, but depending on the specific cell type,
some genes may be turned on or off. That is why a liver cell is different from a muscle
cell, and a muscle cell is different from a fat cell. About 99.9% of the sequence is
identical between any two people, but the small percentage of DNA that differs can
relate to an individual’s disease. Scientists are using DNA chips to compare sequences
from healthy people with those from patients with a specific disease, to help identify
genetic targets for drug discovery. Information about genetic variation can help to
predict which patients are likely to benefit from specific drugs.
Fig. 1. DNA
Electronic circuits can be incorporated in the chip to detect various states of DNA.
DNA carries an electric charge, and that charge can be read on the chip, much like cells
in a memory array. Such a DNA chip could be used to diagnose life-threatening bacterial
infections.
In DNA the medium is a chain of two alternating units (phosphate and deoxyribose), and
the most easily recognizable message is provided by a sequence of letters (bases)
attached to the chain. The DNA has two sequences of letters wrapped around each other in
the form of a double helix. One is the complement of the other, so that the sequence of
one string (strand) can be inferred from the sequence of the other. The DNA sequence of
bases encodes sequences built from 20 amino acids. Under instructions received from DNA,
amino acids join together in the same order as they are encoded in DNA to form proteins.
Proteins are chains of amino acids that fold in complicated ways, and they play a major
role in determining how we interact with the
environment.
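The complementarity rule described above (A pairs with T, G pairs with C) can be illustrated with a minimal Python sketch. The function names `complement` and `reverse_complement` are illustrative choices, not part of any tool mentioned in this report, and the sketch assumes the input contains only the four letters A, C, G and T.

```python
# Watson-Crick pairing: A <-> T and G <-> C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand written in its own reading direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("AAGCT"))          # TTCGA
    print(reverse_complement("AAGCT"))  # AGCTT
```

Because the two strands run in opposite directions, the partner strand is usually written as the reversed complement; applying `reverse_complement` twice returns the original strand.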
Genomic information is revolutionizing the life sciences. The quest for understanding
how genetic factors contribute to human disease is gathering speed. The 46
human chromosomes house almost three billion base pairs of DNA that were estimated to
contain 30,000 to 40,000 protein-coding genes. Bioinformatics is used to find out how
genes contribute to diseases that have a complex pattern of inheritance, such as
diabetes, asthma, and mental illness. No one gene can tell whether a person has such a
disease or not. A number of genes may each make a subtle contribution to a person's
susceptibility to a disease. Genes may also affect how a person reacts to the
environment. Because the entire human genome is such a large sequence, sequencing and
reading a genome demand heavy computational
resources.
Bioinformatics is largely, although not exclusively, a computer-based discipline.
Computers are important in bioinformatics for two reasons:
First, many bioinformatics problems require the same task to be repeated millions
of times, for example comparing a new sequence to every other sequence stored in a
database, or comparing a group of sequences systematically to determine evolutionary
relationships. In such cases, the ability of computers to process information and test
alternative solutions rapidly is indispensable.
Second, computers are required for their problem-solving power.
Typical problems that might be addressed using bioinformatics include solving the
folding pathway of a protein given its amino acid sequence, or deducing a biochemical
pathway given a collection of RNA expression profiles. Computers can help with such
problems, but it is important to note that expert input and robust original data are also
required.
We start with an overview of the sources of information. These may be divided into raw
DNA sequences, protein sequences, macromolecular structures, genome sequences, and
other whole-genome data. Raw DNA sequences are strings of the four base letters
comprising genes, each typically 1,000 bases long. The GenBank repository of nucleic
acid sequences held a total of 9.5 billion bases in 8.2 million entries (all
database figures as of August 2000). At the next level are protein sequences comprising
strings of 20 amino acid letters. At that date there were about 300,000 known protein
sequences.
Table 1. Sources of data used in bioinformatics, the quantity of each type of data that is
currently available, and bioinformatics subject areas that utilise this data.
4. GENE EXPRESSION
Fig. 4. Microarray
The most significant and the biggest application of DNA chips is the use of DNA
microarrays for expression profiling. In expression profiling, the chip shows which
genes are turned on or off to create certain types of cells. If a gene is
expressed in one way, it may result in normal muscle, for instance. If it is expressed in
another way, it may result in a tumour. By comparing these different expression patterns,
researchers hope to discover ways to predict and perhaps to prevent diseases.
5. CHIP ELECTRONICS
5.1 BIOCHIP
A decade ago, an eight-year-old child jumped from his swing set and landed flat,
shattering a leg bone where most children would have sprained an ankle. An X-ray revealed
the problem: where there should have been hard bone, a soft tumour was present. The
child needed a precise diagnosis. If the cancer was aggressive, it needed immediate
treatment with the powerful but toxic drug adriamycin. If the tumour was growing
slowly, doctors had time to try weaker but safer drugs.
A biopsy was inconclusive. Like many paediatric bone tumours, the child’s tumour
was a small, round blue-cell tumour. This left the doctors with a dilemma. Because
adriamycin can cause serious heart damage, they were unwilling to give it to the
child, and not all blue-cell tumours spread aggressively enough to require this
potentially deadly medicine. The doctors hoped that a less toxic medicine would suffice
and gave that to the child, who died just six months later. Today, rapid advances in
bioinformatics are providing new hope to such patients. The new technology enables
doctors to go straight to the genetic codes that instruct tumours to grow, finding
invisible molecular signals that differentiate cancers as well as a host of other deadly
diseases.
The key to this life-saving, cost-effective diagnostic power is a tiny glass chip
peppered with DNA strips, called the gene chip. Today, 60% of gene chips are used for
research purposes, where they are speeding up drug design and helping researchers
to mine genomic databases.
To perform these tasks, one usually has to investigate homologous sequences or proteins
for which genes have been determined and structures are available. Homology between
two sequences (or structures) suggests that they have a common ancestor. Since those
ancestors may well be extinct, one hopes that similarity at the sequence or structural level
is a good indicator of homology.
Based on the data now available, we present the various algorithms that lead to a
better understanding of gene function. They can be summarized as follows:
algorithms of cubic complexity. The inference of shapes of proteins from amino acid
sequences remains an unsolved problem.
From the biological point of view, sequence comparison is motivated by the fact that
all living organisms are related by evolution. That implies that the genes of species that
are closer to each other should exhibit similarities at the DNA level; one hopes that
those similarities also extend to gene function.
The following definitions are useful in understanding what is meant by the
comparison of two or more sequences. An alignment is the process of lining up sequences
to achieve a maximal level of identity. That level expresses the degree of similarity
between sequences. Two sequences are homologous if they share a common ancestor,
which is not always easy to determine. The degree of similarity obtained by alignment can
be useful in determining the possibility of homology between two sequences.
In biology, the sequences to be compared are either nucleotides (DNA, RNA) or amino
acids (proteins). In the case of nucleotides, one usually aligns identical nucleotide
symbols. When dealing with amino acids the alignment of two amino acids occurs if they
are identical or if one can be derived from the other by substitutions that are likely to
occur in nature. An alignment can be either local or global. In the former, only portions
of the sequences are aligned, whereas in the latter one aligns over the entire length of the
sequences. Usually, one uses gaps, represented by the symbol “-”, to indicate that it is
preferable not to align two symbols because in so doing, many other pairs can be aligned.
In local alignments there are larger regions of gaps. In global alignments, gaps are
scattered throughout the alignment.
A measure of likeness between two sequences is percent identity: once an alignment is
performed we count the number of columns containing identical symbols. The percent
identity is the ratio between that number and the number of symbols in the (longest)
sequence. A possible measure or score of an alignment is calculated by summing up the
matches of identical (or similar) symbols and counting gaps as negative.
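The two measures just defined can be sketched in a few lines of Python. The weights used here (+1 for a match, -1 for a mismatch, -2 for a gap) are assumptions chosen only for illustration, and the function names are hypothetical.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity of two aligned rows, with gaps written as '-'."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    # Count the columns that hold the same (non-gap) symbol in both rows.
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")
    # Divide by the number of symbols in the longer of the two sequences.
    longest = max(len(row_a.replace("-", "")), len(row_b.replace("-", "")))
    return 100.0 * identical / longest if longest else 0.0


def alignment_score(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum column weights: matches count positively, gaps negatively."""
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


if __name__ == "__main__":
    print(percent_identity("abcd", "-b-d"))  # 50.0
    print(alignment_score("abcd", "-b-d"))   # -2
```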
With these preliminary definitions in mind, we are ready to describe the algorithms that
are often used in sequence comparison.
7.1.1. Pairwise Alignment. Many of the methods of pattern matching used in computer
science assume that matches contain no gaps. Thus there is no match for the pattern bd in
the text abcd. In biological sequences, gaps are allowed and an alignment abcd with bd
yields the representation:
abcd
-b-d
Similarly, an alignment of abcd with buc yields:
ab-cd
-buc-
The above implies that gaps can appear both in the text and in the pattern. Therefore there
is no point in distinguishing texts from patterns. Both are called sequences. Notice that, in
the above examples, the alignments maximize matches of identical symbols in both
sequences. Therefore, sequence alignment is an optimization problem. A similar problem
exists when we attempt to automatically correct typing errors like character replacements,
insertions, and deletions. Google and Word, for example, are able to handle some typing
errors and display suggestions for possible corrections. That implies searching a
dictionary for best matches.
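As a rough illustration of alignment as an optimization problem, the following Python sketch uses the classic Needleman-Wunsch dynamic programming scheme for global alignment. Naming that scheme here is an assumption about which DP method is intended, and the weights (+1 match, -1 mismatch, -1 gap) are illustrative values only.

```python
def global_align(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Needleman-Wunsch global alignment; returns (score, row_a, row_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align two symbols
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one best alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    print(global_align("abcd", "bd"))  # (0, 'abcd', '-b-d')
```

The table has one cell per pair of prefixes, so both time and memory grow with the product of the two sequence lengths.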
7.1.2. Aligning Amino Acid Sequences. The DP algorithm is applicable to any
sequence provided the weights for comparisons and gaps are properly chosen. When
aligning nucleotide sequences the previously mentioned weights yield good results. A
more careful assessment of the weights has to be done when aligning sequences of amino
acids. This is because the comparison between any two amino acids should take evolution
into consideration. Biologists have developed 20×20 triangular matrices that provide the
weights for comparing identical and different amino acids as well as the weight that
should be attributed to gaps. The two most frequently used families of matrices are known
as PAM (Point Accepted Mutation, sometimes expanded as Percent Accepted Mutation) and
BLOSUM (Blocks Substitution Matrix). These matrices
reflect the weights obtained by comparing the amino acid substitutions that have
occurred through evolution. They are often called substitution
matrices.
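The idea of a triangular substitution matrix can be sketched as follows. The weights in `TOY_MATRIX` are made-up illustrative values for three amino acids (L = leucine, I = isoleucine, D = aspartic acid) and should not be read as actual PAM or BLOSUM entries; the function names are likewise hypothetical.

```python
# Toy substitution weights. Only the upper triangle is stored because the
# weight for (x, y) is the same as the weight for (y, x).
TOY_MATRIX = {
    ("D", "D"): 6, ("D", "I"): -3, ("D", "L"): -4,
    ("I", "I"): 4, ("I", "L"): 2,
    ("L", "L"): 4,
}


def weight(x: str, y: str) -> int:
    """Look up the weight for a pair of amino acids in either order."""
    key = (x, y) if x <= y else (y, x)
    return TOY_MATRIX[key]


def score_ungapped(seq_a: str, seq_b: str) -> int:
    """Score two equal-length sequences column by column, without gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    return sum(weight(x, y) for x, y in zip(seq_a, seq_b))


if __name__ == "__main__":
    # Swapping the chemically similar L and I still scores well.
    print(score_ungapped("LID", "ILD"))  # 10
    # Replacing L by the dissimilar D is penalised.
    print(score_ungapped("LLD", "DLL"))  # -4
```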
7.1.3. Complexity Considerations and BLAST. The quadratic complexity of the DP-based
algorithms renders their usage prohibitive for very large sequences. Recall that the
present genomic database contains about 30 billion base pairs (nucleotides), and
thousands of users accessing that database simultaneously would like to determine whether
a sequence being studied, made up of thousands of symbols, can be aligned with
existing data. That is a formidable problem! The program called BLAST (Basic Local
Alignment Search Tool), developed by the National Center for Biotechnology Information
(NCBI), has been designed to meet that challenge. The best way to explain the workings
of BLAST is to recall the approach using dot matrices. In BLAST the sequence whose
presence one wishes to investigate in a huge database is split into smaller subsequences.
The presence of those subsequences in the database can be determined efficiently (say by
hashing and indexing).
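The splitting and indexing step can be sketched in Python as below. This is only a simplified illustration of the seeding idea, not the actual BLAST implementation; the subsequence length `k`, the toy database and the function names are all assumptions.

```python
from collections import defaultdict


def build_index(database: dict, k: int) -> dict:
    """Map every length-k subsequence to the (name, offset) places it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def seed_hits(query: str, index: dict, k: int) -> list:
    """Return (query_offset, name, database_offset) for each exact k-mer match."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, dpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, name, dpos))
    return hits


if __name__ == "__main__":
    toy_database = {"seq1": "AAGCTTACG", "seq2": "CCAGTAAGC"}
    index = build_index(toy_database, k=4)
    print(seed_hits("GCTTAC", index, k=4))
    # [(0, 'seq1', 2), (1, 'seq1', 3), (2, 'seq1', 4)]
```

The three hits lie on one diagonal of the dot matrix (database offset minus query offset is always 2), which is the kind of region a search tool would then try to extend into a longer local alignment.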
(3) Phylograms are extended cladograms in which the length of a branch quantifies the
number of genetic transformations that occurred between a given node and its immediate
ancestor.
(4) Ultrametric trees are phylograms in which the accumulated distance from the
root to each of the leaves is the same number; ultrametric trees are therefore
the ones that provide the most information about evolutionary changes. They are also the
most difficult to construct. The above definitions suggest establishing some sort of
molecular clock in which mutations occur at some predictable rate, so that there exists a
linear relationship between time and number of changes.
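One way to test whether a table of pairwise distances is compatible with an ultrametric tree is the three-point condition: among the three distances between any three leaves, the two largest must be equal. The Python sketch below applies that check; the function name and the example distances are illustrative assumptions, not data from this report.

```python
from itertools import combinations


def is_ultrametric(dist: dict, tol: float = 1e-9) -> bool:
    """Check the three-point condition on a nested dict of leaf distances."""
    for x, y, z in combinations(list(dist), 3):
        # Sort the three pairwise distances; the two largest must coincide.
        _, middle, largest = sorted([dist[x][y], dist[x][z], dist[y][z]])
        if abs(largest - middle) > tol:
            return False
    return True


if __name__ == "__main__":
    clock_like = {
        "A": {"A": 0, "B": 2, "C": 6},
        "B": {"A": 2, "B": 0, "C": 6},
        "C": {"A": 6, "B": 6, "C": 0},
    }
    not_clock_like = {
        "A": {"A": 0, "B": 2, "C": 6},
        "B": {"A": 2, "B": 0, "C": 5},
        "C": {"A": 6, "B": 5, "C": 0},
    }
    print(is_ultrametric(clock_like))      # True
    print(is_ultrametric(not_clock_like))  # False
```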
introns can be interpreted in many different ways, thus accounting for the fact that a
given gene may generate alternate proteins depending on contexts.
7.3.2. Hidden Markov Models (HMMs). HMMs are widely used in biological sequence
analysis. HMMs can be viewed as variants of probabilistic or stochastic finite-state
transducers (FSTs). In an FST, the automaton changes states according to the input
symbols being examined. In a given state, the automaton also outputs a symbol.
Therefore, FSTs are defined by sets of states, transitions, and input and output
vocabularies. There is as usual an initial state and one or more final states. The automata
that we are dealing with can be and usually are nondeterministic. Therefore, upon
examining a given input symbol, the transition taken depends on the specified
probabilities. An HMM is a probabilistic FST in which there is also a set of pairs [p, s]
associated with each state; p is a probability and s is a symbol of the output vocabulary.
The sum of the p’s in each set of pairs within a given state also has to equal 1. One can
assume that the input vocabulary for an HMM consists of a unique dummy symbol (say, the
equivalent of an empty symbol). Actually, in the HMM paradigm, we are solely interested in
state transitions and output symbols. As in the case of finite-state automata, there is an
initial state and a final state. Upon reaching a given state, the HMM automaton produces
the output symbol s with probability p. The p’s are called emission probabilities. As
described so far, the HMM behaves as a string generator.
The main usage of HMMs is in the reverse problem: recognition or parsing. Given a
sequence of H’s and T’s, attempt to determine the most likely corresponding state
sequence of F’s and L’s.
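The recognition problem is commonly solved with the Viterbi algorithm, which is the technique sketched below in Python. The sketch assumes that F and L stand for a fair and a loaded coin and that H and T are heads and tails; all the probabilities are invented for illustration and are not taken from this report.

```python
import math

STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}                      # assumed start probabilities
TRANS = {"F": {"F": 0.9, "L": 0.1},               # assumed transition probabilities
         "L": {"F": 0.1, "L": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5},                # fair coin
        "L": {"H": 0.9, "T": 0.1}}                # loaded coin favours heads


def viterbi(observations: str) -> str:
    """Return the most likely state sequence for a string of H's and T's."""
    if not observations:
        return ""
    # best[s] = log-probability of the best path that ends in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]])
            for s in STATES}
    back = []  # one dict of back-pointers per later observation
    for symbol in observations[1:]:
        prev, best, pointers = best, {}, {}
        for s in STATES:
            source = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            best[s] = (prev[source] + math.log(TRANS[source][s])
                       + math.log(EMIT[s][symbol]))
            pointers[s] = source
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("HHHHHH"))  # LLLLLL
    print(viterbi("TTTTTT"))  # FFFFFF
```

Logarithms are used so that long sequences do not underflow when many small probabilities are multiplied together.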
section has two subsections. In the first, we cover some approaches available to infer 2D
representations from RNA sequences. In the second, we describe one of the most
challenging problems in biology: the determination of the 3D structure of proteins from
sequences of amino acids. Both problems deal with minimizing energy functions.
7.4.1. RNA Structure. It is very convenient to describe the RNA structure problem in
terms of parsing strings generated by context-free grammars (CFGs). As in the case of the
finite-state automata used in HMMs, we have to deal with highly ambiguous grammars.
The generated strings can be parsed in multiple ways and one has to choose an optimal
parse based on energy considerations. RNA structure is determined by the attractions
among its nucleotides: A (adenine) attracts U (uracil) and C (cytosine) attracts G
(guanine). These nucleotides will be represented using lowercase letters. The CFG rules
S → aSu/uSa/ε
generate palindrome-like sequences of u’s and a’s of even length. One could map such a
palindrome to a 2D representation in which each a in the left part of the generated string
matches the corresponding u in the right part of the string and vice versa. In this
particular case, the number of matches is maximal. This grammar is nondeterministic
since a parser would not normally know where the middle of the string to be parsed lies.
The grammar becomes highly ambiguous if we introduce a new nonterminal N generating
any sequence of a’s and u’s:
S → aSu/uSa/N
N → aN/uN/ε
Now the problem becomes much harder, since any string admits a very large number of
parses and we have to choose among all those parses the one that matches the most a’s
with u’s and vice versa. The corresponding 2D representation of that parse is what is
called a hairpin loop. An actual grammar describing RNA should also include the rules
specifying the attractions among c’s and g’s:
S → cSg/gSc
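Choosing the parse that matches the most nucleotides can be sketched with a Nussinov-style dynamic program, shown below in Python. This is a simplified stand-in for energy minimisation: it only maximises the number of nested a-u and c-g pairs, it allows neighbouring nucleotides to pair (as the grammar above does), and the function name is hypothetical.

```python
PAIRS = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c")}


def max_pairs(rna: str) -> int:
    """Maximum number of nested a-u / c-g pairs in a lowercase RNA string."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is left unpaired.
            value = best[i][j - 1]
            # Case 2: position j pairs with some position k in i..j-1.
            for k in range(i, j):
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    value = max(value, left + 1 + inner)
            best[i][j] = value
    return best[0][n - 1]


if __name__ == "__main__":
    print(max_pairs("aauu"))    # 2
    print(max_pairs("gaaucc"))  # 2
    print(max_pairs("aaa"))     # 0
```

The three nested loops over i, j and k give running time proportional to the cube of the sequence length.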
7.4.2. Protein Structure. The largest repository of 3D protein structures is the PDB
(Protein Data Bank): it records the actual x, y, z coordinates of each atom making up each
of its proteins. That information has been gathered mostly by X-ray crystallography and
NMR techniques. There are very valuable graphical packages (e.g., RasMol) that can
present the dense information in the PDB in a visually attractive and useful form,
allowing the user to observe a protein by rotating it to inspect its details viewed from
different angles. The outer surface of a protein consists of the amino acids that are
hydrophilic (they tolerate well the watery medium that surrounds the protein). In
contrast, the hydrophobic amino acids usually occupy the protein’s core. The
configuration taken by the protein is one that minimizes the energy of the various
attractions and repulsions among the constituent atoms.
A domain is a portion of the protein that has its own function. Domains are capable of
independently folding into a stable structure. The combination of domains determines the
protein’s function. Protein folding, the determination of protein structure from a given
sequence of amino acids, is one of the most difficult problems in present-day science.
The approaches that have been used to solve it can only handle short sequences and
require the capabilities of the fastest parallel computers available.
8. APPLICATIONS
1. Internal biochips
2. External biochips
1. lab on a chip
2. mass spectrometry
1. GLUCOSE MEASUREMENT
Nowadays diabetics measure the level of glucose (sugar) in their blood by using a
skin prick and a hand-held blood test, and medicate themselves with insulin. The
disadvantage of this simple system is that the need to draw blood discourages diabetics
from testing their sugar levels as often as they could.
With biochips the measurement can be done in a much simpler way. The
chips, smaller than an uncooked grain of rice, can be injected under the skin. They
sense the glucose level and send the result back out by radio-frequency communication.
If there are any postoperative problems, the stimulator (pulse generator) can
simply be turned off.
3. COCHLEAR IMPLANT
Present-day hearing aids are glorified amplifiers, but the cochlear
implant is for patients who have lost the hair cells that detect sound waves. For these
individuals no amount of amplification is enough.
The cochlear implant delivers electrical pulses directly to the nerve cells in the
cochlea, the spiral-shaped structure that translates sound into nerve pulses. In
individuals with normal hearing, sound waves set up vibrations in the walls of the
cochlea, and hair cells detect these vibrations.
High-frequency notes vibrate the base of the cochlea, while low-frequency notes
vibrate near the top of the spiral.
The cochlear implant does the job of the hair cells. It splits the frequencies of
incoming sounds into a number of channels and then stimulates the appropriate part of the
cochlea.
Increasing the number of channels improves sound perception. But speech is
perceived in an area of the cochlea only 14 mm long, and spacing the electrodes too close
to each other causes signals to bleed from one channel to another. This results in a
blurred version of hearing.
4. EYE IMPLANT
Vision occurs when the light reflected from a body is received by photoreceptors, the
light-sensing cells at the back of the eye. Blindness occurs if the photoreceptors are
lost, as in retinitis pigmentosa, a genetic disease, and in age-related macular
degeneration.
The chip used in an eye implant performs the function of the photoreceptors. The chip will
be at least ten times thinner than a human hair, with an area of 1 mm².
There will be a camera mounted on a pair of glasses. The camera will detect and encode
the scene and then send it into the eye as a laser pulse. The laser will also provide the
energy to drive the chip. The energy required to stimulate a nerve cell in the eye is
almost 100 times lower than that required to stimulate a nerve cell in the ear.
5. PERSON IDENTIFICATION
Biochips implanted into the human body can carry an identification number
and all the details about that person. This can help agencies to locate lost children,
soldiers and Alzheimer’s patients. Biochips are widely used in the identification of
criminals and terrorists in America.
1. LAB ON A CHIP
Biochips scan and process biological data very rapidly. The technology is commonly
known as ‘lab on a chip’. The idea of a cheap and reliable computer-chip lookalike that
performs thousands of biological reactions is very attractive to drug developers, because
these chips automate highly repetitive laboratory tasks by replacing cumbersome
equipment with miniaturized microfluidic assay chemistries. Biochips are able to
provide ultrasensitive detection methodologies at significantly lower cost per assay than
traditional methods, and in a much smaller amount of space.
2. MASS SPECTROMETRY
Mass spectrometry determines molecular structures from ionized samples of
materials. Biochips can be used to perform mass spectrometry, and research is under way in
that area. This can help to save much space and time in laboratories.
Gene chips will be able to screen for diseases like heart attack and diabetes years
before patients develop symptoms. These will help doctors diagnose life-threatening
illnesses faster, replacing expensive, time-consuming ordeals like biopsies and
sigmoidoscopies with simple blood, saliva, stool, or urine tests. Gene chips reclassify
diseases based on their underlying molecular signals, rather than misleading surface
symptoms.
9. FUTURE DEVELOPMENT
A few specific areas that fall within the scope of bioinformatics are as follows:
1. Sequence assembly –
Once the DNA sequence of a fragment of the genome is determined, the next step
is to understand the function of the gene. This involves various analyses, which
are carried out by high-powered computing and specialised software. Many would
consider this activity the most important area of focus within bioinformatics.
3. Proteomics –
A relatively new area, proteomics studies not the entire genome, but the portion of
the genome that is expressed in particular cells. This involves finding correlations
between patterns of expression of the genes and a particular disease state to determine
likely targets for drug and/or gene therapy. Bioinformatics specialists work closely with
scientists to accomplish this.
4. Pharmacogenomics –
10. CONCLUSION
The days aren't far off when beauty salons will perform fundamental body changes
apart from customizing people's looks. If you aren't born perfect, free from any
disease and deformity, you need not despair. Rapid advances in bioinformatics are
providing new hope to such patients. At the first sign of a physical defect or deformity,
people will shop around for a better and stronger organically grown heart, brain, or
kidney, as the case may be. With bioinformatics mankind will be able to prolong its life
or even live forever.
11. REFERENCES
Websites
www.electronicsforu.com
www.inbios.org
www.bioinformatics.org
www.biochip.org