bioinformatics
Database: a large amount of data that is stored in a computer and can easily be used, added to, etc.
Nucleic acids are biopolymers, macromolecules, essential to all known forms of life.[1] They are
composed of nucleotides. The two main classes of nucleic acids are deoxyribonucleic acid
(DNA) and ribonucleic acid (RNA). If the sugar is ribose, the polymer is RNA; if the sugar is
deoxyribose, a version of ribose, the polymer is DNA. Nucleic acids are chemical compounds
that are found in nature. They carry information in cells and make up genetic material. One DNA
or RNA molecule differs from another primarily in the sequence
of nucleotides. Nucleotide sequences are of great importance in biology since they carry the
ultimate instructions that encode all biological molecules, molecular assemblies, subcellular and
cellular structures, organs, and organisms, and directly enable cognition, memory, and behavior.
Enormous efforts have gone into the development of experimental methods to determine the
nucleotide sequence of biological DNA and RNA molecules,[26][27] and today hundreds of
millions of nucleotides are sequenced daily at genome centers and smaller laboratories
worldwide. In addition to maintaining the GenBank nucleic acid sequence database, the National
Center for Biotechnology Information (NCBI, https://www.ncbi.nlm.nih.gov) provides analysis
and retrieval resources for the data in GenBank and other biological data made available through
the NCBI web site.
Genomes
The genome is the entire set of DNA instructions found in a cell. In humans, the genome consists
of 23 pairs of chromosomes located in the cell's nucleus, as well as a small chromosome in the
cell's mitochondria. A genome contains all the information needed for an individual to develop
and function.
DNA is the information molecule for all living organisms. All of the DNA of an organism is
called its genome. For example, the human genome contains about 3 billion nucleotides.
Proteins are formed by the condensation of amino acids, which are joined by peptide bonds. The
sequence of amino acids in a protein is called its primary structure. The secondary structure is
determined by the dihedral angles of the peptide backbone, the tertiary structure by the folding of
protein chains in space.
Sequence analysis
Sequence analysis in molecular biology includes a very wide range of relevant topics:
The comparison of sequences in order to find similarity, often to infer if they are related
(homologous)
Identification of intrinsic features of the sequence such as active sites, post-translational
modification sites, gene structures, reading frames, distributions of introns and exons, and
regulatory elements
Identification of sequence differences and variations such as point mutations and single-nucleotide
polymorphisms (SNPs), in order to obtain genetic markers
Revealing the evolution and genetic diversity of sequences and organisms
Identification of molecular structure from sequence alone
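As a small illustration of one item in the list above (finding point differences between sequences), here is a minimal Python sketch. It assumes the two sequences have already been aligned to the same length; the function name `point_differences` and the example sequences are made up for illustration and do not come from any database.

```python
def point_differences(ref, alt):
    """List (position, ref_base, alt_base) for every mismatching column
    of two already-aligned, equal-length sequences (positions are 1-based)."""
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [(i + 1, a, b) for i, (a, b) in enumerate(zip(ref, alt)) if a != b]


# Toy sequences (hypothetical): they differ only at position 4.
print(point_differences("ATGCCGTA", "ATGACGTA"))  # [(4, 'C', 'A')]
```

Real variant calling also has to handle insertions, deletions and sequencing errors, which this sketch ignores.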
In chemistry, sequence analysis comprises techniques used to determine the sequence of a
polymer formed of several monomers (see Sequence analysis of synthetic polymers). In
molecular biology and genetics, the same process is called simply "sequencing".
In social sciences and in sociology in particular, sequence methods are increasingly used to study
life-course and career trajectories, time use, patterns of organizational and national development,
conversation and interaction structure, and the problem of work/family synchrony. This body of
research is described under sequence analysis in social sciences.
Since the very first sequences of the insulin protein were characterized by Fred Sanger in 1951,
biologists have been trying to use this knowledge to understand the function of molecules. Sanger
later developed a method for sequencing DNA, called the “Sanger method” or Sanger sequencing, which was a milestone in
sequencing long strand molecules such as DNA. This method was eventually used in the Human Genome
Project. Advances in sequencing also led to the publication of
the first complete genome of a bacteriophage in 1977. Robert Holley and his team at Cornell University are
believed to be the first to sequence an RNA molecule. There are millions of protein and nucleotide
sequences known. Relationships between these sequences are usually discovered by aligning them
together and assigning this alignment a score. There are two main types of sequence alignment.
Pair-wise sequence alignment only compares two sequences at a time and multiple sequence
alignment compares many sequences. Two important algorithms for aligning pairs of sequences are
the Needleman-Wunsch algorithm and the Smith-Waterman algorithm. Popular tools for sequence
alignment include:
CLUSTALW
Clustal W is a general-purpose multiple sequence alignment program for DNA or
proteins. It produces biologically meaningful multiple sequence alignments.
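To make "assigning this alignment a score" concrete, the following Python sketch scores a pairwise alignment that has already been written out with gap characters. The scoring values (match +1, mismatch -1, gap -2) and the function name `alignment_score` are assumptions chosen for illustration; real tools typically use substitution matrices and more elaborate gap penalties.

```python
def alignment_score(row_a, row_b, match=1, mismatch=-1, gap=-2):
    """Score two aligned rows column by column; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            score += gap          # a residue aligned against a gap
        elif a == b:
            score += match        # identical residues
        else:
            score += mismatch     # substitution
    return score


# Five matches, one mismatch and one gap: 5*1 - 1 - 2 = 2
print(alignment_score("GATTACA", "GA-TGCA"))  # 2
```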
Alignment methods
Very short or very similar sequences can be aligned by hand. However, most interesting problems
require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be
aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to
produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect
patterns that are difficult to represent algorithmically (especially in the case of nucleotide
sequences). Computational approaches to sequence alignment generally fall into two categories:
global alignments and local alignments. Calculating a global alignment is a form of global
optimization that "forces" the alignment to span the entire length of all query sequences. By contrast,
local alignments identify regions of similarity within long sequences that are often widely divergent
overall. Local alignments are often preferable, but can be more difficult to calculate because of the
additional challenge of identifying the regions of similarity. A variety of computational algorithms have
been applied to the sequence alignment problem. These include slow but formally correct methods
like dynamic programming. They also include efficient heuristic algorithms or probabilistic methods
designed for large-scale database search, which are not guaranteed to find the best matches.
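The difference between global and local alignment can be sketched with a single dynamic programming routine. The Python below computes only the optimal score (no traceback), in the style of Needleman-Wunsch when `local=False` and Smith-Waterman when `local=True`. The linear gap penalty and the scoring values are assumptions made for brevity, not the parameters of any particular tool.

```python
def align_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Optimal alignment score of a and b by dynamic programming.

    local=False: global score (the whole of both sequences is aligned).
    local=True:  local score (best-scoring pair of substrings).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalized.
        for i in range(1, rows):
            H[i][0] = i * gap
        for j in range(1, cols):
            H[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            H[i][j] = cell
    return best if local else H[-1][-1]


print(align_score("ACGT", "AGT"))                           # 2 (three matches, one gap)
print(align_score("TTTACGTAAA", "GGACGTCC", local=True))    # 4 (shared region ACGT)
```

Both variants fill a table of size len(a) x len(b), which is why they are exact but slow on database-scale searches.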
Hybrid methods, known as semi-global or "glocal" (short for global-local) methods, search for the
best possible partial alignment of the two sequences (in other words, a combination of one or both
starts and one or both ends is stated to be aligned). This can be especially useful when the
downstream part of one sequence overlaps with the upstream part of the other sequence. In this
case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to
force the alignment to extend beyond the region of overlap, while a local alignment might not fully
cover the region of overlap.[8] Another case where semi-global alignment is useful is when one
sequence is short (for example a gene sequence) and the other is very long (for example a
chromosome sequence). In that case, the short sequence should be globally (fully) aligned but only
a local (partial) alignment is desired for the long sequence.
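The short-versus-long case just described can be sketched as a small change to global dynamic programming: gaps before and after the matched region of the long sequence are free, while the short sequence must be aligned in full. The Python below is a minimal score-only sketch; the function name `semiglobal_score` and the scoring values are assumptions for illustration.

```python
def semiglobal_score(short, long_seq, match=1, mismatch=-1, gap=-1):
    """Best score for aligning all of `short` against any stretch of `long_seq`."""
    m, n = len(short), len(long_seq)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    # Row 0 stays at zero: skipping a prefix of the long sequence is free.
    for i in range(1, m + 1):
        H[i][0] = i * gap  # but every residue of the short sequence must be placed
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = H[i - 1][j - 1] + (match if short[i - 1] == long_seq[j - 1] else mismatch)
            H[i][j] = max(diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
    # Taking the maximum over the last row makes the unused suffix free as well.
    return max(H[m])


print(semiglobal_score("ACGT", "TTTTACGTTTTT"))  # 4
```

A global alignment of the same pair would pay for the eight unmatched flanking bases, which is exactly the penalty the semi-global variant avoids.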
The rapid expansion of genetic data challenges the speed of current DNA sequence alignment algorithms.
The need for efficient and accurate methods of DNA variant discovery demands innovative
approaches to parallel processing in real time. Optical computing approaches have been suggested
as promising alternatives to the current electrical implementations, yet their applicability remains to
be tested.
Pairwise alignment
Pairwise sequence alignment methods are used to find the best-matching piecewise (local or global)
alignments of two query sequences. Pairwise alignments can only be used between two sequences
at a time, but they are efficient to calculate and are often used for methods that do not require
extreme precision (such as searching a database for sequences with high similarity to a query). The
three primary methods of producing pairwise alignments are dot-matrix methods, dynamic
programming, and word methods;[1] however, multiple sequence alignment techniques can also
align pairs of sequences. Although each method has its individual strengths and weaknesses, all
three pairwise methods have difficulty with highly repetitive sequences of low information content,
especially where the number of repetitions differs in the two sequences to be aligned.
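A dot-matrix comparison is the simplest of the three methods to sketch. In the Python below, each row corresponds to a position in the first sequence and each column to a position in the second; a `*` marks positions where words of length `window` are identical. The function name `dot_matrix` and the `window` parameter are assumptions for illustration; published dot-plot programs usually add thresholds and graphical output.

```python
def dot_matrix(a, b, window=1):
    """Return the dot plot of a against b as a list of strings.

    Cell (i, j) is '*' when a[i:i+window] equals b[j:j+window], else '.'.
    """
    rows = []
    for i in range(len(a) - window + 1):
        row = "".join(
            "*" if a[i:i + window] == b[j:j + window] else "."
            for j in range(len(b) - window + 1)
        )
        rows.append(row)
    return rows


# A non-repetitive sequence against itself gives a single clean diagonal.
for row in dot_matrix("ACGT", "ACGT"):
    print(row)

# A repetitive sequence produces extra off-diagonal hits.
for row in dot_matrix("ATAT", "ATAT", window=2):
    print(row)
```

The second plot shows matches off the main diagonal, a small-scale picture of why repetitive, low-information sequences are hard to align unambiguously.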
Structural alignment
Structural alignments, which are usually specific to protein and sometimes RNA sequences, use
information about the secondary and tertiary structure of the protein or RNA molecule to aid in
aligning the sequences. These methods can be used for two or more sequences and typically
produce local alignments; however, because they depend on the availability of structural information,
they can only be used for sequences whose corresponding structures are known (usually through
X-ray crystallography or NMR spectroscopy). Because both protein and RNA structure are more
evolutionarily conserved than sequence,[20] structural alignments can be more reliable between
sequences that are very distantly related and that have diverged so extensively that sequence
comparison cannot reliably detect their similarity.
Structural alignments are used as the "gold standard" in evaluating alignments for homology-based
protein structure prediction[21] because they explicitly align regions of the protein sequence that are
structurally similar rather than relying exclusively on sequence information. Clearly, however,
structural alignments cannot be used in structure prediction itself because at least one sequence in the
query set is the target to be modeled, for which the structure is not known. It has been shown that,
given the structural alignment between a target and a template sequence, highly accurate models of
the target protein sequence can be produced; a major stumbling block in homology-based structure
prediction is the production of structurally accurate alignments given only sequence information.
Phylogenetic analysis
Phylogeny is the history of the evolution of a species or group,
especially in reference to lines of descent and relationships among
broad groups of organisms.
In biology, phylogenetics is the study of the evolutionary history and relationships
among or within groups of organisms. These relationships are determined by
phylogenetic inference methods that focus on observed heritable traits, such as DNA
sequences, protein amino acid sequences, or morphology.
A phylogenetic tree is a diagram that represents evolutionary relationships among
organisms. Phylogenetic trees are hypotheses, not definitive facts. The pattern of
branching in a phylogenetic tree reflects how species or other groups evolved from a
series of common ancestors.
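One simple way to turn aligned DNA sequences into a tree hypothesis is a distance-based method. The Python sketch below computes the fraction of differing sites between each pair of sequences and then clusters with UPGMA (average linkage). UPGMA is chosen here only because it is short to write; the text above does not prescribe a method, and the taxon names and sequences are invented toy data. Branch lengths are omitted, and the tree is returned as nested tuples.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Cluster aligned sequences (dict: name -> sequence) into a nested-tuple tree."""
    names = list(seqs)
    trees = {i: name for i, name in enumerate(names)}   # cluster id -> subtree
    sizes = {i: 1 for i in trees}                       # cluster id -> leaf count
    dist = {(i, j): p_distance(seqs[names[i]], seqs[names[j]])
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)                  # closest pair of clusters
        merged = (trees.pop(i), trees.pop(j))
        ni, nj = sizes.pop(i), sizes.pop(j)
        # Size-weighted average distance from the merged cluster to each remaining one.
        new_dist = {}
        for k in trees:
            dik = dist[(min(i, k), max(i, k))]
            djk = dist[(min(j, k), max(j, k))]
            new_dist[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        dist = {pair: d for pair, d in dist.items() if i not in pair and j not in pair}
        dist.update(new_dist)
        trees[next_id] = merged
        sizes[next_id] = ni + nj
        next_id += 1
    return next(iter(trees.values()))


# Hypothetical aligned toy sequences.
toy = {"taxon_A": "ACGTACGT", "taxon_B": "ACGTACGA", "taxon_C": "ACCTTCGA"}
print(upgma(toy))  # ('taxon_C', ('taxon_A', 'taxon_B'))
```

The result groups taxon_A with taxon_B because they differ at one site out of eight, while taxon_C differs from them at three and two sites. Like any tree, this is a hypothesis that depends on the data and on the method's assumptions.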
Phylogenetic analysis provides an in-depth understanding of how species evolve
through genetic changes. Using phylogenetics, scientists can evaluate the path that
connects a present-day organism with its ancestral origin, and can also predict the
genetic divergence that may occur in the future.