3.8 Multiple Sequence Alignment
Microbiology
Unit no: 3
Unit title: Introduction to databases
Subject name and code: Bioinformatics and Biostatistics, 02MB301
• Compare all the sequences two by two, producing a global alignment and a
series of local alignments (using lalign).
• By default, an alignment displays the following symbols to denote the degree of
conservation observed in each column:
"*" - residues or nucleotides in that column are identical in all sequences in the
alignment.
":" - conserved substitutions have been observed.
"." - semi-conserved substitutions have been observed.
• ‘Download Alignment File’ - to download the alignment in .aln format.
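The symbol line described above can be sketched in a few lines of Python. This is an illustration only: the "strong" and "weak" residue groups below are the ones commonly quoted for Clustal-style protein output and are an assumption here, as is the choice to print a blank for any column containing a gap. The function names are hypothetical.

```python
# Residue groups commonly quoted for Clustal-style protein output
# (an assumption; check the documentation of the tool you use).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def conservation_symbol(column):
    """Return '*', ':', '.' or ' ' for one alignment column (a string)."""
    residues = set(column.upper())
    if "-" in residues:
        return " "   # assumption: columns with a gap get no symbol
    if len(residues) == 1:
        return "*"   # identical in all sequences
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"   # conserved substitution
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."   # semi-conserved substitution
    return " "


def conservation_line(aligned_seqs):
    """Build the symbol line for a list of equal-length aligned sequences."""
    return "".join(conservation_symbol("".join(col))
                   for col in zip(*aligned_seqs))


print(conservation_line(["MKSA", "MRTG"]))
```

For the two toy sequences, the columns are M/M (identical), K/R and S/T (same strong group) and A/G (same weak group only).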
• Cladogram
-Branching diagram (tree) assumed to be an estimate of a
phylogeny where the branches are of equal length, thus cladograms show
common ancestry, but do not indicate the amount of evolutionary "time"
separating taxa.
Neighbour-joining tree without correcting the distances
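"Without correcting the distances" is usually taken to mean that the tree is built from the observed proportion of differing sites (the p-distance), with no correction for multiple substitutions at the same site. A minimal sketch of that distance follows, assuming gapped columns are skipped pairwise; the function names are hypothetical and the tree-building step itself is not shown.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites among columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def distance_matrix(aligned_seqs):
    """Symmetric matrix of uncorrected pairwise distances."""
    n = len(aligned_seqs)
    return [[0.0 if i == j else p_distance(aligned_seqs[i], aligned_seqs[j])
             for j in range(n)]
            for i in range(n)]


print(distance_matrix(["ACGT", "ACGA", "AC-T"]))
```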
submission details of the alignment
MUSCLE
• MUltiple Sequence Comparison by Log-Expectation
• Better average accuracy and better speed than ClustalW2 or
T-Coffee
• An accurate MSA tool, especially good with proteins and suitable
for medium alignments.
• Aligns 5000 sequences with average length of 350.
• The MUSCLE algorithm includes
-fast distance estimation using k-mer counting.
-progressive alignment using a new profile
function called the log-expectation score.
-refinement using tree-dependent restricted partitioning.
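The first step, distance estimation by k-mer counting, needs no alignment at all. The sketch below shows one common form of the idea: count the k-mers shared by two unaligned sequences and turn the shared fraction into a distance. The exact formula, the default k = 3 and the function names are assumptions for illustration, not MUSCLE's actual implementation.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping k-mer in an unaligned sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(seq_a, seq_b, k=3):
    """1 minus the fraction of k-mers shared, relative to the shorter sequence."""
    shortest = min(len(seq_a), len(seq_b))
    if shortest < k:
        raise ValueError("sequences must be at least k residues long")
    # Counter '&' keeps the minimum count of each k-mer present in both.
    shared = sum((kmer_counts(seq_a, k) & kmer_counts(seq_b, k)).values())
    return 1.0 - shared / (shortest - k + 1)


print(kmer_distance("ACGTACGT", "ACGTACGA"))
```

Because only counting is involved, the cost grows with sequence length rather than with the product of the two lengths, which is why this kind of estimate suits large inputs.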
MUSCLE MSA interface at EBI (http://www.ebi.ac.uk/Tools/msa/muscle/)
1. Go to http://www.ebi.ac.uk/Tools/msa/muscle/ in your browser.
2. Enter your input sequences.
3. Click ‘Choose file’ to supply your input sequences in any valid format (GCG,
FASTA, EMBL, GenBank, PIR, NBRF or UniProtKB/Swiss-Prot format).
4. Supply the other input sequences in the same way (there is currently a limit of 500
sequences and 1 MB of data).
5. Click the ‘more options’ button to set the alignment options. Change the output
format to ClustalW to get the results in .aln format; the default value is
Pearson/FASTA [fasta]. Output Order sets the order in which the sequences appear in
the final alignment; the default value is ‘aligned’.
6. Click ‘Submit’.
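Before submitting, it can help to build the FASTA input locally and check it against the limits quoted in step 4. The sketch below does only that; it does not contact the EBI service. Reading "1MB" as 10**6 bytes is an assumption, and the function and constant names are hypothetical.

```python
MAX_SEQUENCES = 500      # limit quoted in the steps above
MAX_BYTES = 1_000_000    # "1MB", read here as 10**6 bytes (assumption)


def to_fasta(records, width=60):
    """Format (name, sequence) pairs as FASTA text with wrapped lines."""
    lines = []
    for name, seq in records:
        lines.append(f">{name}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def check_submission(records):
    """Return FASTA text, or raise ValueError if a quoted limit is exceeded."""
    if len(records) < 2:
        raise ValueError("a multiple alignment needs at least two sequences")
    if len(records) > MAX_SEQUENCES:
        raise ValueError(f"more than {MAX_SEQUENCES} sequences")
    text = to_fasta(records)
    if len(text.encode("ascii")) > MAX_BYTES:
        raise ValueError("input is larger than the quoted data limit")
    return text


print(check_submission([("seq1", "ACGT"), ("seq2", "ACGA")]))
```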
[Figures: MSA generated by MUSCLE; MSA generated by T-Coffee]
• Iterative methods
-rectify errors introduced during progressive alignment by repeatedly realigning
subgroups of the sequences,
-then align these subgroups into a global
alignment of all of the sequences.
-The major objective is to improve the overall
alignment score, such as a sum-of-pairs score.
• Selection of groups - based on the ordering of the sequences on
a phylogenetic tree.
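The sum-of-pairs score that iterative refinement tries to improve can be illustrated with a toy scoring scheme. The match, mismatch and gap values below are hypothetical placeholders; real tools score residue pairs with a substitution matrix and their own gap penalties.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Toy score for one pair of characters taken from the same column."""
    if a == "-" and b == "-":
        return 0          # gap against gap is ignored
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(aligned_seqs):
    """Add the pair scores over every column and every pair of sequences."""
    total = 0
    for column in zip(*aligned_seqs):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


print(sum_of_pairs(["AC", "AC", "AG"]))
```

A refinement step would realign a subgroup, recompute this score, and keep the new alignment only if the score went up.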
MAFFT MSA interface at EBI (http://www.ebi.ac.uk/Tools/msa/mafft/)
• Click on the ‘more options’ button to set the alignment options.
Change the output format to ‘clustalw’ (Default value is:
Pearson/FASTA [fasta]).
Matrix (Protein Only)
Protein comparison matrix to be used when adding sequences to the alignment.
Default value is: BLOSUM 62 [bl62]
Gap Open
Penalty for the first base/residue in a gap.
Default value is: 1.53
Gap Extension
Penalty for each additional base/residue in a gap.
Default value is: 0.123
Order
The order in which the sequences appear in the final alignment.
Default value is: aligned
Tree Rebuilding Number
Default value is: 1
Guide Tree Output
Generate guide tree file.
Default value is: ON [true]
Max Iterate
Maximum number of iterations to perform when refining the alignment.
Default value is: 0
Change the Max Iterate value to ‘2’ to run refinement iterations for a better
alignment.
Perform FFTS (Fast Fourier Transform)
Default value is: local pair
• Click ‘submit’.
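Taken literally, the Gap Open and Gap Extension descriptions above define an affine gap cost: the first position of a gap costs the opening penalty and every further position costs the extension penalty. The sketch below simply evaluates that reading with the default values listed; how MAFFT applies the penalties internally may differ, and the function name is hypothetical.

```python
def gap_penalty(length, gap_open=1.53, gap_extend=0.123):
    """Cost of one gap of `length` positions under an affine model."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    # First position pays gap_open, each additional one pays gap_extend.
    return gap_open + (length - 1) * gap_extend


for n in (1, 2, 5):
    print(n, round(gap_penalty(n), 3))
```

With these defaults, one long gap is much cheaper than several short ones of the same total length, which is the usual motivation for separating the two penalties.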
MSA generated by MAFFT
• The N-terminal alignment in the result generated by
MAFFT is similar to the one generated by T-Coffee.
• The alignment in the middle and at the C terminal is entirely
different from that of T-Coffee or MUSCLE.
• Though the alignments differ, conservation of amino acids at the
active site is retained.
Formats
Gaps in sequences
• In all EMBOSS alignment formats, gaps are indicated by the '-' character.
• Exception
-the msf format, which uses '.' as the gap character inside the sequences
and '~' as the gap character at the terminal ends of the alignment.
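When sequences read from an msf file are passed to code that expects '-' gaps, the two msf gap characters can be mapped onto '-' first. A minimal sketch, assuming the sequence text contains no other use of '.' or '~'; the function name is hypothetical.

```python
# Translation table mapping both msf gap characters to '-'.
MSF_GAPS = str.maketrans(".~", "--")


def normalise_gaps(seq):
    """Replace msf-style gap characters ('.' and '~') with '-'."""
    return seq.translate(MSF_GAPS)


print(normalise_gaps("~~AC..GT~"))
```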
Head and tail of the format
• The majority of the alignment formats (except those that are also standard sequence formats, like fasta or MSF) have a block of information at the start of
the alignment describing the program, date, output filename, ID names of the sequences and some of the parameters and statistics of the alignment.
########################################
# Program: demoalign
# Rundate: Thu Jan 17 09:30:08 2002
# Report_file: stdout
########################################
#=======================================
#
# Aligned_sequences: 4
# 1: IXI_234
# 2: IXI_235
# 3: IXI_236
# 4: IXI_237
# Matrix: EBLOSUM62
# Extend_penalty: -1
# Gap_penalty: 9
#
# Length: 131
# Identity: 95/131 (72.5%)
# Similarity: 127/131 (96.9%)
# Gaps: 25/131 (19.1%)
#
#
#=======================================
Source: Alignment Formats, http://emboss.sourceforge.net/docs/themes/AlignFormats.html (9/27/2016)
There is also a block of information at the end of the alignment for summary information. This is used by a few programs, e.g. merger.
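Because the header is made of "# Name: value" lines, it is easy to read back into a program. The sketch below collects those fields into a dictionary; it is an illustrative parser, not part of EMBOSS, and it deliberately skips the numbered sequence entries such as "# 1: IXI_234". The function name is hypothetical.

```python
import re

# A header field: '#', optional spaces, a name made of letters/underscores,
# a colon, then the value.
FIELD = re.compile(r"#\s*([A-Za-z_]+):\s*(.+)")


def parse_header(text):
    """Collect '# Name: value' header lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            info[match.group(1)] = match.group(2).strip()
    return info


sample = """########################################
# Program: demoalign
# Aligned_sequences: 4
# 1: IXI_234
# Length: 131
# Identity: 95/131 (72.5%)
"""
print(parse_header(sample))
```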
Length
The header block contains a line similar to:
# Length: 131
This is the length of the alignment, including any gaps that have been introduced to construct the alignment.
Identity
The header block contains a line similar to:
# Identity: 95/131 (72.5%)
This is a count of the number of positions over the length of the alignment where all of the residues or bases at that position are
identical. It is followed by '/131' the length of the alignment and '(72.5%)' the percentage of positions in the alignment where there are
identities.
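The Length and Identity values can be reproduced from the aligned sequences themselves. A minimal sketch follows, assuming '-' is the gap character and that a column consisting only of gaps does not count as an identity; the function name is hypothetical.

```python
def identity_stats(aligned_seqs):
    """Return (identical columns, alignment length, percentage)."""
    length = len(aligned_seqs[0])   # length includes gap positions
    identical = sum(1 for col in zip(*aligned_seqs)
                    if len(set(col)) == 1 and col[0] != "-")
    return identical, length, round(100.0 * identical / length, 1)


count, length, percent = identity_stats(["ACGT-", "ACGA-", "ACGTT"])
print(f"# Identity: {count}/{length} ({percent}%)")
```

The same arithmetic gives the figure in the example header: 95 identical positions out of 131 is 72.5% after rounding to one decimal place.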
Similarity
The header block contains a line similar to:
# Similarity: 127/131 (96.9%)
This is a count of the number of positions over the length of the alignment where >= 51% of the residues or bases at that position
are similar. Any two residues or bases are defined as similar when they have positive comparisons (as defined by the comparison
matrix being used in the alignment algorithm). It is followed by '/131' the length of the alignment and '(96.9%)' the percentage of
positions in the alignment where there are similarities. Note that the identity and similarity percentages can add up to more than 100%.
This is because the count of similar positions includes the count of identical positions; if residues are identical,
they must also be similar.
Gaps
The header block contains a line similar to:
# Gaps: 25/131 (19.1%)
This is a count of the number of positions over the length of the
alignment where there are one or more sequences with a gap.
It is followed by '/131' the length of the alignment and '(19.1%)' the
percentage of positions in the alignment where there are gaps.
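The Gaps line can be recomputed in the same spirit: count the columns in which at least one sequence has a gap and divide by the alignment length. The sketch assumes '-' as the gap character, and the function name is hypothetical.

```python
def gaps_line(aligned_seqs):
    """Format a '# Gaps:' header line for equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    gapped = sum(1 for col in zip(*aligned_seqs) if "-" in col)
    percent = 100.0 * gapped / length
    return f"# Gaps: {gapped}/{length} ({percent:.1f}%)"


print(gaps_line(["AC-T", "ACGT", "A--T"]))
```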
Score
The header block may contain a line similar to:
# Score: 100.0
This is the score used by the program that calculated the alignment to determine which is the best
possible alignment to report. The algorithm that was used to derive the score is not part of the alignment
formatting routines. You should see documentation about the relevant algorithm to see how the score is
derived.
Alignment Formats (MSA)
Alignment viewers/editors
Name | Integrated with Struct. Prediction Tools | Can Align Sequences | Can Calculate Phylogenetic Trees | Other Features | Formats Supported | License | Link