4. Alignment Notes
Note that in semi-global alignment, the alignment must start at the start of either the
ref or the query, and it must end at the end of either the ref or the query. This is unlike
local alignment, which does not need to start or end at the start or end of either
sequence.
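The difference between the two modes can be sketched with one dynamic programming table. This is a minimal sketch, not a production aligner: it assumes one common formulation of semi-global (overlap) alignment with free end gaps, and the function name `alignment_score` and the match/mismatch/gap values are illustrative choices, not taken from the slides.

```python
def alignment_score(ref, query, mode="semiglobal", match=1, mismatch=-1, gap=-1):
    """Best alignment score of query against ref (illustrative linear-gap scoring)."""
    rows, cols = len(query) + 1, len(ref) + 1
    # First row and first column stay 0 in both modes:
    # skipping a prefix of ref or of query costs nothing.
    score = [[0] * cols for _ in range(rows)]
    best_anywhere = 0
    for i in range(1, rows):
        for j in range(1, cols):
            step = match if query[i - 1] == ref[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + step,  # align query[i-1] with ref[j-1]
                score[i - 1][j] + gap,       # gap in ref
                score[i][j - 1] + gap,       # gap in query
            )
            if mode == "local":
                cell = max(cell, 0)          # local: may restart anywhere
            score[i][j] = cell
            best_anywhere = max(best_anywhere, cell)
    if mode == "local":
        return best_anywhere                 # local: may end anywhere
    # Semi-global: must end at the end of query (last row) or of ref (last column).
    return max(max(score[-1]), max(row[-1] for row in score))


# End of ref overlaps start of query: both modes find the "TTT" overlap.
print(alignment_score("ACGTTT", "TTTGGA", mode="semiglobal"))
# Match buried in the middle of both sequences: only local gets it for free.
print(alignment_score("GGACGTCC", "TTACGTAA", mode="local"))
print(alignment_score("GGACGTCC", "TTACGTAA", mode="semiglobal"))
```

In the second pair the shared "ACGT" sits in the middle of both sequences, so the semi-global score has to pay for the flanking mismatches while the local score does not.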
Extra Resources:
1. This video explains the “dynamic programming algorithm” in the minimum edit
distance problem rather than sequence alignment. It is very easy to follow and understand:
https://www.youtube.com/watch?v=b6AGUjqIPsA
2. This video explains the “dynamic programming algorithm” in the sequence
alignment problem:
https://www.youtube.com/watch?v=LhpGz5--isw
3. This is a MUST-READ review of the topic and of how things moved on to short-read
aligners:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5425171/
The Expect value (E value) is a parameter that describes the number of hits one can
"expect" to see by chance when searching a database of a particular size. It decreases
exponentially as the score (S) of the match increases. Essentially, the E value
describes the random background noise. For example, an E value of 1 assigned to a
hit can be interpreted as meaning that in a database of the current size one might
expect to see 1 match with a similar score simply by chance.
The lower the E value, or the closer it is to zero, the more "significant" the match is.
However, keep in mind that virtually identical short alignments have relatively high E
values. This is because the calculation of the E value takes into account the length of
the query sequence. These high E values make sense because shorter sequences have
a higher probability of occurring in the database purely by chance.
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
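The exponential dependence on the score can be made concrete with the Karlin-Altschul relation E = K·m·n·e^(−λS), where m is the query length and n is the database length. The sketch below is only an illustration: the function name `expect_value` is hypothetical, and the default λ and K are placeholder values of the kind reported for ungapped BLOSUM62 scoring; real searches use the parameters that match their own scoring system.

```python
import math


def expect_value(score, query_len, db_len, lam=0.3176, k=0.134):
    """E = K * m * n * exp(-lambda * S); lam and k are illustrative placeholders."""
    return k * query_len * db_len * math.exp(-lam * score)


# Same query and database, increasing raw score: E drops exponentially.
for s in (30, 40, 50):
    print(s, expect_value(s, query_len=100, db_len=1_000_000))
```

A short query can only reach a small maximum score, so even a virtually identical short alignment ends up with a comparatively high E value.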
Scoring matrices & gap penalties:
ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848038/
SAM specification: https://samtools.github.io/hts-specs/SAMv1.pdf
CRAM: https://en.wikipedia.org/wiki/CRAM_(file_format)
CRAM was designed to be an efficient reference-based alternative to the Sequence
Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally
uses a genomic reference to describe differences between the aligned sequence
fragments and the reference sequence, reducing storage costs (an approach similar to
the GTF file). Additionally, each column in the SAM format is separated into its own
blocks, improving the compression ratio. CRAM files are typically 30 to 60% smaller
than BAM, depending on the data held within them.
samtools tview align.bam all.fa
Why is it better to use uBAM rather than FASTQ? It compresses more easily and enables
BAM operations, e.g. merging 2 samples while keeping track of where every read came from.
Decoding SAM flags
https://broadinstitute.github.io/picard/explain-flags.html
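The same decoding can be sketched in a few lines, since the FLAG field is a sum of bit values defined in the SAM specification. The bit values below follow the specification; the short property names and the function name `decode_flag` are labels chosen for this sketch.

```python
# (bit value, label) pairs; bit values follow the SAM specification,
# labels are shorthand chosen for this example.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


print(decode_flag(99))   # 99 = 1 + 2 + 32 + 64
print(decode_flag(147))  # 147 = 1 + 2 + 16 + 128
```

Flags 99 and 147 are a typical pair for a properly paired read and its mate on opposite strands.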
CIGAR stands for Concise Idiosyncratic Gapped Alignment Report.
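A CIGAR string is a run of length/operation pairs, so it can be split with a small regular expression. The sketch below is an illustration under stated assumptions: the helper names (`parse_cigar`, `reference_span`, `query_length`) and the example string are hypothetical, while the operation letters and which of them consume reference or query bases follow the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_REF = set("MDN=X")    # operations that advance along the reference
CONSUMES_QUERY = set("MIS=X")  # operations that advance along the read


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Rebuild the string to reject anything the pattern skipped over
    # (this also rejects "*", which SAM uses when no CIGAR is available).
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REF)


def query_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


example = "3M1I3M1D5M"  # illustrative string: 3 aligned, 1 insertion, 3 aligned, 1 deletion, 5 aligned
print(parse_cigar(example))
print(reference_span(example), query_length(example))
```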
https://davetang.org/wiki/tiki-index.php?page=SAM
From Bowtie 2 manual:
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#getting-started-with-bowtie-2-lambda-phage-example