
4.Alignment Notes

The document discusses various types of sequence alignment, including local and semi-global alignments, and introduces the Expect value (EValue) as a measure of significance in sequence matches. It also covers the CRAM file format as a more efficient alternative to SAM and BAM formats for storing sequence alignment data. Additional resources and tools for sequence alignment and analysis, such as Bowtie 2 and scoring matrices, are provided.

Uploaded by

Mahmoud Atef
Note that in semi-global alignment, the alignment must start at the start of the
reference or the query, and it must end at the end of the reference or the query.
This is unlike local alignment, which does not need to start or end at the start or
end of either sequence.
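
The difference between the two modes can be sketched with a small dynamic programming scorer. This is only an illustration under assumed scores (match +1, mismatch -1, linear gap -2, none of which come from these notes), and the function name align_score is hypothetical. In both modes the first row and column are zero; local alignment additionally floors every cell at zero and takes the best cell anywhere, while semi-global takes the best cell in the last row or last column.

```python
def align_score(ref, query, mode="local", match=1, mismatch=-1, gap=-2):
    """Return the best alignment score in 'local' or 'semiglobal' mode.

    Scores are illustrative assumptions, not values from the notes.
    """
    rows, cols = len(query) + 1, len(ref) + 1
    # Row 0 and column 0 stay at 0 in both modes:
    # semi-global does not penalise leading end gaps,
    # local alignment may start anywhere.
    H = [[0] * cols for _ in range(rows)]
    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if query[i - 1] == ref[j - 1] else mismatch)
            up = H[i - 1][j] + gap      # gap in the reference
            left = H[i][j - 1] + gap    # gap in the query
            cell = max(diag, up, left)
            if mode == "local":
                cell = max(cell, 0)     # a local alignment may restart anywhere
                best_local = max(best_local, cell)
            H[i][j] = cell
    if mode == "local":
        return best_local
    # Semi-global: the alignment must run to the end of the query (last row)
    # or to the end of the reference (last column).
    return max(max(H[-1]), max(row[-1] for row in H))


local = align_score("CACGTC", "GACGTG", mode="local")
semi = align_score("CACGTC", "GACGTG", mode="semiglobal")
```

Here the local score covers only the shared ACGT core, whereas the semi-global score also has to pay for the mismatching first and last bases, so it is lower.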

Extra Resources:
1. This video explains the "dynamic programming algorithm" for the minimum edit
distance problem rather than sequence alignment. It is very easy to follow and understand.
https://www.youtube.com/watch?v=b6AGUjqIPsA
2. This video explains the "dynamic programming algorithm" for the sequence
alignment problem.
https://www.youtube.com/watch?v=LhpGz5--isw
3. This is a MUST-READ review of the topic and of how the field moved to short-read
aligners.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5425171/
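
For readers who prefer code to video, the minimum edit distance problem mentioned in the first resource can be sketched as follows. This is a generic textbook formulation with unit costs for insertion, deletion and substitution (an assumption, since the notes do not fix the costs), and it is not taken from either video.

```python
def edit_distance(a, b):
    """Minimum number of single-character edits turning string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


distance = edit_distance("kitten", "sitting")
```

Sequence alignment uses the same table-filling idea, but maximises a score built from match, mismatch and gap values instead of minimising a count of edits.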

The Expect value (E-value) is a parameter that describes the number of hits one can
"expect" to see by chance when searching a database of a particular size. It decreases
exponentially as the Score (S) of the match increases. Essentially, the E-value
describes the random background noise. For example, an E-value of 1 assigned to a
hit means that, in a database of the current size, one might expect to see 1 match
with a similar score simply by chance.
The lower the E-value, or the closer it is to zero, the more "significant" the match is.
However, keep in mind that virtually identical short alignments have relatively high
E-values. This is because the calculation of the E-value takes into account the length of
the query sequence. These high E-values make sense because shorter sequences have
a higher probability of occurring in the database purely by chance.
https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
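
The exponential dependence on the score can be illustrated with the Karlin-Altschul form E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below is only illustrative: the values of K and lambda are placeholder assumptions (real values depend on the scoring matrix and gap penalties), and the function name expect_value is hypothetical.

```python
import math


def expect_value(score, query_len, db_len, K=0.13, lam=0.32):
    """Expected number of chance hits with at least this raw score.

    K and lam are illustrative placeholders, not parameters from the notes.
    """
    return K * query_len * db_len * math.exp(-lam * score)


# A short, perfect match can only reach a low score, so its E-value stays high
# compared with a long, high-scoring match against the same database.
short_hit = expect_value(score=20, query_len=20, db_len=1_000_000)
long_hit = expect_value(score=100, query_len=200, db_len=1_000_000)
```

With these assumed parameters, short_hit is far larger than long_hit even though both matches could be perfect, which is the effect described above for short alignments.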

Scoring matrices & gap penalties:
ftp://ftp.ncbi.nlm.nih.gov/blast/matrices/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848038/

https://samtools.github.io/hts-specs/SAMv1.pdf

CRAM: https://en.wikipedia.org/wiki/CRAM_(file_format)
CRAM was designed to be an efficient reference-based alternative to the Sequence
Alignment Map (SAM) and Binary Alignment Map (BAM) file formats. It optionally
uses a genomic reference to describe differences between the aligned sequence
fragments and the reference sequence, reducing storage costs (an approach similar to
the GTF file). Additionally, each column in the SAM format is separated into its own
blocks, improving the compression ratio. CRAM files are typically 30 to 60% smaller
than BAM, depending on the data held within them.
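
The idea of storing only differences from a reference can be sketched with a toy encoder. This is not the CRAM codec and does not follow its on-disk layout; it is a minimal illustration that handles substitutions only, and the names encode_read and decode_read are hypothetical.

```python
def encode_read(reference, pos, read):
    """Store a read as (position, length, list of mismatching bases)."""
    diffs = [(i, base) for i, base in enumerate(read) if reference[pos + i] != base]
    return pos, len(read), diffs


def decode_read(reference, record):
    """Rebuild the read from the reference plus the stored differences."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTAC"
record = encode_read(reference, 2, "GTTC")  # one substitution relative to the reference
restored = decode_read(reference, record)
```

A read that matches the reference exactly needs no stored bases at all, which is where most of the saving comes from.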

More about CRAM:
https://genome.ucsc.edu/goldenPath/help/cram.html
http://samtools.github.io/hts-specs/CRAMv3.pdf

samtools tview align.bam all.fa

Why is it better to use uBAM (unaligned BAM) rather than FASTQ? uBAM is easier to
compress and enables BAM operations, e.g. merging two samples while keeping track of
where every read came from.

Decoding SAM flags
https://broadinstitute.github.io/picard/explain-flags.html

Convert a decimal number (e.g. 8) to binary:

echo 'obase=2;8' | bc
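
The same decoding can be sketched in Python. The bit values and their meanings follow the SAM specification; the function name decode_flag and the short labels are choices made for this illustration only.

```python
# Bit values as defined in the SAM specification; labels are shortened here.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper pair"),
    (0x4, "unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "reverse strand"),
    (0x20, "mate reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary"),
    (0x200, "QC fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]


binary = format(8, "b")      # same conversion as the bc command
labels_8 = decode_flag(8)
labels_99 = decode_flag(99)  # 99 = 1 + 2 + 32 + 64
```

Each set bit in the binary form corresponds to one property of the read, so a flag is simply the sum of the bit values that apply.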

CIGAR stands for Concise Idiosyncratic Gapped Alignment Report.
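
A CIGAR string is a run of length-plus-operation pairs, for example 3M1I3M1D5M. The sketch below splits such a string and derives how many reference and read bases it covers, using the operation letters from the SAM specification; the function names are hypothetical and the example string is made up for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Reject strings containing anything the pattern did not consume.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered (operations M, D, N, = and X)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of read bases described (operations M, I, S, = and X)."""
    return sum(length for length, op in parse_cigar(cigar) if op in "MIS=X")


ops = parse_cigar("3M1I3M1D5M")
```

For this example the insertion adds a read base without consuming the reference, and the deletion does the opposite, so both totals come out at 12.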

https://davetang.org/wiki/tiki-index.php?page=SAM

From Bowtie 2 manual:
http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml#getting-started-with-bowtie-2-lambda-phage-example

-k mode: search for one or more alignments, report each


In -k mode, Bowtie 2 searches for up to N distinct, valid alignments for each read,
where N equals the integer specified with the -k parameter. That is, if -k 2 is specified,
Bowtie 2 will search for at most 2 distinct alignments. It reports all alignments found,
in descending order by alignment score. The alignment score for a paired-end
alignment equals the sum of the alignment scores of the individual mates. Each
reported read or pair alignment beyond the first has the SAM 'secondary' bit (which
equals 256) set in its FLAGS field. Supplementary alignments will also be assigned a
MAPQ of 255.
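
The effect of the 'secondary' bit can be sketched by splitting SAM records on it. The records below are made up for illustration and are not real Bowtie 2 output; the function name split_by_secondary is hypothetical. Only the FLAG column (the second tab-separated field) is inspected.

```python
SECONDARY = 0x100  # 256, the SAM 'secondary' bit


def split_by_secondary(sam_lines):
    """Return (primary, secondary) lists of read names from SAM text lines."""
    primary, secondary = [], []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        (secondary if flag & SECONDARY else primary).append(name)
    return primary, secondary


# Hypothetical records: read1 has a primary and a secondary alignment.
records = [
    "@HD\tVN:1.6",
    "read1\t0\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII",
    "read1\t256\tchr1\t500\t255\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tchr1\t900\t42\t4M\t*\t0\t0\tACGT\tIIII",
]
primary, secondary = split_by_secondary(records)
```

With -k greater than 1, a read can therefore appear on several lines, and the 256 bit is what separates the first reported alignment from the rest.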
