Novoalign & Novoaligncs Reference Manual: Bioinformatics Specialists
$$P(b, i) = -10 \log_{10}(\Pr(b, i))$$
Single Base Quality to Score Conversion
Sanger FASTQ, Solexa FASTQ, Colour Space reads and other read formats such as Phred have a
called base S(i) or colour and single quality score Q(i) at each position, i, in the read. The quality
value is converted to a probability, Pr(i) and then to a penalty P(S(i), i).
Novocraft Technologies Sdn Bhd
Solexa
$$\Pr(i) = \frac{10^{Q(i)/10}}{1 + 10^{Q(i)/10}}$$
Fastq or Phred
$$\Pr(i) = 1 - 10^{-Q(i)/10}$$
Alignment Penalty
$$P(S(i), i) = -10 \log_{10}(\Pr(i))$$
$$P(b \in \{A, C, G, T\} \setminus S(i),\ i) = -10 \log_{10}\left(\frac{1 - \Pr(i)}{3}\right)$$
Base Penalty Limit
For nucleotide alignments the penalties calculated above are further limited to a maximum of 30 at
any base position. For colour space alignments no limit is applied to the penalty for a colour error,
and a default penalty of 30 is used for SNPs.
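As a rough illustration of the conversion described above, the following Python sketch turns a single base quality into a match penalty and a mismatch penalty for nucleotide reads. The function names and the `scale` and `cap` parameters are assumptions made for this example, not Novoalign options, and the values are left unrounded because the manual does not state how Novoalign rounds internally.

```python
import math

def p_correct(q, scale="phred"):
    """Probability Pr(i) that the called base is correct, given quality q."""
    if scale == "solexa":
        odds = 10 ** (q / 10)
        return odds / (1 + odds)
    return 1 - 10 ** (-q / 10)  # Fastq or Phred scale

def _penalty(prob, cap):
    """-10*log10(prob), limited to cap (a zero probability gives the cap)."""
    if prob <= 0:
        return cap
    return min(cap, -10 * math.log10(prob))

def penalties(q, scale="phred", cap=30.0):
    """Return (match_penalty, mismatch_penalty) for one base position."""
    pr = p_correct(q, scale)
    match = _penalty(pr, cap)
    # The error probability is shared equally by the three other bases.
    mismatch = _penalty((1 - pr) / 3, cap)
    return match, mismatch

if __name__ == "__main__":
    for q in (3, 10, 40):
        print(q, penalties(q))
```

For a Phred quality of 10 this gives a match penalty of about 0.5 and a mismatch penalty of about 14.8.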
Base Quality to Penalty Table
The following table illustrates the conversion of base qualities to alignment score penalties. Other
factors affecting penalties include ambiguous IUPAC codes in the reference and quality calibration.
Note. Very low quality bases can contribute to alignment score even if they match the reference! It
is not possible in Novoalign to use the threshold parameter to control the number of mismatches
allowed in the alignments. The threshold sets a lower limit on the probability that the aligned
sequence could have generated the read.
Base Quality Match Penalty Mismatch Penalty
1 6 6
2 6 6
3 3 8
4 2 9
5 2 10
6 1 11
7 1 12
8 1 13
9 1 14
10 0 15
11 0 16
12 0 17
13 0 18
14 0 19
15 0 20
16 0 21
17 0 22
18 0 23
19 0 23
20 0 24
21 0 25
22 0 26
23 0 27
24 0 28
25 0 29
26 0 29
27 0 30
28 0 30
29 0 30
30 0 30
Alignment Score and Threshold
The alignment score is $-10\log_{10}(P(R \mid A_i))$, where P(R | Ai) is the probability of the read
sequence given the alignment location i.
A threshold of 75 would allow for alignment of reads with two mismatches at high quality base
positions plus one or two mismatches at low quality positions or to ambiguous characters in the
reference sequence.
If a threshold is not specified then Novoalign will calculate a threshold for each read (with a limit of
250) such that the chance of finding a false positive alignment is less than 0.001. A read that aligns
with a score equal to this default threshold therefore has an alignment quality of no more than 30.
That is, the iterative process of finding an alignment will terminate before finding a
low quality chance alignment.
4.3.2 Posterior Alignment Probabilities and Quality Scores
The posterior alignment probability calculation includes all the alignments found; the probability
that the read came from a repeat masked region or from any regions coded in the reference genome
as N's; and an allowance for a chance hit above the threshold based on the mutual information
content of the read and the genome.
A posterior alignment probability, P(Ai | R, G), is calculated as:
$$P(A_i \mid R, G) = \frac{P(R \mid A_i, G)}{P(R \mid N, G) + \sum_i P(R \mid A_i, G)}$$
where P(R|N,G) is the probability of finding the read by chance in any masked reference sequence
or any region of the reference sequence coded as N's, and where $\sum_i$ is the sum over all the
alignments found plus a factor for chance alignments calculated using the usable read and genome
lengths.
The P(R|N,G) term allows for the fact that a fragment could have been sourced from portions of the
genome that are not represented in the reference sequence. For instance in Human genome build 36
there is approximately 7% of sequence represented by large blocks of N's.
A quality score is calculated as $\min(70, -10\log_{10}(1 - P(A_i \mid R, G)))$, where P(Ai|R, G) is the
probability of the alignment given the read and the genome.
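The posterior probability and quality score calculation can be sketched as follows. This is only an illustration under stated assumptions: each alignment likelihood P(R|Ai,G) is recovered from its alignment score as 10^(-score/10), and `p_masked` (for P(R|N,G)) and `p_chance` (for the chance-alignment factor) are hypothetical inputs, since the manual does not say how Novoalign derives them.

```python
import math

def posterior_qualities(scores, p_masked=0.0, p_chance=0.0, max_quality=70):
    """Return a list of (posterior, quality) pairs, one per alignment score.

    scores   -- alignment scores, i.e. -10*log10(P(R|Ai,G))
    p_masked -- assumed value for P(R|N,G)
    p_chance -- assumed allowance for chance alignments
    """
    likelihoods = [10 ** (-s / 10) for s in scores]
    denominator = p_masked + p_chance + sum(likelihoods)
    results = []
    for likelihood in likelihoods:
        posterior = likelihood / denominator
        p_wrong = 1 - posterior
        if p_wrong <= 0:
            quality = max_quality
        else:
            quality = min(max_quality, -10 * math.log10(p_wrong))
        results.append((posterior, quality))
    return results

if __name__ == "__main__":
    # Two candidate locations whose scores differ by 10 points.
    for posterior, quality in posterior_qualities([0, 10]):
        print(round(posterior, 4), round(quality, 1))
```

With scores of 0 and 10 and no other terms, the best alignment has a posterior of about 0.91 and a quality of about 10.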
4.3.3 Adapter Stripping
Single End Reads - miRNA
Adapter stripping does a gapped global alignment of the adapter against the read and then trims the
read from the start of the optimum alignment.
A few details:
1. The read and base qualities are first converted to a weight matrix where each base will score
min(30, -10log10(P)) where P is the probability of the base. This results in a match scoring 0 and a
mismatch at a high quality base position scoring 30.
2. During adapter stripping we subtract 7 from the weights so at a high quality base position a
match scores -7 and a mismatch 23.
3. If the optimum alignment scores <= -7 it is stripped.
4. There are no penalties for unmatched letters at the beginning of the read or at the end of the
adapter.
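The weights in the list above can be checked with a small Python sketch. It assumes the per-base weight is capped at 30, as the stated match score of 0 and mismatch score of 30 imply; the function names are made up for this example and the gapped alignment itself is not shown.

```python
import math

def adapter_weight(p_base, cap=30.0, offset=7.0):
    """Weight of one cell: min(cap, -10*log10(P)) minus the stripping offset."""
    raw = cap if p_base <= 0 else min(cap, -10 * math.log10(p_base))
    return raw - offset

def should_strip(alignment_score, limit=-7.0):
    """The adapter is stripped when the optimum alignment scores <= limit."""
    return alignment_score <= limit

if __name__ == "__main__":
    print(adapter_weight(1.0))    # match at a high quality position
    print(adapter_weight(1e-6))   # mismatch at a high quality position
    print(should_strip(3 * adapter_weight(1.0)))
```

Under these assumptions a single high quality matching base is already enough to reach the stripping limit, while one high quality mismatch costs as much as roughly three matches gain.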
Paired End Reads - Short Fragments
If a DNA fragment is shorter than the read length then both reads of the pair will have extended
into adapter or primer sequence and, unless stripped off, this sequence will be used in alignment.
If there are only a few bases of adapter the read may still align but with some mismatches or indels
in the adapter portion of the alignment. This contributes to SNP noise and reduced consensus
quality.
When there are more than a few bases of adapter the read is unlikely to align. That in itself isn't a
problem, but the aligner will still have attempted alignments with possibly 8
mismatches and up to 7 indels. Attempting to align a read with so many mismatches can
consume considerable CPU time, so it's desirable to identify these reads before aligning them.
Novoalign identifies short fragments by aligning the two reads of a pair against each other to detect
overlap and adapter sequence. If overlap is detected then any adapter is trimmed from the two reads.
4.3.4 Amplicon Clipping
For targeted amplicon sequencing it is often desirable to exclude the amplicon primer sequence
from the variant calling process. To facilitate this Novoalign includes an option to soft clip primer
sequences from read alignments.
Reads are aligned using all the bases including the primer and then, post alignment, if the read
alignment maps to an amplicon then the primer bases are soft clipped from the alignments.
For paired end reads the 5' alignment locations of the reads are checked against primer locations and
trimmed accordingly. Then a check is done for each read to see if the read's 3' alignment overlaps with
the same primer trimmed from its mate's 5' end, and if so the 3' end is trimmed. This allows for
amplicons where the insert is shorter than the read.
In paired end mode we allow the two reads of the pair to align to primers of different amplicons.
In single end mode the 5' end of the read is checked for alignment to a primer and if so it's soft clipped. The 3'
alignment location is then checked for alignment to the other primer of the same amplicon and if so
it is soft clipped. This allows for reads shorter than the insert length.
Amplicon soft clipping is enabled by the option...
--amplicons amplicons.BED
Bed File Format
chrom Name of the chromosome
Illustration 1: Dynamic programming alignment of two paired-end reads with the first 12 bp of the
adapter sequence prepended in silico. A high-scoring diagonal identifies the amount of overlap and
adapter sequence present in the read. The false positive rate is low as reads must be complementary
and align to the adapter to get a good score.
[Figure panels: "Reads 100% overlap but with no adapter"; "Reads overlap with some adapter present"; "Reads are 100% adapter"; "Reads partially overlap". Axes: Adapter and Read2 against Read1.]
chromStart Start position of the amplicon (includes primer bases)
chromEnd End position of the amplicon
name Amplicon name if any
score ignored
strand + or -, ignored for now.
thickStart Start of amplicon excluding primer
thickEnd end of amplicon excluding primer
itemRgb ignored
Example
chr2 29083861 29084059 AMP.1 100 - 29083881 29084039
chr2 29085075 29085273 AMP.2 100 - 29085095 29085254
chr2 29089969 29090233 AMP.3 100 - 29089989 29090214
chr2 29091056 29091241 AMP.4 100 - 29091076 29091220
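To make the field layout concrete, here is a minimal Python sketch that reads one line of such a file and derives the primer lengths at each end. It assumes whitespace-delimited columns in the order listed above; the `Amplicon` type and function names are hypothetical and this is not how Novoalign itself parses the file.

```python
from typing import NamedTuple

class Amplicon(NamedTuple):
    chrom: str
    chrom_start: int   # start of amplicon, primer included
    chrom_end: int     # end of amplicon, primer included
    name: str
    strand: str
    thick_start: int   # start of amplicon excluding primer
    thick_end: int     # end of amplicon excluding primer

def parse_amplicon_line(line):
    """Parse one amplicon record; the score column (index 4) is ignored."""
    f = line.split()
    return Amplicon(f[0], int(f[1]), int(f[2]), f[3], f[5], int(f[6]), int(f[7]))

def primer_lengths(amplicon):
    """Return (left_primer_bp, right_primer_bp) implied by the thick columns."""
    left = amplicon.thick_start - amplicon.chrom_start
    right = amplicon.chrom_end - amplicon.thick_end
    return left, right

if __name__ == "__main__":
    amp = parse_amplicon_line("chr2 29083861 29084059 AMP.1 100 - 29083881 29084039")
    print(amp.name, primer_lengths(amp))
```

For AMP.1 in the example this gives 20 bp of primer at each end.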
At the end of the alignment process the counts of amplicon clipping events are printed to the
Novoalign log.
e.g.
# Amplicon Count SE5 SE3
# AMP.1 371 0 0
# AMP.2 190 0 0
There are three counters. The first is the number of hits where both reads of a pair aligned to primers of
the same amplicon. The next two count cases where read1 and read2 of a pair aligned to different
amplicons, or where one read of the pair failed to align.
4.3.5 Read Quality
Reads with too many low quality base positions will not be aligned. This is controlled by the -l
option, which effectively sets the minimum length, or minimum number of high quality base positions,
required for an alignment to be attempted. The read length calculation uses base qualities to
calculate the information content of the read.
Homopolymer reads are also deemed low quality and not aligned. These are fairly frequent in real
data and are possibly the result of dust on slides.
4.3.6 Reads with Multiple Alignments
There are times that reads will align to multiple locations with very similar alignment scores.
Situations where this might occur are reads originating from repeats and the alignment of very short
reads such as small RNA.
Depending on the user's project and objectives, reads with multiple alignments may or may not be of interest.
Every read will have multiple alignment locations, however the alignment scores could be very
different, so for detection of repeats Novoalign programs use the difference in score between the best
alignment and the rest of the alignments. This score difference is set by the '-R99' option and
defaults to 5, which corresponds to the best alignment being approximately 3 times more probable
than the next best alignment. For example, two alignments with probabilities 0.7 (score 1) and 0.3
(score 5) would be considered as multiple alignments to the read. Two alignments with
probabilities 0.8 (score 0) and 0.2 (score 7) would be treated as a unique alignment to the location
with the higher probability.
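The relationship between the score difference and the probability ratio can be sketched in Python. This is an illustration only: the function names are invented, and treating a difference exactly equal to the limit as a multiple alignment is an assumption, since the manual does not spell out the boundary case.

```python
def probability_ratio(score_difference):
    """How many times more probable the best alignment is than the other."""
    return 10 ** (score_difference / 10)

def is_multiple_alignment(best_score, next_best_score, max_difference=5):
    """True if the second alignment is close enough to count as a repeat."""
    return (next_best_score - best_score) <= max_difference

if __name__ == "__main__":
    print(round(probability_ratio(5), 2))
    print(is_multiple_alignment(1, 5))   # scores from the 0.7 / 0.3 example
    print(is_multiple_alignment(0, 7))   # scores from the 0.8 / 0.2 example
```

A difference of 5 corresponds to a ratio of about 3.16, which is the "approximately 3 times" quoted in the text.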
Having identified a read as having multiple alignment locations we then have several options for
reporting.
Option Description
None No alignments will be reported. The read will be reported as a status R with a
count of the number of alignments. No alignment locations will be reported.
Random A single alignment location is randomly chosen from amongst the alignment
results. The choice is made using posterior alignment probabilities.
All All alignment locations are reported. Note, that this is all alignments with a score
within 5 points of the best alignment unless you use the -R99 option to extend
the range.
Exhaustive This option bypasses the iterative alignment process and the normal repeat
alignment detection. It finds all alignments with a score no worse than the
threshold (-t 99 option) and reports all the locations.
4.3.7 Sequence file formats
Read files are introduced using the -f option. Novoalign examines the file name and the first few
lines of each file to determine the file format.
Licensed versions of Novoalign will also process read files compressed with gzip.
Format File Names Description and detection method
FASTA *.fa
*.fna
*.fasta
Standard FASTA format input file can be used. This file type is
recognised by the name matching *.fa, *.fna , or *.fasta or by the
first line starting with a '>' character. e.g.
>sequence_0
GATGTCACTCAGTATGAGAAAGAGGCAGGTTCTGGG
>sequence_1
ACACGCAGCGCCGCGCATGCTTGCGCCGCCACTCCA
>sequence_2
ACCTGCGCTCTGCCCTGAAACCACTGTTGGCTTGAG
Example:
novoalign -f reads.fa -d celegans
FASTA &
Quality
as above with
*.qual
If Novoalign detects a FASTA format read file it looks for a
matching *.qual file in the same folder. If found, it is
used for base qualities.
>sequence_0
40 40 40 40 40 40 40 40 40 40 40 40 14 40 40 40 40 40 40 40 40 40 25 40 40 40 40 40 40 5 40 8 9 21 40 4
>sequence_1
40 19 7 22 4 40 8 40 40 40 9 40 28 40 40 40 17 31 11 40 32 24 4 9 14 10 36 16 40 9 2 8 6 16 3 3
Sanger
FASTQ
*.fastq Sanger format FASTQ files are recognised by the file name
matching *.fastq. Quality scores are from ASCII code 33.
For non-standard filenames this format is detected by an '@'
character starting the first line and by a test on the quality codes of
the first read. Sanger fastq files are automatically detected as the
ASCII coded qualities are lower than for a Solexa format FASTQ
file.
Example:
novoalign -f reads.fastq -d celegans
Solexa
FASTQ
and
Illumina
FASTQ
*_sequence.txt Files produced by the Illumina pipeline using the Solexa variant of the
FASTQ format. Solexa quality scores are offset from ASCII code 64 ('@');
see the GERALD documentation for a full description. These files are
named like s_lane_sequence.txt and are recognised by matching the
file name against s_*_sequence.txt.
For non-standard filenames this format is detected by an '@'
character starting the first line and by a test on the quality codes of
the first read. Solexa FASTQ files are automatically detected as the
ASCII coded qualities are higher than for a Sanger format FASTQ
file.
Starting from Version 1.3 of the Illumina Casava Pipeline the
coding of quality values was changed to the Phred scale. If you are
using Pipeline 1.3 you may need to add the option -F ILMFQ.
This option treats quality codes as being coded as
$-10\log_{10}(P_{err}) + \text{'@'}$.
The old Solexa format is the default for _sequence.txt files and
interprets quality values according to the formula
$-10\log_{10}(P/(1-P)) + \text{'@'}$.
Illumina
Casava 1.8
FASTQ
*_sequence.txt New format introduced in Casava V1.8. Base qualities are
now coded in Sanger format.
The header also includes an 'is_filtered' field that is set to 'Y' if the
base caller has flagged the read as low quality (more details
below). By default low quality reads are skipped. Refer to the -F
command line option for further options.
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
BBBBCCCC?<A?BC?7@@???????DBBA!!!!A##
From seqanswers.com:
The Illumina is_filtered flag is based solely on the relative
intensity of the fluorescent signals. There are two methods
Illumina uses to calculate relative intensities called Chastity and
Purity. Chastity is defined as the ratio of the intensity of the most
intense base for a cluster divided by the sum of the most intense
plus the second most intense signal. Purity is defined as the ratio
of the most intense signal divided by the sum of all four
fluorescent signals. The default parameter used by GERALD when
filtering reads is CHASTITY >= 0.6. Stated another way (after
doing a little algebra) the most intense signal must be at least 1.5x
higher than the second most intense signal. Also, filter passing is
only based on the signals over the first 12 cycles. I am not sure
whether this means that the value must be >= 0.6 for each of those
12 cycles or that the average is >= 0.6.
This filter is designed to detect polyclonal clusters.
Solexa PRB *_prb.txt Illumina/Solexa prb file from the base calling program Bustard.
This file has quality values (probabilities) for each of the 4 bases at
each position in the read. This format is recognised by file name
matching s_*_prb.txt.
For non-standard filenames a prb format file is identified as having
a first line that consists only of digits, minus sign and whitespace.
Solexa PRB
& SEQ
as above with
*_seq.txt
If a prb file is detected by filename test then we look for the
corresponding seq file produced by Bustard base caller. This file
contains lane, tile and X,Y coordinates of the read which are then
used as the read sequence identifier. It is recognised by file name
s_*_seq.txt.
Illumina
QSEQ
*_qseq.txt Illumina qseq file format. e.g.
SOLEXA 90403 4 1 23 1566 0 1 ACCGCTCTCGTGCTCGTCGCTGCGTTGAGGCTTGCG `aaaaa```aZa^`]a``a``a]a^`a\Y^`^^]V` 1
The 7 fields before the read sequence are converted to a header by
prefixing with a '>' and substituting tabs with underscores '_'.
The last field is the Illumina quality flag, reads with a value 0 for
the flag will be skipped (default action)
ABI Solid
CSFASTA
*.csfasta
*_QV.qual
*.csfasta
>2_14_26_F3
T011213122200221123032111221021210131332222101
>2_14_192_F3
T110021221100310030120022032222111321022112223
*_QV.qual
>2_14_26_F3
24 24 22 27 23 10 13 13 20 19 19 18 24 20 22 12 14 5 20 17 14 20 18 17 19 11 21 19 13 13 12 25 9 19 19 6 5 12 20 13 11 8 12 7 14
>2_14_192_F3
14 19 21 13 24 17 18 18 25 21 8 12 21 8 7 11 14 7 19 23 11 24 7 11 29 12 28 17 7 19 7 11 5 11 5 14 13 9 24 8 7 20 0 8 9
CSFASTQ *.csfastq Colour Space FASTQ with primer base quality
@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
T32322133300002330031001022230020232002203222030231
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
!21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'
@SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
T01212120333223322020022322232232232222022232033230
+SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
!,*+*()+*(%'+)%%%&%+&%%'%%%%%%%%%%%%%%%%%%%%'+%%%%%
BFASTQ *.csfastq Colour Space FASTQ without primer base quality. Paired end
reads should be in two files, we do not support BFAST format
where pairs can be interleaved in a single file.
@SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
T32322133300002330031001022230020232002203222030231
+SRR015241.1 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_34_F3 length=50
21(()+%'+%40*.%%**)&%&*&%%%&%%%%%%%%%%%%%%%(+%%%%'
@SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
T01212120333223322020022322232232232222022232033230
+SRR015241.2 CLARA_20071207_2_CelmonAmp7797_16bit_26_88_269_F3 length=50
,*+*()+*(%'+)%%%&%+&%%'%%%%%%%%%%%%%%%%%%%%'+%%%%%
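Two of the format details in the table above lend themselves to a short Python sketch: decoding a FASTQ quality character under the three encodings described (Sanger, old Solexa, and Illumina pipeline 1.3), and reading the 'is_filtered' flag from a Casava 1.8 header. The encoding labels and function names below are assumptions chosen for this example; they are not Novoalign option names.

```python
def error_probability(ch, encoding):
    """Convert one quality character to an error probability.

    encoding -- "sanger"           : Phred scale, ASCII offset 33
                "illumina_phred64" : Phred scale, ASCII offset 64 ('@')
                "solexa64"         : Solexa scale, ASCII offset 64 ('@')
    """
    if encoding == "sanger":
        return 10 ** (-(ord(ch) - 33) / 10)
    q = ord(ch) - 64
    if encoding == "illumina_phred64":
        return 10 ** (-q / 10)
    if encoding == "solexa64":
        # Q = -10*log10(P/(1-P))  =>  P = 1 / (1 + 10^(Q/10))
        return 1 / (1 + 10 ** (q / 10))
    raise ValueError("unknown encoding: " + encoding)

def parse_casava18_header(header):
    """Split a Casava 1.8 header line into its id and flag fields."""
    if header.startswith("@"):
        header = header[1:]
    read_id, _, flags = header.partition(" ")
    read_no, is_filtered, control, index = flags.split(":")
    return {
        "id": read_id,
        "read": int(read_no),
        "is_filtered": is_filtered == "Y",
        "control": int(control),
        "index": index,
    }

if __name__ == "__main__":
    print(error_probability("I", "sanger"))
    print(error_probability("@", "solexa64"))
    print(parse_casava18_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG"))
```

The header used here is the one from the Casava 1.8 example; its 'Y' flag marks a read that would be skipped by default.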
4.3.8 Output Formats
Four output formats are provided.
1. Native
2. Extended Native
3. Pairwise
4. SAM
Native Report Format
The native format is designed to be compact, giving essential information necessary for downstream
processing. This is the default report format.
# novoalign (1.0) - short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation and educational Use Only
# noo!"#$n (. ))1#) (3 &&9&&9):8:01009):8:0100&3! (0 &&9&&9):8:01009):8:0100&01!"
# Index Build Version: 1.0
# Hash length: 11
# Step size: 1
# Interpreting input files as FASTA with Phred quality file.
A;8:100:293:551 S CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCACC ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; 2B
A;8:100:880:9C7 S TTATTATCTTTATTGACGTACCTCTAGAAGACCCAA ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;DA1 6
0 150 AS)1#) C20732 E & &
A;8:100:975:68C S AGTAGACACCTGGTGAACGAACCAACTGAGAAACGA ;;;;;;;;;;;;;;;;;;;;;;;;;;(E;;';;;;G 6
1 150 AS)1#) 1113C3 E & &
A;8:100:87C:727 S GTGAAAGCCAGCGTCTTTAGGCGCTGGGTGGTGGTG ;;;;;;;;;;;;;;;;;;;;;;;;;;;F;;;;;G59 E
C
A;8:100:2CC:639 S AACATAATTAGACAGAATATAAGATATGACTAATTC ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;9=2'; 6
1 150 AS)1#) 136C8C3 E & &
A;8:100:C92:8 S A22222222222222222222222222222222222 ;HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH IC
A;8:100:515:7C1 S GGAAATCACGGAGCAGGAGTTTCGTGAGCTTCGCCG ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;9; 6
C9 1C1 AS)1#) C290C2 F & & & 35GAC 36AAG
A;8:100:510:80C S AACCGACAGTTGCTTCGTCTACAATCACAATACCCG ;;;;;;;;;;;;;;;;C9;;JK;;L;;M;GM=89+0 6
5C 117 AS)1#) 1C99130 E & & & CCAG 9TAG 15TAG
A;8:100:188:601 S ACTACGTTCACAGAAAATCTAGCCTTTGTACTAGAC ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;9;;F 6
9 150 AS)1#) 1C5620 E & & & 1TAG
A;8:100:63:601 S AGCGGCAGGGCTTGTTCCAGCTAAGGCTCCGATTTT ;;;;;;;;;;;;;;K;;;;;;;;=;;FAFG;M;;;; 6
11C 57 AS)1#) 1997C59 E & & & 8TAG 9TAA 11TAC 22TAA 27TAC
A;8:100:331:271 S GGATTATGTGAAACAACATGCTGATGCACCGCTTAA ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;G;;@M 6
18 150 AS)1#) 188339C E & & & 5TAG
A;8:100:C08:93C S ATGATATTAGGTCCTATCTTACTTTTCTCAACCAAC ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;J 6
0 150 AS)1#) 580585 E & &
A;8:100:269:390 S GTGTTCCCAAACCTGCTGCAGGGATAACGGCTTTTT ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;8 6
28 150 AS)1#) 1853977 E & & & 1GAA
&&&
A;8:100:768:102 S AAACACATGGTGTTAT2AAACTCGCGACGTAGTCAT 'F';;HM;LG;9;'H'H;KN7;G;9H2K;ELL;MFL 2B
A;8:100:582:231 S TAAGCAAAAAACATAATTCCAGGATATGCAACCAGT KFFHM#M'LFFLLHLL##FKFLLF##HL'F#FLL#H IC
A;8:100:2C0:200 S AATAAAGCCTAAACAATGGACAAACAAACTACACAC :FLF9L#FMLMLLLMLH#HLMF#LFKFFF#LLLLKM IC
# Read Sequences: 10959
# Aligned: 9699
# Unique Alignment: 9442
# Gapped Alignment: 92
# Quality Filter: 273
# Homopolymer Filter: 5
# Elapsed Time: 19,046s
# Done.
Normally a read is printed on one line with a series of tab delimited fields. The fields are:
Field Description
Read Header The fasta or fastq header of the read sequence.
S, L or R S indicates this is an alignment for a single ended read.
For paired end reads
L indicates the read is from the first file.
R indicates the read is from the second file.
Read Sequence The read sequence.
Base Qualities Standard (Sanger) Fastq format base qualities, empty for fasta input unless
using quality calibration.
If quality calibration is used these are calibrated qualities.
Nucleotide Sequence For NovoalignCS only, this field is the decoded nucleotide sequence.
Aligned Base
Qualities
For NovoalignCS only, this field is the base qualities for the decoded
nucleotide sequences. This follows the BFAST & MAQ 0.7.1 convention
from BFAST Wiki (http://sourceforge.net/apps/mediawiki/bfast/index.php?title=Mapping_Quality).
For ABI SOLiD data, base qualities are calculated using the following
formula:
If the base is the last decoded base (last base sequenced), then the
base quality is equal to the colour quality of the last colour.
Else if the two colours observing the base are not called sequencing
errors, then the base quality is the sum of the two colour qualities.
Else if exactly one out of the two colours observing the base are
called sequencing errors, then the base quality is calculated from the
difference between the colour penalties of the
non-sequencing-error-colour and the sequencing-error-colour.
Else the base quality is zero.
Note that colour qualities are converted to alignment penalties before
alignment and then alignment penalties are converted back to base qualities
for the reported alignment.
Colour
Quality
Colour
Error Penalty
0 0
1 0
2 2
3 5
4 7
5 8
6 10
7 11
8 12
9 13
10 14
11 15
12 16
>=13 Quality + 5
5' trim count Count of bp trimmed from the 5' end of a read. Refer -5 command line
option. Only present in Extended Native format
3' trim count Count of bp trimmed from the 3' end of a read. Refer -a & -s command line
options. Only present in Extended Native format
Current versions of NovoalignCS do not support read trimming.
Status
Status Meaning
U A single alignment with this score was found.
R Multiple alignments with similar score were found.
QC The read was not aligned as its base qualities were too
low or it was a homopolymer read.
NM No alignment was found.
QL An alignment was found but it was below the quality
threshold.
Alignment Score This is the Phred format alignment score $-10\log_{10}(P(R \mid A_i))$.
For a status of 'R', when alignment locations are not being reported for repeats, this
field becomes the number of alignments to the read.
For paired end reads the alignment score includes the fragment length penalty.
Alignment Quality This is the Phred format alignment quality score $-10\log_{10}(1 - P(A_i \mid R, G))$
using Sanger fastq coding method.
Proper pair flag A value of 1 indicates that the read pair was aligned as a proper pair. Only
present in extended native format.
miRNA score Alignment score for adjacent opposite strand alignment. Optional, only
included in miRNA mode.
Aligned Sequence The fasta header of the aligned sequence. This is truncated at first space.
Aligned Offset The 1-based position of the alignment in the sequence.
Strand F/R Indicator of alignment direction.
Pair Sequence The fasta header of the sequence the read's pair was aligned to. For single
ended reads, or pairs where both ends aligned to the same sequence, this
field is set to '.'.
If a paired alignment that fits the fragment length distribution is not found
and we are reporting two individual alignments for the pair then the pair
alignment location is only reported if both alignments have an alignment
quality > 10.
Pair Offset The 1-based position of the alignment to the pair of this read. For single
ended reads this field is a '.'.
In miRNA mode we report the alignment location for adjacent opposite
strand alignment.
Pair Strand F/R Indicator of alignment direction of the pair of this read. '.' for single
ended reads.
Mismatches A list of mismatches and of bases inserted or deleted. Format is
'offset''refbase'>'readbase' where the offset is the 1-based position of the difference
relative to the 'Aligned Offset'.
Note. Offsets of mismatches are relative to the alignment location. They are
not the location of the mismatches in the read. This distinction is
important when the alignment contains indels and/or is soft clipped back to
the best local alignment.
Inserts are in format 'offset'+'insertedbases' and deletes in format
'offset'-'refbase'.
The mismatch list is space delimited.
A mismatch is only reported if the probability of the base is less than 0.16.
For fastq files this corresponds to a Perr of approximately 0.5.
When using soft clipping the number of bases soft clipped from the 5' (as
aligned) end of the alignment is reported using format 0x'n', and for 3' end
as 'offset'x'n' where n is the number of bases soft clipped.
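The Mismatches field described in the table above can be split into typed events with a few regular expressions. The Python sketch below is a reader-side illustration, not part of Novoalign; the tuple layout and the sample tokens are invented for the example, and deletions are allowed to carry more than one reference base as a precaution.

```python
import re

_MISMATCH = re.compile(r"^(\d+)([A-Za-z])>([A-Za-z])$")   # offset refbase > readbase
_INSERT = re.compile(r"^(\d+)\+([A-Za-z]+)$")              # offset + insertedbases
_DELETE = re.compile(r"^(\d+)-([A-Za-z]+)$")               # offset - refbase
_SOFTCLIP = re.compile(r"^(\d+)x(\d+)$")                   # offset x n

def parse_mismatches(field):
    """Return a list of event tuples from a space delimited mismatch list."""
    events = []
    for token in field.split():
        m = _MISMATCH.match(token)
        if m:
            events.append(("mismatch", int(m.group(1)), m.group(2), m.group(3)))
            continue
        m = _INSERT.match(token)
        if m:
            events.append(("insert", int(m.group(1)), m.group(2)))
            continue
        m = _DELETE.match(token)
        if m:
            events.append(("delete", int(m.group(1)), m.group(2)))
            continue
        m = _SOFTCLIP.match(token)
        if m:
            events.append(("softclip", int(m.group(1)), int(m.group(2))))
            continue
        raise ValueError("unrecognised mismatch token: " + token)
    return events

if __name__ == "__main__":
    print(parse_mismatches("0x3 12+AC 20-T 35G>C"))
```

Remember that every offset returned is relative to the Aligned Offset, not to the position in the read.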
Paired End Native Report Format
This example is for native format with good pairs found. The alignment score for one of the reads in
the pair will include the fragment length penalty. The quality score is based on the posterior
fragment alignment probability.
# novoalign (2.0) - short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation and educational Use Only
# noo!"#$n (. ))1#) (3 &&9&&9)#O"3,9):1:)-01-n5-&,<,
&&9&&9)#O+$,9):1:)-01-n5-&,<,
# Index Build Version: 1.0
# Hash length: 11
# Step size: 1
@S)1#):633667:633825:091 4 GCTCAATGACTATCCGCAGATTGAGGGGTTTCTGCT ;;;;;;;;;;;;;
;;;;G;;;;;%GD!FL3CD;A!!6 51 150 AS)1#) 633790 E & 633667 F
@S)1#):633667:633825:092 E GTCTGACTCATGGCTGTGCGAATGGCTTCTTCCCTA ;;;;;;;;;;;;(
;;;;;;;;;;;;;;;;;;0FF!!6 16 150 AS)1#) 633667 F & 633790 E
@S)1#):1657C28:1657600:191 4 AGTACGTGTCAATATCGTCCACTCTGCAGGTGGTCC ;;;;;;;;+;;;;
;;;;;;;;;;;;;;;;+CB(CF76 C2 150 AS)1#) 1657565 E & 1657C28 F 2CAG 7AAC
@S)1#):1657C28:1657600:192 E TGTAAATGATGCTGTGAAGACGTACTTCAACATCAT ;;;;;;;;;;;;;
;;;;B;;;;6;<';;;;;';+7%6 3 150 AS)1#) 1657C28 F & 1657565 E
@S)1#):973563:97372C:391 4 TTACCAAGCGTGGTAATCCCTACGCTAGAAAGATTC ;;;;;;;;;C;;;
;;;;;;;;;;;KFL2;;;(;;(2E 2
@S)1#):973563:97372C:392 E TGGCACCAATCGTGTGCAGCTTCGTTGAAGTCGTTT ;;;F!!F
+;;;;;;;;;;;;;;;;;;+;;;;;;G;; E 2
&&&
# Paired Reads: 2000
# Pairs Aligned: 2000
# Read Sequences: 4000
# Aligned: 4000
# Unique Alignment: 3940
# Gapped Alignment: 9
# Quality Filter: 0
# Homopolymer Filter: 0
# Elapsed Time: 0,313s
# Done.
This example is for native format when a good pair was not found. In this case both alignments were
on different chromosomes. The quality values reflect the quality of the individual end alignments.
@S4PA(EAS1:3C:FCC751:E1:1:1:53:21 4
TTGATGGATCAATTGTAGTTGCCTGCAATAAGAGG ??????????????????????7??:?????9+2L
6 23 150 A;;; 71970C0 E A;V
11532213 F 3GAT
@S4PA(EAS1:3C:FCC751:E1:1:1:53:21 E AATTGGAAGAGGACAGAAGAGATGA
JJJJJJJJJJJJJJJJJJJJJMJJ+ 6 1 93 A;V 11532213
F A;;; 71970C0 E
@S4PA(EAS1:3C:FCC751:E1:1:2:993:712 4
GTGCCTACCATTGTGATTCGACTATATACGCGCTC ???????8?8????????5?09?5?7?N%MM7F7G
6 6 150 A;V 59C3661 F A; C229259 E
@S4PA(EAS1:3C:FCC751:E1:1:2:993:712 E GGGAAAAGGTGCCAAAAAGTATAGA
<<<1<<<<<<<1<(99C(<31<3<( 6 0 9C A; C229259 E
A;V 59C3661 F
This example is for native format with multiple alignments to a read and using -r All option.
A8:100:1:16 4 TTACCAAGCGTGGTAATCCCTACGCTAGAAAGATTC ;;;;;;;;;C;;;;;;;;;;;;;;KF
L2;;;(;;(2 E 6 3 AS,+->,o5o551):)1#) 973563 F & 973689 E
A8:100:1:16 E TGGCBDCAATCGTGTGCAGCTTCGTTGAAGTCGTTT ;;;FH#F
+;;;;;;;;;;;;;;;;;;+;;;;;;G;; E C1 3 AS,+->,o5o551):)1#) 973689 E & 973563
F
A8:100:1:16 4 TTACCAAGCGTGGTAATCCCTACGCTAGAAAGATTC ;;;;;;;;;C;;;;;;;;;;;;;;KF
L2;;;(;;(2 E 6 3 AS,+->,o5o551):)1#) 1717310 E & 171718C F
A8:100:1:16 E TGGCBDCAATCGTGTGCAGCTTCGTTGAAGTCGTTT ;;;FH#F
+;;;;;;;;;;;;;;;;;;+;;;;;;G;; E C1 3 AS,+->,o5o551):)1#) 171718C F & 171
7310 E
Pairwise Report Format
Pairwise format has some similarity to Blast and is designed to be easily read. To use this report
format add the option -oPairwise to the command line.
I1-+8J@;41nQno/n:1nQno/n:8:100:35:698
4-n$,*J36
A4;G2BE2TS
AS,+->,o5o551):)1#)
4-n$,*J2007C91
S5o+-J0G I1!"#,8J150
S,+!n.JB#n1)9@"1)
I1-+8 36 ATTTTATACTCATATTTTTATATTGTCAATCATATA 1
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
SST5, 19C5571 ATTTTATACTCATATTTTTATATTGTCAATCATATA 19C5606
I1-+8J@;41nQno/n:1nQno/n:8:100:293:551
4-n$,*J36
2o )#$n#3#5!n, )#O#"!+#,8 3o1n.&
I1-+8J@;41nQno/n:1nQno/n:8:100:605:15
4-n$,*J36
A4;G2BE2TS
AS,+->,o5o551):)1#)
4-n$,*J2007C91
S5o+-J8CG I1!"#,8J21
S,+!n.J@"1)9@"1)
I1-+8 1 TATG2AG22AA2A2ATTCGATTC222T2T2T2T222 36
RRRR RR RR R RRRRRRRRR R R R R
SST5, 11C56 TATGTAGCTAATAAATTCGATTCTAATTTTTATCAA 11C91
I1-+8J@;41nQno/n:1nQno/n:8:100:87C:727
4-n$,*J36
EE@EATG C A4;G2BE2TS
Paired End Pairwise Report Format
The pairwise (Blast like) output format includes a pair header. The details of the pairwise format
depend on whether the alignment process found a pair or whether it is reporting individual
alignments.
In this example, both paired reads aligned to a fragment that fit the fragment distribution.
# novoalign (2.0) - short read aligner with qualities.
# (C) 2008 NovoCraft
# Licensed for evaluation and educational Use Only
# noo!"#$n (o @ (. ))1#) (3 )#O"3,9):1:)-01-n5-&,<, )#O+$,9):1:)-01-n5-&,<,
# Index Build Version: 1.0
# Hash length: 11
# Step size: 1
@!#+ I1-+81J@S,+->,o5o551):)1#):633667:633825:091
I1-+82J@S,+->,o5o551):)1#):633667:633825:092
A4;G2ED @A;ES:
@!#+ A"#$nO-n,%1' AS,+->,o5o551):)1#) 633667<(A633790 S5o+-J(67 I1!"#,8J 150
I1-+8J@S,+->,o5o551):)1#):633667:633825:092
4-n$,*J36
AS,+->,o5o551):)1#)
4-n$,*J2007C91
S5o+-J16G I1!"#,8J150
S,+!n.J@"1)9@"1)
I1-+8 1 GTCTGACTCATGGCTGTGCGAATGGCTTCTTCCCTA 36
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
SST5, 633667 GTCTGACTCATGGCTGTGCGAATGGCTTCTTCCCGG 633702
I1-+8J@S,+->,o5o551):)1#):633667:633825:091
4-n$,*J36
AS,+->,o5o551):)1#)
4-n$,*J2007C91
S5o+-J51G I1!"#,8J150
S,+!n.JB#n1)9@"1)
I1-+8 36 AGCAGAAACCCCTCAATCTGCGGATAGTCATTGAGC 1
RRRRR R RRRRRRRRRRRRRRRRRRRRRRRRRR
SST5, 633790 GTCAGAATCACCTCAATCTGCGGATAGTCATTGAGC 633825
&&&
@!#+ I1-+81J@S,+->,o5o551):)1#):1362887:1363089:75391
I1-+82J@S,+->,o5o551):)1#):1362887:1363089:75392
A4;G2ED @A;ES:
@!#+ A"#$nO-n,%1' AS,+->,o5o551):)1#) 136305C<(A1362887 S5o+-J(35 I1!"#,8J 150
I1-+8J@S,+->,o5o551):)1#):1362887:1363089:75392
4-n$,*J36
AS,+->,o5o551):)1#)
4-n$,*J2007C91
S5o+-J3G I1!"#,8J150
S,+!n.JB#n1)9@"1)
I1-+8 36 AAAATCCTCACGAATTTTTCGATTTGGATAATATTT 1
RRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
SST5, 136305C AAAATCCTCACGAATTTTTCGATTTGGATAATATTT 1363089
I1-+8J@S,+->,o5o551):)1#):1362887:1363089:75391
4-n$,*J36
AS,+->,o5o551):)1#)
4-n$,*J2007C91
S5o+-J32G I1!"#,8J150
S,+!n.J@"1)9@"1)
I1-+8 1 ACGATACCTGTTAAGGCAGTCGGGAATAGAATTTAC 36
RRRRRRRRRRRRRRRRRRRRRRR RRRRRRRRRR R
SST5, 1362887 ACGATACCTGTTAAGGCAGTCGGTAATAGAATTTTC 1362922
# Paired Reads: 2000
# Pairs Aligned: 2000
# Read Sequences: 4000
# Aligned: 4000
# Unique Alignment: 3940
# Gapped Alignment: 9
# Quality Filter: 0
# Homopolymer Filter: 0
# Elapsed Time: 0,318s
# Done.
In this example a paired alignment could not be found so alignments to individual reads were
reported. The second read of the pair failed to align.
@!#+ I1-+81J@22:698981C:698998C:6!91 I1-+82J@22:698981C:698998C:6!92
2o )#$n#3#5!n, >!#+) 3o1n.G +->o+,#n$ #n.##.1!" !"$nO-n,)&
I1-+8J@22:698981C:698998C:6!91
4-n$,*J25
A4;G2BE2TS
A22
4-n$,*J10058659
S5o+-J0G I1!"#,8J58
S,+!n.J@"1)9@"1)
I1-+8 1 GGGCTCAGCGCTCTTCCTAAGCGGC 25
RRRRRRRRRRRRRRRRRRRRRRRRR
SST5, 6989880 GGGCTCAGCGCTCTTCCTAAGCGGC 698990C
I1-+8J@22:698981C:698998C:6!92
4-n$,*J25
2o )#$n#3#5!n, )#O#"!+#,8 3o1n.&
SAM Report Format
SAM report format is for use with SAMtools; to select it, add the option -oSAM to the command line.
The report format is documented as part of the SAM/BAM specification at
http://samtools.sourceforge.net/
The standard tags Novoalign can add to SAM alignments are...
Tag Default Description
AM On The smallest template-independent mapping quality of other segments in the read. Only for
multi-template reads.
AS On Alignment score generated by Novoalign.
CC On Reference name of the next hit; `=' for the same chromosome. Only present if read has multi-mappings
reported
CM On Edit distance between the color sequence and the color reference. Only for colour space alignments.
CP On Leftmost coordinate of the next hit. Only present if read has multi-mappings reported
CQ On Color read quality on the original strand of the read. Same encoding as QUAL; same length as CS. Only
for colour space alignments. If using -k, this holds the calibrated colour qualities.
CS On Color read sequence on the original strand of the read. The primer base must be included. Only for colour
space alignments.
LB On Library. This is extracted from the LB tag of the @RG record and is redundant.
HI On Query hit index, indicating the alignment record is the i-th one stored in SAM. Only present if there is
more than one alignment reported for the read.
IH On Number of stored alignments in SAM that contains the query in the current record. Only present if there
is more than one alignment reported for the read.
MD On String for mismatching positions. Regex: [0-9]+(([A-Z]|\^[A-Z]+)[0-9]+)*
NH On Number of reported alignments that contains the query in the current record. Only present if there is
more than one alignment reported for the read.
Note. In our interpretation of the SAM specification we have taken this as the count of alignments that
would be reported if there were no limit imposed by the -r option, so this is the number of alignments
found within the quality range (-R), or within the threshold limits for -r Exhaustive. The IH tag is the
number of alignments that were stored in the SAM file.
NM On Edit distance to the reference, including ambiguous bases but excluding clipping
OQ Off Original base quality. Useful with quality calibration (-k option)
PG On Program. Matches the @PG tag which is automatically added by Novoalign*
PQ On Phred likelihood of the template, conditional on both the mapping being correct. Only for
multi-template reads.
PU On Platform unit. Value to be consistent with the header RG-PU tag if @RG is present. This appears to be
redundant.
RG On Read group. Value matches the header RG-ID tag if @RG is present in the header.
SM On Template-independent mapping quality. i.e. The mapping quality if mapped as a single end read. Only
present for multi-template reads.
Whilst Novoalign attempts to ensure this value is approximately correct, there are no guarantees as to its
accuracy.
UQ On Phred likelihood of the segment, conditional on the mapping being correct
Novoalign also adds several custom tags...
Tag Type Default Description
ZB Z On For Bi-Seq alignments indicates which index was used to align the read. This can be used to
separate alignments by original strand of the DNA fragment.
Value Meaning
CT The C/T index was used for alignment. This means the fragment was from
the 5' -3' strand of the chromosome.
GA The G/A index was used. This means the fragment was from the 3'-5' strand
of the chromosome.
ZH i On Hairpin score for miRNA alignment (-m option)
ZL i On In miRNA mode (-m option) this is the alignment location for adjacent opposite strand
alignment.
ZO Z On Indicates long or short insert fragment for mate pair alignments when short insert has been
enabled.
Value Meaning
'+-' Indicates pair was aligned as a short insert fragment.
'-+' Pair was aligned as a long insert fragment.
This tag is only present for Illumina mate pairs when a short fragment length size has been
specified with the -i option and the reads are aligned as a proper pair.
ZS Z On Novoalign alignment status. Not present for unique alignments.
Status Meaning
NM No alignment was found.
QC The read was not aligned as its base qualities were too low or it was a
homopolymer read.
QL An alignment was found but it was below the quality threshold.
R Multiple alignments with similar score were found.
Z3 i Off 3' mapping location. Only reported if option --3Prime is used.
Z5 i Off Mapping location of the first base of the last template in this read. Z5, ZQ & ZR should be
enabled to use the duplicate read detection option in Novosort. Only present if differs
from PNEXT.
ZQ i Off Quality score for all templates in the read, higher is better.
ZR Z Off Mapped reference sequence name for the last template in this read. Only present if it differs
from RNEXT.
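As a rough illustration of how the tags above can be consumed downstream, the following sketch splits the optional fields of one SAM record into a dictionary. It relies only on the generic SAM layout (11 mandatory tab-separated columns followed by TAG:TYPE:VALUE fields). The function name `parse_sam_tags` and the sample record are hypothetical and invented for illustration; they are not Novoalign output.

```python
def parse_sam_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict."""
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    for field in fields[11:]:  # the first 11 columns are mandatory SAM fields
        tag, typ, value = field.split(":", 2)
        # Integer-typed tags (type 'i') are converted; everything else stays a string.
        tags[tag] = int(value) if typ == "i" else value
    return tags


# Hypothetical record, made up for this example only.
record = "\t".join([
    "read1", "0", "chr2", "100", "3", "10M", "*", "0", "0",
    "ACGTACGTAC", "IIIIIIIIII", "AS:i:30", "ZS:Z:R", "ZB:Z:CT",
])
tags = parse_sam_tags(record)

# ZS is not present for unique alignments, so a missing tag is treated as unique here.
status = tags.get("ZS", "unique")
```

In this sketch a record without a ZS tag is reported as "unique", following the table entry that ZS is not present for unique alignments.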
SAM tags can be enabled or disabled using the --tags option. A leading '-' disables a tag, and ALL operates on every tag.
Examples
novoalign --tags "Z3 Z5 ZQ -LB -PU"
novoalign --tags "-ALL Z3 Z5 ZQ"
When using SAM report format the run headers and statistics normally output as part of Native
format reports are now written to stderr.
# noo!"#$n (oSAB (. ))1#) (3 &&9&&9):8:01009):8:0100&3! (0 &&9&&9):8:01009):8:0100&01!"
# ;n.-< B1#". V-+)#on: 1&0
# =!)* "-n$,*: 11
# S,-> )#?-: 1
# ;n,-+>+-,#n$ #n>1, 3#"-) !) FASTA /#,* @*+-. 01!"#,8 3#"-&
# E-!. S-01-n5-): 10959
# A"#$n-.: 9699
# 6n#01- A"#$nO-n,: 9CC2
# G!>>-. A"#$nO-n,: 92
# I1!"#,8 F#",-+: 273
# =oOo>o"8O-+ F#",-+: 5
# E"!>)-. T#O-: 19G0C6)
# Don-&
4.4 Paired End Alignment Mode
4.4.1 Scoring
Novoalign aligns paired reads against a reference genome using qualities and ambiguous nucleotide
codes. The scoring system is based on Phred quality scores, and the score for a paired alignment is
$-10\log_{10}(P(F \mid A_i))$, where $P(F \mid A_i)$ is the probability that the fragment read by the sequencer
originated from the alignment location $A_i$.
A paired alignment score comprises three parts: a Needleman-Wunsch alignment score for each end
of the pair, in the form $-10\log_{10}(P(R \mid A_i))$, and a fragment length penalty, in the form
$-10\log_{10}(P(l \mid F))$, calculated from the fragment length distribution, F.
A posterior alignment score or quality is also given and is $-10\log_{10}(1 - P(A_i \mid R, G, F))$, where
$P(A_i \mid R, G, F)$ is the probability of the alignment location given the read, R; the genome, G; and the
fragment length distribution, F. For paired end reads the quality score is limited to not more than
150.
The setting of gap penalties and threshold is similar to single end Novoalign.
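The arithmetic behind these scores can be sketched as follows. This is only an illustration under stated assumptions, not Novoalign's implementation: the three parts are simply summed (they are log-scaled probabilities), the posterior is computed over a list of candidate scores standing in for the whole genome with a uniform prior over locations, and the function names `paired_score` and `posterior_quality` are hypothetical.

```python
import math


def paired_score(read1_score, read2_score, fragment_penalty):
    """Sum the three -10*log10 parts of a paired alignment score."""
    return read1_score + read2_score + fragment_penalty


def posterior_quality(scores, best_index=0, cap=150):
    """Phred-scaled posterior quality of one candidate among several.

    Assumption: each score is -10*log10 of a likelihood and every candidate
    location is equally likely a priori.
    """
    likelihoods = [10 ** (-s / 10) for s in scores]
    total = sum(likelihoods)
    others = sum(l for i, l in enumerate(likelihoods) if i != best_index)
    p_wrong = others / total
    if p_wrong <= 0:
        return cap  # no competing location: quality is limited by the cap
    return min(cap, round(-10 * math.log10(p_wrong)))
```

With candidate scores of 30 and 60, for example, the better candidate gets a quality of about 30, and a lone candidate is capped at 150.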
4.5 Alignment process
With paired end reads Novoalign can have "proper fragments" and pairs that don't fit the fragment
model.
The alignment process works as follows:
For Read1 Novoalign uses a seeded alignment process to find alignment locations each with a
Read1 alignment score. For each good location found Novoalign does a Needleman-Wunsch
alignment of the second read against a region starting from the Read1 alignment and extending 6
standard deviations beyond mean fragment length. The best alignment for Read2 will define the pair
score for Read1/Read2. All the alignments are added to a collection for Read1.
This process is repeated using Read2 seeded alignment and then N-W for Read1, creating a
collection of Read2/Read1 pairs. There are very likely duplicates amongst the two collections.
Novoalign then decides whether there is a "proper pair" or not. To do this a structural variation
penalty is used as follows.
Novoalign has a proper pair if the score of the best pair (Read1/Read2 or Read2/Read1 combined
score including fragment length penalty) is less than the structural variation penalty (default 70)
plus best single-end Read1 score plus best single-end Read2 score.
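The proper pair test described above amounts to a single comparison. A minimal sketch, assuming all scores are on the same -10log10 scale (lower is better); the function and parameter names are hypothetical, and only the default penalty of 70 comes from the text:

```python
def is_proper_pair(best_pair_score, best_read1_score, best_read2_score,
                   sv_penalty=70):
    """Return True when pairing is more probable than a structural variation.

    best_pair_score is the combined score of the best pair, including the
    fragment length penalty; the other two are the best single-end scores.
    """
    return best_pair_score < sv_penalty + best_read1_score + best_read2_score
```

For example, with single-end scores of 20 and 30 a pair scoring 100 is accepted as a proper pair, while a pair scoring 130 is not.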
If Novoalign has a proper pair, Read1/Read2 & Read2/Read1 lists are combined, removing
duplicates and sorting by alignment score. At this point Novoalign has a list of one or more proper
pair alignments. This list is passed to reporting which can report one or more alignments depending
on the options.
If there wasn't a proper pair then Novoalign reports alignments to each read in single end mode and
the reporting options will decide whether Novoalign reports one or more alignments.
The result of the paired search can be two paired alignments where the pairing is more probable
than a structural variation, or it can be two individual alignments, one to each read of the pair.
Given the threshold, gap penalties and reads it is quite possible for Novoalign to find alignments
with gaps in both ends of the reads. There are no design restrictions that prevent this type of result
and it depends only on the scoring parameters and threshold.
4.6 Bisulphite Mode
Bisulphite mode requires building a double index: the first uses a hash table with all Cs translated
to Ts, and the second a hash table with all Gs translated to As for fragments off the complementary
strand.
Memory utilisation for the index may be higher in bisulphite mode than in normal mode as there are now
two hash tables. Novoindex will choose k and s values that allow the index to fit in RAM if
possible. You can reduce memory further by increasing s or decreasing k.
Alignment is done iteratively, gradually increasing error tolerance until a match is found. Each round
of iteration will align the read in forward and reverse complement against the CT and the GA index.
During CT alignment, Cs in the read are translated to Ts for hash lookup; then, during alignment, Ts
in the read can align to a T or a C in the reference sequence with no penalty. The process is then
repeated for the GA alignment.
Scoring for alignments is similar to normal alignment scoring, with the difference that a T in the read can
align to a C in the reference without any penalty (or an A to a G for GA index alignments). This means
that methylation status does not affect the alignment score.
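The translation and the penalty-free T/C (or A/G) matching can be illustrated with a small sketch. It is deliberately simplified and is not how Novoalign is implemented: it handles ungapped, equal-length sequences only and ignores base qualities, and the function names are hypothetical.

```python
def ct_translate(seq):
    """Translate every C to T, as done for CT index hash lookup."""
    return seq.upper().replace("C", "T")


def ga_translate(seq):
    """Translate every G to A, as done for GA index hash lookup."""
    return seq.upper().replace("G", "A")


def penalised_positions(read, ref, index="CT"):
    """Count ungapped positions that would attract a mismatch penalty.

    For the CT index a T in the read may align to a C in the reference for
    free; for the GA index an A in the read may align to a G for free.
    """
    free = ("T", "C") if index == "CT" else ("A", "G")
    count = 0
    for read_base, ref_base in zip(read.upper(), ref.upper()):
        if read_base == ref_base or (read_base, ref_base) == free:
            continue
        count += 1
    return count
```

Note the asymmetry: a read T over a reference C is free in CT mode, but a read C over a reference T is still counted as a mismatch.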
In addition there is a command line option, -u, to impose a penalty on unconverted cytosines at CHG
and CHH positions. If specified, each unconverted cytosine at a CHG or CHH position in a read will
be penalised, thus biasing alignment in favour of methylated CGs.
The low level of non-CpG methylation in vertebrates and the incomplete bisulphite conversion of
unmethylated cytosines should be factored into selecting this value. As a rough guide, a penalty can
be worked out as follows:
Let $P_{UC}$ be the probability that a non-methylated cytosine is not converted, $P_{CG}$ the probability that a
cytosine at CpG is methylated, and $P_{CH}$ the probability that a cytosine at a CHG or CHH position is
methylated. Then the probability of reading a cytosine at a CG position is:
$$P(C \mid CG) = P_{CG} + (1 - P_{CG}) \cdot P_{UC}$$
and the probability of reading a C at a CHN position is:
$$P(C \mid CH) = P_{CH} + (1 - P_{CH}) \cdot P_{UC}$$
We can then convert to log (Phred) scale and calculate a penalty as:
$$\text{Penalty} = -10\log_{10}(P(C \mid CH)) + 10\log_{10}(P(C \mid CG))$$
Applying values from Ramsahoye et al. [6] for Drosophila,
$P_{CG} = 62\%$, $P_{CH} = 3\%$ (derived) and $P_{UC} = 1\%$:
$$\begin{aligned}
\text{Penalty} &= -10\log_{10}(0.03 + 0.97 \times 0.01) + 10\log_{10}(0.62 + 0.38 \times 0.01) \\
&\approx -10\log_{10}(0.04) + 10\log_{10}(0.62) \\
&\approx 14 - 2 \\
&= 12
\end{aligned}$$
[6] Ramsahoye BH, Biniszkiewicz D, Lyko F, Clark V, Bird AP, Jaenisch R. Non-CpG methylation is prevalent in
embryonic stem cells and may be mediated by DNA methyltransferase 3a. Proc. Natl Acad. Sci. USA (2000)
97:5237-5242.
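The rough guide above is easy to re-evaluate for other organisms or conversion rates. A minimal sketch of the same calculation, using the three probabilities as defined in the text; the function name `unconverted_c_penalty` is hypothetical and the result is only a starting point for choosing a -u value:

```python
import math


def unconverted_c_penalty(p_cg, p_ch, p_uc):
    """Phred-scaled penalty for an unconverted C at a CHG/CHH position.

    p_cg: probability that a cytosine at CpG is methylated
    p_ch: probability that a cytosine at CHG or CHH is methylated
    p_uc: probability that a non-methylated cytosine is not converted
    """
    p_c_given_cg = p_cg + (1 - p_cg) * p_uc
    p_c_given_ch = p_ch + (1 - p_ch) * p_uc
    return -10 * math.log10(p_c_given_ch) + 10 * math.log10(p_c_given_cg)


# Values used in the worked example in the text.
penalty = unconverted_c_penalty(p_cg=0.62, p_ch=0.03, p_uc=0.01)
```

With these inputs the unrounded penalty is a little under 12, which agrees with the rounded hand calculation.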
As mentioned above, using a penalty for unconverted cytosines at CHG and CHH positions will
slightly bias alignment in favour of methylated CG sites. This will mainly have an effect when there
are multiple alignment sites with similar scores.
Novoalign will switch to bisulphite alignment mode whenever a bisulphite index is used.
4.6.1 Bisulphite Report Format
The differences to the output format are:
1) An indication of whether the CT or GA index was used for the alignment. This is reported before
the mismatches and delimited from them by a space.
2) Mismatches caused by unmethylated cytosines are shown with a hash '#' rather than a greater
than '>' symbol, e.g. 5C#T indicates that a C in the reference aligns to a T in the read and may be
an unmethylated cytosine that was converted to uracil by bisulphite treatment. Similarly,
6G#A indicates that a cytosine on the complementary strand was unmethylated and hence
appears in the read as an A.
The mismatch list does not show methylated cytosines as they match the reference sequence.
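A short sketch of how the index indicator and the mismatch tokens described above might be parsed in a post-processing script. It covers only simple substitutions of the form shown in the text (for example 5C#T or an ordinary mismatch written with '>'); insertions and deletions are not handled, and the function names and dictionary keys are hypothetical.

```python
import re

# <number><reference base><'#' or '>'><read base>, e.g. 5C#T
TOKEN = re.compile(r"^(\d+)([A-Z])([#>])([A-Z])$")


def parse_mismatch(token):
    """Split one substitution token into its parts."""
    match = TOKEN.match(token)
    if match is None:
        raise ValueError(f"unsupported mismatch token: {token!r}")
    number, ref_base, separator, read_base = match.groups()
    return {
        "pos": int(number),
        "ref": ref_base,
        "read": read_base,
        # '#' marks a possible bisulphite conversion rather than a true mismatch
        "bisulphite": separator == "#",
    }


def parse_bisulphite_field(field):
    """Parse 'CT 10C#T 13C#T' into the index used and a list of mismatches."""
    index, *tokens = field.split()
    return index, [parse_mismatch(token) for token in tokens]
```

Separating '#' tokens from '>' tokens in this way makes it straightforward to count candidate unmethylated cytosines without mixing them with sequencing mismatches.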
@5*+2:98C67308:98C673CC:191 S CGGTATTGTAGAATAGTGTATATTAATGAGTTATAA
CBC??(@@BBBBBBB@@@@@BB??D??G62C7092& 6 0 15 A5*+2
98560C53 E & & & GA 7G#A
@5*+2:115989213:1159892C9:191 S CGGTTTATTTTTTTTGGGGAATAGATTAAGTTTAAT
CCCCC(CCCCCCCJJ(?775BBBAABCJB899DD9A 6 0 107 A5*+2
1160793C8 F & & & CT 10C#T 13C#T
@5*+2:C8CC0862:C8CC0898:191 S CGGATATGTTATTTTAGGAGAAAAGAGGAAAAAATT
CCCCCJCCCCCCCCDD?BBC@CCCC@?2::<<022B 6 CC 23 A5*+2
C85788CC E & & & GA 2TAA 9TAC 15G#A
@
4.7 Quality Calibration
Quality calibration is the process of re-evaluating base qualities using the actual counts of
mismatches from alignments. The calibration in Novoalign is base specific which means two things:
1. We keep mismatch counts based on the actual base called so we can detect situations where,
say, T is overcalled and likely to be wrong but calls of A, C and G are likely to be correct.
2. Rather than count mismatches we maintain counts for each of the bases aligned. This
allows us to detect situations where a wrong call of, say, a T is more likely to be an A than a
C. We can then calculate base-specific mismatch penalties for each base at each position in a
read.
These counts are used to calculate an actual mismatch probability or penalty as a function of: the
position in the read; the as-called base quality; the base called; and the base aligned. The
empirical mismatch probability is then used in the Novoalign alignment process in place of the
as-called base quality to set penalties for the alignment dynamic programming.
Categories used for counting mismatches are:
The read within the pair (0 for first read, 1 for second read)
The base position in the read, zero based.
The as called quality
The base or colour called
For each combination, Novoalign maintains the count of the number of alignments to each of the
four bases, $M_A$, $M_C$, $M_G$ and $M_T$. Only ungapped alignments with a quality >= 60, or >= 70 for paired
end, are used to count mismatches.
The first step in the process of calculating calibrated qualities for each category involves binning
counts across read length and quality values. Binning helps to increase the counts and to smooth
fluctuations. Bins are 5 bases long and have a variable number of quality values. At low qualities bins
take a single quality value, in the mid range bins are 3 quality values wide, and above a quality of 30
they are 5 wide. There is a bin for each base position and quality value, so mismatch counts get
added to multiple overlapping bins; this design eliminates edge effects between bins.
The second step involves adding priors to the count of calls and mismatches. Use of a prior helps
stabilise calibrated quality values when counts are low. The prior is a minimum value for mismatch
count and if the actual mismatch count is below the prior then we add extra mismatches to bring the
count up to the prior and then a corresponding number of extra matches based on the as called
quality. Unaligned reads (status NM) are also added to the priors as examples of correct base and
quality calls.
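One way to read the prior step is sketched below. It is an interpretation under stated assumptions, not the actual Novoalign code: the prior value is a made-up parameter, and the "corresponding number of extra matches" is taken to be the number that makes the added counts agree exactly with the as-called quality. The function name `add_prior` is hypothetical.

```python
def add_prior(matches, mismatches, called_quality, prior_mismatches=1.0):
    """Top up the counts of one bin so that mismatches reach the prior.

    Extra matches are added in the ratio implied by the as-called Phred
    quality, so an empty bin starts out reproducing the called quality.
    """
    if mismatches >= prior_mismatches:
        return matches, mismatches  # enough data, the prior is not needed
    extra_mismatches = prior_mismatches - mismatches
    p_error = 10 ** (-called_quality / 10)
    extra_matches = extra_mismatches * (1 - p_error) / p_error
    return matches + extra_matches, mismatches + extra_mismatches
```

Under this reading an empty Q20 bin becomes roughly 99 matches and 1 mismatch, an error rate of 1 in 100, which is why the calibration starts off neutral.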
Novoalign then calculates four base penalties, $P_I = -10\log_{10}(M_I / N)$ for I in [ACGT], where $M_I$ is the
number of times an alignment matched base I and N is the total calls for this bin. The penalties are
used in the dynamic programming alignment.
A Phred scaled quality value is also calculated as $P = -10\log_{10}(M / N)$, where M is the total
mismatches and N the total calls for the bin. This calibrated quality value is used in the report for
the base qualities.
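The two formulas can be tried out on a single bin with a few lines of code. This is a sketch only; the counts are invented for illustration, the handling of a zero count (an infinite penalty) is an assumption, and the function names are hypothetical.

```python
import math


def base_penalties(counts):
    """Penalty P_I = -10*log10(M_I / N) for each aligned base I in one bin."""
    total = sum(counts.values())
    return {
        base: (-10 * math.log10(m / total) if m > 0 else float("inf"))
        for base, m in counts.items()
    }


def calibrated_quality(counts, called_base):
    """Phred quality -10*log10(M / N), where M is the mismatch total."""
    total = sum(counts.values())
    mismatches = total - counts[called_base]
    if mismatches == 0:
        return float("inf")
    return -10 * math.log10(mismatches / total)


# Invented counts for a bin where the called base is T.
counts = {"A": 10, "C": 1, "G": 9, "T": 980}
penalties = base_penalties(counts)
quality = calibrated_quality(counts, "T")
```

In this invented bin a wrongly called T is ten times more likely to be an A than a C, so aligning it to an A costs a penalty of about 20 while aligning it to a C costs about 30.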
For colour space quality calibration we only track the number of correct calls and colour errors for
each category. Calibrated penalties are specific to colour called, position in read and quality called,
but not to the substituted colour.
4.7.1 Using Quality Calibration
Quality calibration works for read files in the following formats:
Solexa & Illumina FASTQ
Sanger FASTQ
FASTA (every base is assumed to have a starting quality of 30)
FASTA with separate quality file
CSFASTA (without a quality file, a colour quality of 20 is assumed)
CSFASTQ
BAM
Quality calibration does not work with prb files.
The simplest way to use quality calibration is to add the option -k to the Novoalign command
line. This turns on calibration based on actual alignments. The calibration will start
off neutral as a result of the priors and gradually, as more alignments are added, it will
shift to reflect the actual mismatch counts.
Novoalign also has the ability to save the mismatch count data and then use this as input to the
calibration of a following run of Novoalign. Scenarios where this might be used include:
Using mismatch counts from phiX lane to calibrate another lane
Running an initial Novoalign at a low threshold to get mismatch statistics for use in a
following run, possibly at a higher threshold. This would remove some startup effects
from a single pass run.
Operation is controlled by two command line options:
-k [infile] Enables quality calibration. The quality calibration data (mismatch counts) are
either read from the named file or accumulated from actual alignments. Default
is no calibration.
Note. Quality calibration does not work with reads in prb format.
-K [file] Accumulates mismatch counts for quality calibration by position in the read and
called base quality. Mismatch counts are written to the named file after all reads
are processed. When used with -k option the mismatch counts include any counts
read from the input quality calibration file.
These two options can be used in several combinations:
-k
    Turns on calibration with mismatch counting. Effects of calibration can be
    seen after a few thousand reads have been aligned. Calibration data is
    recalculated periodically as more reads are aligned.
-k infile
    Turns on calibration with mismatch counts read from infile. Mismatch
    counts from alignments are not used.
-K outfile
    Turns on mismatch counting without calibration. At the end of the run the
    mismatch counts are written to the outfile ready for use as input in another
    run.
-k -K outfile
    Turns on calibration with mismatch counting. At the end of the run the
    mismatch counts are written to the outfile ready for use as input in another
    run. Calibration table is recalculated periodically as more reads are aligned.
-k infile -K outfile
    Turns on calibration and mismatch counting. Initial mismatch counts are
    loaded from infile, new alignments are added to the counts, and then at the
    end of the run the mismatch counts are written to the outfile ready for use as
    input in another run. Calibration table is recalculated periodically as more
    reads are aligned.
Quality Calibration and Novoalign Reports
There is no change to the report format. For Novoalign, the quality string displayed is now the
calibrated qualities. For NovoalignCS the calibrated colour qualities are not displayed; they are
used internally during alignment as colour error penalties and then used to calculate base qualities.
For Novoalign SAM format you can use the option rOQ to add original quality tag OQ:Z:qualities
An R script 'qcalplot.R' that can produce charts of empirical quality for the reads from the mismatch
file is included with the release.