Position Weight Matrix
A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences.

PWMs are often derived from a set of aligned sequences that are thought to be functionally related, and have become an important part of many software tools for computational motif discovery.

1 Background

The position weight matrix was introduced by American geneticist Gary Stormo and colleagues in 1982[1] as an alternative to consensus sequences. Consensus sequences had previously been used to represent patterns in biological sequences, but had difficulties in the prediction of new occurrences of these patterns.[2]

2 From Sequences to PWM

A PWM has one row for each symbol of the alphabet: 4 rows for nucleotides in DNA sequences or 20 rows for amino acids in protein sequences. It also has one column for each position in the pattern. The first step in constructing a PWM is to create a basic position frequency matrix (PFM) by counting the occurrences of each nucleotide at each position. From the PFM, a position probability matrix (PPM) is then created by dividing each nucleotide count at each position by the number of sequences, thereby normalising the values. Formally, given a set $X$ of $N$ aligned sequences of length $l$, the elements of the PPM $M$ are calculated as:

$$M_{k,j} = \frac{1}{N} \sum_{i=1}^{N} I(X_{i,j} = k),$$

where $i = 1, \ldots, N$ indexes the sequences, $j = 1, \ldots, l$ indexes the positions, $k$ is a symbol of the alphabet, and $I(X_{i,j} = k)$ is an indicator function that equals 1 if $X_{i,j} = k$ and 0 otherwise.

For an example set of ten aligned DNA sequences, the corresponding PFM is:
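$$
\begin{matrix} A \\ C \\ G \\ T \end{matrix}
\begin{bmatrix}
3 & 6 & 1 & 0 & 0 & 6 & 7 & 2 & 1 \\
2 & 2 & 1 & 0 & 0 & 2 & 1 & 1 & 2 \\
1 & 1 & 7 & 10 & 0 & 1 & 1 & 5 & 1 \\
4 & 1 & 1 & 0 & 10 & 1 & 1 & 2 & 6
\end{bmatrix}.
$$

Dividing each count by the number of sequences ($N = 10$) gives the PPM:

$$
M =
\begin{matrix} A \\ C \\ G \\ T \end{matrix}
\begin{bmatrix}
0.3 & 0.6 & 0.1 & 0.0 & 0.0 & 0.6 & 0.7 & 0.2 & 0.1 \\
0.2 & 0.2 & 0.1 & 0.0 & 0.0 & 0.2 & 0.1 & 0.1 & 0.2 \\
0.1 & 0.1 & 0.7 & 1.0 & 0.0 & 0.1 & 0.1 & 0.5 & 0.1 \\
0.4 & 0.1 & 0.1 & 0.0 & 1.0 & 0.1 & 0.1 & 0.2 & 0.6
\end{bmatrix}.
$$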
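The counting and normalisation steps can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation: the function names `position_frequency_matrix` and `position_probability_matrix` are hypothetical, and the ten sequences are illustrative input assumed for the example.

```python
from collections import Counter

ALPHABET = "ACGT"


def position_frequency_matrix(sequences, alphabet=ALPHABET):
    """Count how often each symbol occurs at each position (the PFM)."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    # One Counter per column; symbols that never occur count as 0.
    columns = [Counter(seq[j] for seq in sequences) for j in range(length)]
    return {k: [column[k] for column in columns] for k in alphabet}


def position_probability_matrix(pfm, n_sequences):
    """Divide every count by the number of sequences N (the PPM)."""
    return {k: [count / n_sequences for count in row] for k, row in pfm.items()}


# Illustrative input (an assumption): ten aligned DNA sequences of length nine.
sequences = [
    "GAGGTAAAC",
    "TCCGTAAGT",
    "CAGGTTGGA",
    "ACAGTCAGT",
    "TAGGTCATT",
    "TAGGTACTG",
    "ATGGTAACT",
    "CAGGTATAC",
    "TGTGTGAGT",
    "AAGGTAAGT",
]

pfm = position_frequency_matrix(sequences)
ppm = position_probability_matrix(pfm, len(sequences))

for k in ALPHABET:
    print(k, pfm[k], ppm[k])
```

Each row of the result corresponds to one symbol and each list index to one position, mirroring the row and column layout described above.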
When the PWM elements are calculated using log likelihoods, the score of a sequence can be calculated by
adding (rather than multiplying) the relevant values at
each position in the PWM. The sequence score gives an
indication of how different the sequence is from a random sequence. The score is 0 if the sequence has the
same probability of being a functional site and of being
a random site. The score is greater than 0 if it is more
likely to be a functional site than a random site, and less
than 0 if it is more likely to be a random site than a functional site.[5] The sequence score can also be interpreted
in a physical framework as the binding energy for that
sequence.
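The additive scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions: the PWM entries are taken to be base-2 log-odds against a uniform background of 0.25 per nucleotide, zero probabilities are mapped to negative infinity because no pseudocounts are used, and the three-position PPM and the names `log_odds_pwm` and `score` are hypothetical.

```python
import math


def log_odds_pwm(ppm, background):
    """Convert each probability p to log2(p / b); p == 0 becomes -inf."""
    return {
        k: [math.log2(p / background[k]) if p > 0 else float("-inf") for p in row]
        for k, row in ppm.items()
    }


def score(sequence, pwm):
    """Add up the PWM entry selected by the symbol at each position."""
    length = len(next(iter(pwm.values())))
    if len(sequence) != length:
        raise ValueError("sequence length must equal the number of PWM columns")
    return sum(pwm[symbol][j] for j, symbol in enumerate(sequence))


# Hypothetical three-position PPM; every column sums to 1.
ppm = {
    "A": [0.5, 0.25, 0.125],
    "C": [0.25, 0.25, 0.125],
    "G": [0.125, 0.25, 0.5],
    "T": [0.125, 0.25, 0.25],
}
background = {k: 0.25 for k in "ACGT"}  # uniform background (an assumption)
pwm = log_odds_pwm(ppm, background)

print(score("AAG", pwm))  # 1 + 0 + 1 = 2.0, more site-like than random
print(score("CCT", pwm))  # 0 + 0 + 0 = 0.0, equally likely under both models
print(score("TCA", pwm))  # -1 + 0 - 1 = -2.0, more random-like than site-like
```

Because the entries are logarithms, summing them is equivalent to multiplying the underlying probability ratios, which is why the sign of the total indicates which model is more likely.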
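Most often the elements in PWMs are calculated as log likelihoods. That is, the elements of the PPM are transformed using a background model $b$ so that:

$$M_{k,j} = \log_2 (M_{k,j} / b_k).$$

This equation describes how an element in the PWM (left), $M_{k,j}$, can be calculated from the corresponding element of the PPM (right). The simplest background model assumes that each letter appears equally frequently in the dataset. That is, $b_k = 1/|k|$ for all symbols in the alphabet, where $|k|$ is the number of symbols (0.25 for nucleotides and 0.05 for amino acids). Applying this transformation to the PPM $M$ from above (with no pseudocounts added) gives:

$$
M =
\begin{matrix} A \\ C \\ G \\ T \end{matrix}
\begin{bmatrix}
0.26 & 1.26 & -1.32 & -\infty & -\infty & 1.26 & 1.49 & -0.32 & -1.32 \\
-0.32 & -0.32 & -1.32 & -\infty & -\infty & -0.32 & -1.32 & -1.32 & -0.32 \\
-1.32 & -1.32 & 1.49 & 2.0 & -\infty & -1.32 & -1.32 & 1.0 & -1.32 \\
0.68 & -1.32 & -1.32 & -\infty & 2.0 & -1.32 & -1.32 & -0.32 & 1.26
\end{bmatrix}.
$$

The self-information of a particular element of the PWM, where $p_{i,j}$ is the probability of symbol $i$ at position $j$, is:

$$-\log(p_{i,j})$$

The expected (average) self-information of a particular element in the PWM is then:

$$-p_{i,j} \log(p_{i,j})$$

Finally, the information content (IC) of the PWM is the sum of the expected self-information of every element:

$$-\sum_{i,j} p_{i,j} \log(p_{i,j})$$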
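The sum over all elements can be evaluated directly from a matrix of probabilities. The sketch below is illustrative only: it assumes base-2 logarithms (the text does not fix a base), treats 0 · log 0 as 0, and uses a hypothetical three-position PPM. The names `information_content` and `relative_entropy` are not taken from any library; the second function is the variant that measures each probability against a background frequency for its letter.

```python
import math


def information_content(ppm):
    """Return -sum over i,j of p_ij * log2(p_ij), skipping zero entries."""
    return -sum(p * math.log2(p) for row in ppm.values() for p in row if p > 0)


def relative_entropy(ppm, background):
    """Return sum over i,j of p_ij * log2(p_ij / b_i) for background b."""
    return sum(
        p * math.log2(p / background[k])
        for k, row in ppm.items()
        for p in row
        if p > 0
    )


# Hypothetical three-position PPM; every column sums to 1.
ppm = {
    "A": [0.5, 0.25, 0.125],
    "C": [0.25, 0.25, 0.125],
    "G": [0.125, 0.25, 0.5],
    "T": [0.125, 0.25, 0.25],
}
uniform = {k: 0.25 for k in "ACGT"}  # assumed background frequencies

print(information_content(ppm))        # 1.75 + 2.0 + 1.75 = 5.5
print(relative_entropy(ppm, uniform))  # 0.25 + 0.0 + 0.25 = 0.5
```

With a uniform background over four letters, each column's relative entropy equals 2 bits minus that column's contribution to the first sum, so the uniform middle column contributes nothing.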
Often, it is more useful to calculate the information content using the background letter frequencies of the sequences being studied rather than assuming equal probabilities for each letter (e.g., the GC-content of the DNA of thermophilic bacteria ranges from 65.3 to 70.8%,[7] so a motif of ATAT would contain much more information than a motif of CCGG). The equation for information content then becomes:

$$\sum_{i,j} p_{i,j} \log(p_{i,j} / p_b)$$
where $p_b$ is the background frequency for that letter. This corresponds to the Kullback–Leibler divergence or relative entropy. However, it has been shown that when using PSSM to search genomic sequences (see below) this uniform correction can lead to overestimation of the importance of the different bases in a motif, due to the uneven distribution of n-mers in real genomes, leading to a significantly larger number of false positives.[8]

Using PWMs

There are various algorithms to scan for hits of PWMs in sequences. One example is the MATCH algorithm,[9] which has been implemented in the ModuleMaster.[10] More sophisticated algorithms for fast database searching with nucleotide as well as amino acid PWMs/PSSMs are implemented in the possumsearch software and are described by Beckstette, et al. (2006).[11]

References

[9] Kel AE, et al. (2003). "MATCH™: a tool for searching transcription factor binding sites in DNA sequences". Nucleic Acids Research. 31 (13): 3576–3579. doi:10.1093/nar/gkg585. PMC 169193. PMID 12824369.
[10] Wrzodek, Clemens; Schröder, Adrian; Dräger, Andreas; Wanke, Dierk; Berendzen, Kenneth W.; Kronfeld, Marcel; Harter, Klaus; Zell, Andreas (9 October 2009).
[11] Beckstette, M.; et al. (2006). "Fast index based algorithms and software for matching position specific scoring matrices". BMC Bioinformatics. 7: 389. doi:10.1186/1471-2105-7-389. PMC 1635428. PMID 16930469.

7 External links