PSSM

Scoring Matrices
Scoring matrices, PSSMs

Position Specific Scoring Matrix (PSSM)
A position specific scoring matrix (PSSM) is a matrix based on the amino acid frequencies (or nucleic acid frequencies) at every position of a multiple alignment.
From these frequencies, a PSSM is calculated that assigns higher scores to residues that appear more often than expected by chance at a given position.
Creating a PSSM: Example
NTEGEWI
NITRGEW
NIAGECC
Amino acid frequencies at every
position of the alignment:
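The per-column counts for the three example sequences can be tallied with a short script. This is a minimal sketch; the helper name column_counts is hypothetical and not part of the original slides.

```python
from collections import Counter

# The three aligned example sequences from the slide.
alignment = ["NTEGEWI", "NITRGEW", "NIAGECC"]

def column_counts(seqs):
    """Return one Counter per alignment column (residue -> count)."""
    # zip(*seqs) walks the alignment column by column.
    return [Counter(column) for column in zip(*seqs)]

counts = column_counts(alignment)
for position, column in enumerate(counts, start=1):
    print(position, dict(column))
```

A Counter returns 0 for residues that never occur in a column, which is the case the pseudocount step has to handle.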
Amino acids that do not appear at a specific position of a multiple alignment must also be considered in order to model every possible sequence and to have calculable log-odds scores (the logarithm of zero is undefined). A simple procedure called pseudocounts assigns minimal scores to residues that do not appear at a certain position of the alignment, according to the following equation:
Score(i, j) = (Frequency(i, j) + pseudocount) / (N + 20 * pseudocount)
Where
Frequency(i, j) is the frequency of residue i in column j (the count of occurrences).
pseudocount is a number greater than or equal to 1.
N is the number of sequences in the multiple alignment.
20 is the number of amino acids.
In this example, N = 3 and let's use pseudocount = 1:
Score(N) at position 1 = 3/3 = 1.

Score(I) at position 1 = 0/3 = 0.
Readjust:
Score(I) at position 1 -> (0+1) / (3+20) = 1/23 = 0.043.
Score(N) at position 1 -> (3+1) / (3+20) = 4/23 = 0.174.
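The readjustment can be checked numerically. The sketch below uses the formula implied by the worked numbers, (count + pseudocount) / (N + 20 * pseudocount); the function name adjusted_frequency is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def adjusted_frequency(count, n_seqs, pseudocount=1):
    """Pseudocount-adjusted frequency of one residue in one column."""
    return (count + pseudocount) / (n_seqs + len(AMINO_ACIDS) * pseudocount)

alignment = ["NTEGEWI", "NITRGEW", "NIAGECC"]
n = len(alignment)
first_column = [seq[0] for seq in alignment]  # ['N', 'N', 'N']

freq_N = adjusted_frequency(first_column.count("N"), n)
freq_I = adjusted_frequency(first_column.count("I"), n)
print(round(freq_N, 3), round(freq_I, 3))  # 0.174 0.043

# The adjusted frequencies over all 20 residues still sum to 1.
total = sum(adjusted_frequency(first_column.count(aa), n) for aa in AMINO_ACIDS)
```

Adding the pseudocount to every residue and 20 pseudocounts to the denominator keeps each column a proper probability distribution.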
The PSSM is obtained by taking the logarithm (base 10 in this example) of the values obtained above divided by the background frequency of the residues.
To simplify this example, we'll assume that every amino acid appears equally often in protein sequences, i.e. fi = 0.05 for every i:
PSSM Score(I) at position 1 = log10((1/23) / 0.05) = -0.061.
PSSM Score(N) at position 1 = log10((4/23) / 0.05) = 0.541.
The matrix assigns positive scores to residues that appear more often than expected by chance and negative scores to residues that appear less often than expected by chance.
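Putting the steps together, a complete PSSM for the example alignment might be built as follows. This is a sketch under the slide's simplifying assumptions (uniform background of 0.05, pseudocount of 1, base-10 logarithm); build_pssm is a hypothetical name.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(seqs, pseudocount=1, background=None):
    """Return a list (one dict per column) of log10-odds scores."""
    if background is None:
        # Assumption from the slide: every amino acid is equally likely.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    denominator = len(seqs) + len(AMINO_ACIDS) * pseudocount
    pssm = []
    for column in zip(*seqs):
        counts = Counter(column)
        scores = {}
        for aa in AMINO_ACIDS:
            frequency = (counts[aa] + pseudocount) / denominator
            scores[aa] = math.log10(frequency / background[aa])
        pssm.append(scores)
    return pssm

pssm = build_pssm(["NTEGEWI", "NITRGEW", "NIAGECC"])
print(round(pssm[0]["N"], 3), round(pssm[0]["I"], 3))  # 0.541 -0.061
```

The conserved N at position 1 scores positive, while the unobserved I scores slightly negative, matching the sign rule stated above.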
Using a PSSM
To search for matches to a PSSM, scan along the sequence using a window the length (L) of the PSSM.
The matrix is slid along the sequence one residue at a time, and the scores of the residues in every region of length L are added.
Scores that are higher than an empirically predetermined threshold are reported.
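The sliding-window search can be sketched as below. The query sequence and the threshold of 1.0 are made up for illustration, and the PSSM is rebuilt with a uniform background and a pseudocount of 1 as in the worked example; build_pssm and scan are hypothetical names.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(seqs, pseudocount=1):
    """log10-odds PSSM with a uniform background (assumption from the example)."""
    background = 1 / len(AMINO_ACIDS)
    denominator = len(seqs) + len(AMINO_ACIDS) * pseudocount
    return [
        {aa: math.log10((Counter(col)[aa] + pseudocount) / denominator / background)
         for aa in AMINO_ACIDS}
        for col in zip(*seqs)
    ]

def scan(sequence, pssm, threshold):
    """Slide a window of length L = len(pssm) and report windows above threshold."""
    length = len(pssm)
    hits = []
    for start in range(len(sequence) - length + 1):
        window = sequence[start:start + length]
        # Sum the column-specific score of each residue in the window.
        # Residues outside the 20 standard amino acids would raise KeyError here.
        score = sum(pssm[j][aa] for j, aa in enumerate(window))
        if score > threshold:
            hits.append((start, window, score))
    return hits

pssm = build_pssm(["NTEGEWI", "NITRGEW", "NIAGECC"])
hits = scan("MKANIEGEWCLL", pssm, threshold=1.0)  # hypothetical query and threshold
for start, window, score in hits:
    print(start, window, round(score, 2))
```

In practice the threshold would be calibrated empirically, for example against sequences known to belong or not belong to the family.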
Advantages of PSSM
Weights sequence according to observed diversity specific to the family of interest
Minimal assumptions
Easy to compute
Can be used in comprehensive evaluations.
More sophisticated PSSMs
From less to more complicated
1. PSSM with pseudocounts.

2. Giving pseudocounts less weight when more
alignment data is available.
3. Weight pseudocount amino acids by their
frequency of occurrence in proteins.
4. Instead of giving pseudocounts all the same
value, weight them by their similarity to the
consensus (like BLOSUM62 does) at each
position. (PSI-BLAST method).
5. Combine 2 & 4 (Dirichlet mixture method).
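Items 2 and 3 of the list can be illustrated with one common formulation, in which a fixed total pseudocount weight B is distributed over the residues in proportion to their background frequency: (count + B * p) / (N + B). This is only a sketch of that idea, not the PSI-BLAST or Dirichlet mixture method; the function name, the value B = 20 and the uniform background are assumptions chosen so that the result matches method 1 for N = 3.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def smoothed_frequencies(column, background, total_pseudocount):
    """Column frequencies with background-weighted pseudocounts.

    The pseudocount share of each estimate is B / (N + B), so it shrinks
    automatically as the number of aligned sequences N grows.
    """
    n = len(column)
    counts = Counter(column)
    return {
        aa: (counts[aa] + total_pseudocount * p) / (n + total_pseudocount)
        for aa, p in background.items()
    }

# Hypothetical uniform background; real values would come from protein databases.
uniform = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

few = smoothed_frequencies("III", uniform, total_pseudocount=20)
many = smoothed_frequencies("I" * 30, uniform, total_pseudocount=20)
print(round(few["I"], 3), round(many["I"], 3))
```

With three sequences the conserved isoleucine gets 4/23, as in method 1; with thirty sequences the observed data dominate and the estimate moves toward 1.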
[Figure: a PSSM column with a perfectly conserved isoleucine, with different methods used to calculate the scores. Panel labels: Method 1 and standard BLOSUM62 matrix; Method 5.]
Using Hidden Markov models to describe sequence alignment profiles
A profile HMM can represent a sequence alignment profile similar to how a PSSM does.
A profile HMM includes information on the amino acid consensus at each position in the alignment like a PSSM.
A profile HMM also has position-specific scores for gap insertion and extensions.
Background: Creating HMMs
To create an HMM to model data we need to determine two things:
The structure/topology of the HMM: states and transitions.
The values of the parameters: emission and transition probabilities.
Determining the parameters is called training.
An HMM structure/topology
M = match state (score the aa in the sequence at this position in the profile)
I = insertion (w.r.t. profile: insert gap characters in profile)
D = deletion (w.r.t. sequence: insert gap characters in sequence)
M1 is first aa in the profile, M2 is second, etc.
Example HMMER parameters
NULE 595 -1558 85 338 -294 453 -1158 (...) -21 -313 45 531 201 384
HMM A C D E F G H (...) m->m m->i m->d i->m i->i d->m d->d b->m m->e
1 -1084 390 -8597 -8255 -5793 -8424 -8268 (...) 1
- -149 -500 233 43 -381 399 106 (...)
C -1 -11642 -12684 -894 -1115 -701 -1378 -16 *
2 -2140 -3785 -6293 -2251 3226 -2495 -727 (...) 2
- -149 -500 233 43 -381 399 106 (...)
C -1 -11642 -12684 -894 -1115 -701 -1378 * * (...)
76 -2255 -5128 -302 363 -784 -2353 1398 (...) 103
- -149 -500 233 43 -381 399 106 (...)
E -1 -11642 -12684 -894 -1115 -701 -1378 * *
77 -633 879 -2198 -5620 -1457 -5498 -4367 (...) 104
- * * * * * * * (...)
C * * * * * * * * 0
//
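The integer scores in this listing can be turned back into odds ratios. The sketch below assumes the HMMER2 text save-file convention, in which each score is a base-2 log-odds value multiplied by 1000 and rounded, and `*` stands for minus infinity; this should be verified against the documentation of the HMMER version in use. The helper name score_to_odds is hypothetical, and only the seven columns visible in the listing (A to H) are used.

```python
def score_to_odds(field, scale=1000.0):
    """Convert one score field to an odds ratio (probability / null probability).

    Assumes score = round(scale * log2(p / null)); '*' means p = 0.
    """
    if field == "*":
        return 0.0
    return 2.0 ** (int(field) / scale)

# Match-emission line for node 2, truncated to the columns shown in the listing.
match_line = "2 -2140 -3785 -6293 -2251 3226 -2495 -727".split()
node, scores = match_line[0], match_line[1:]
alphabet = "ACDEFGH"  # column headers visible in the listing

odds = {aa: score_to_odds(s) for aa, s in zip(alphabet, scores)}
best = max(odds, key=odds.get)
print(node, best, round(odds[best], 2))
```

Under that assumption, a positive score such as 3226 for F means the residue is several times more likely at this node than under the null model, while large negative scores mark strongly disfavored residues.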
[Figure: a profile HMM with match state probabilities shown. AAs PATH is the consensus sequence.]
Building a profile HMM
Pick an HMM structure/topology.
Estimate initial parameters.
Train the HMM by running sequences through it.
Transitions that get used are given higher probabilities; those rarely used are given lower probabilities.
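The idea that frequently used transitions receive higher probabilities can be sketched by counting transitions along known state paths. This is a simplification: it assumes the path of each training sequence through the model is already known and allows every state-to-state transition, whereas a real profile HMM restricts transitions by its topology and is usually trained with an iterative procedure such as Baum-Welch. The state paths and the function name are made up for illustration.

```python
from collections import Counter, defaultdict

def estimate_transitions(paths, states, pseudocount=1.0):
    """Estimate transition probabilities from observed state paths."""
    counts = defaultdict(Counter)
    for path in paths:
        # Count each consecutive pair of states along the path.
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    probabilities = {}
    for current in states:
        total = sum(counts[current].values()) + pseudocount * len(states)
        probabilities[current] = {
            following: (counts[current][following] + pseudocount) / total
            for following in states
        }
    return probabilities

# Hypothetical paths: M = match, I = insertion, D = deletion.
paths = ["MMMM", "MMIM", "MDMM"]
probs = estimate_transitions(paths, states="MID")
print(probs["M"])
```

Here match-to-match is used five times and the other two transitions out of M once each, so M->M ends up with the highest probability; the pseudocount keeps unused transitions from dropping to zero.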
Protein profile HMMs
Better (in theory) representations than PSSMs.
More complicated.
Not hand-tuned by curators.
Used in some protein profile databases:
Pfam (http://pfam.sanger.ac.uk/)
SMART (http://smart.embl-heidelberg.de/)
Difficult to describe in human readable formats.
Schuster-Böckler et al., 2004 (http://www.biomedcentral.com/1471-2105/5/7)
