swat.doc

/*****************************************************************************
# Copyright (C) 1993-1996 by Phil Green.
# All rights reserved.
#
# This software is part of a beta-test version of the swat/cross_match/phrap
# package. It should not be redistributed or
# used for any commercial purpose, including commercially funded
# sequencing, without written permission from the author and the
# University of Washington.
#
# This software is provided ``AS IS'' and any express or implied
# warranties, including, but not limited to, the implied warranties of
# merchantability and fitness for a particular purpose, are disclaimed.
# In no event shall the authors or the University of Washington be
# liable for any direct, indirect, incidental, special, exemplary, or
# consequential damages (including, but not limited to, procurement of
# substitute goods or services; loss of use, data, or profits; or
# business interruption) however caused and on any theory of liability,
# whether in contract, strict liability, or tort (including negligence
# or otherwise) arising in any way out of the use of this software, even
# if advised of the possibility of such damage.
#
#*****************************************************************************/

/* SWAT: a program for searching one or more DNA or protein query
sequences, or a query profile, against a sequence database, using (an efficient
implementation of) the Smith-Waterman or Needleman-Wunsch algorithms,
with linear (affine) gap penalties. For each match an empirical measure of
statistical significance derived from the observed score distribution
is computed.

5/11/94 N.B. NW option will not work correctly until function call has
been made similar to that for s-w.

Usage:
swat query_file_name database_file_name [-option value] [-option value] ...

Available options are :

option name value to be provided default value

-matrix (or -M) name of file containing score matrix or profile BLOSUM50

-gap_init gap initiation penalty (= penalty for -12
first residue in a gap)
-gap_ext gap extension penalty (= penalty for -2
each subsequent residue)
-ins_gap_ext insertion gap extension penalty (= penalty for gap_ext
each subsequent residue, for insertions in
subject relative to query)
-del_gap_ext deletion gap extension penalty (= penalty for gap_ext
each subsequent residue, for deletions in
subject relative to query)
)
-end_gap gap extension penalty at ends of -1
sequences (N-W algorithm only)
-E E-value cutoff; alignments with an E-value 1.0
higher than this will not be displayed.
-z z-score cutoff; alignments with a z-value 6.0
lower than this will not be displayed.
(See below for description of z-scores).
-N maximum no. of displayed alignments 20
-truncatedb no. of entries at start of db that will
be searched; useful for testing purposes only.
The default is to search the entire database.
-nw [flag; no value should be provided. Indicates that
Needleman-Wunsch algorithm should be used instead of
Smith-Waterman]
-raw [flag indicating that z and E-values should not be computed,
and matches should be ranked by their raw alignment
scores.]
-file [flag to create an output file for each query, giving
the "complete" results of the search (but without alignments).
This option is useful primarily for testing the relative
power of different score matrices and gap penalties and
will not be needed by most users.
The file name is constructed automatically for each query and
should NOT be given on the command line; it consists of the
query sequence ID, followed by the suffix ".allscores".
On completion of the run the file contains one line for each
database entry giving the ID, length, score, and z-score.
Entries are sorted by decreasing z. This file is produced
independently of the usual program output and is unaffected
by the -z, -N, or -E settings.]

The query file (which may contain more than one query sequence,
unless a profile search is to be performed in which case it can
contain only one query sequence) and database file must each be in FASTA
format: Each entry must have a single header line with the leading
character '>' followed immediately by the ID of the sequence; the
remainder of this line may contain an optional description. The
header line is followed by lines containing the sequence. Any
non-alphabetic characters are ignored, lower case letters are
automatically converted to upper case, and any letter not in the score
matrix alphabet is converted to the last character in this alphabet
(typically an 'X' or '*').

N.B. NO LONGER CASE SENSITIVE -- all row and column labels should be
capital letters.

An appropriate matrix must be provided, in order to score alignments
between the query (or profile) and each "subject" (database sequence).
The matrix may be either a conventional score matrix, or a profile,
but must contain only integer score values. The format of the matrix
file is as follows: Any blank lines, and lines beginning with # are
ignored. A line beginning with the word "GAP" is taken to contain the
gap initiation and gap extension penalties (a pair of integers
separated by spaces); this line must precede the line containing the
column labels. The values given here will be overridden by command
line parameters. The first non-blank line not beginning with # or GAP
is assumed to contain the column labels, each of which should be a
letter or '*'; there should be one column for each distinct type of
residue that may appear in a subject sequence. The same label should
not be used for two different columns. Succeeding lines should
contain the rows of the matrix. Each row may optionally include a
label (i.e. a single letter or '*' at the beginning of the line); if
these are omitted, the row labels are taken to be the same as the
column labels. If the matrix is a profile, row labels must be
included, a single query sequence must be provided in the query file
whose positions correspond 1-1 to the rows of the profile, and the
sequence of this query must match the list of row labels (the only
role of the query sequence in this case is to identify positions in
printed alignments -- it plays no role in the analysis). If the
matrix is a conventional score matrix, there should be one row for
each type of residue that may appear in the query.
Given an alignment, the score for an aligned pair of residues is
determined by looking up the matrix entry whose column label matches
the subject residue, and whose row corresponds to the query residue,
either by position (if the matrix is a profile) or by label (if it is
a conventional score matrix). Case is significant (so 'A' does not
match 'a'). Residues for which there is no matching label are
assigned to the last column (or row).

Note that three different options, E, z, and N, influence how many
alignments are displayed. If a single one of these is set on the
command line it becomes the decisive criterion; if more than one is
set, the most stringent is used. If none are set, the most stringent
default value for the three options is used. By default, matches are
sorted and displayed by decreasing z-score (see below), or
equivalently E-value, rather than raw alignment score, unless the
flags -raw or -nw are set. N.B. For Needleman-Wunsch searches, or if
the database is small, z- and E-scores are not computed and the
matches are instead ranked by the raw alignment score.

Current restrictions:

(i) In Smith-Waterman searches, if the alignment score for a database entry
is less than -1 * gap_init it will be reported as 0. This restriction can
be relaxed (at some cost in execution time) by moving the position of
the corresponding block of statements in the source code. It is irrelevant
for most purposes unless a very large gap_init value is used.

(ii) ints must be at least 32 bits (so this program may not work with some
compilers on DOS machines).

(iii) The statistical analyses currently assume that "chance" scores are
very unlikely to exceed 200. This assumption is reasonable when
protein searches are done using the usual score matrices; it may
not be reasonable for searches of nucleotide databases. It can
easily be relaxed by changing appropriate program parameters.

(iv) z- and E-values are only computed for Smith-Waterman searches. They
may be unreliable for some combinations of score matrix / gap penalties.
(see below), or if the database is small.

(v) A single set of gap penalties is allowed; i.e. they cannot be
made position-specific.

Output: Mostly self-explanatory. includes a summary of the statistical
analysis of the score distribution, and the alignments, scores and
E-values of the best matches. Matches are ranked by decreasing
z-score.

Statistical procedures/ ranking of matches:

(N.B. The following applies only to Smith-Waterman searches.)

When the database sequences vary greatly in length (as is usually the
case), theoretical arguments and empirical studies suggest
that the raw alignment score is less sensitive, for discriminating
true homologies from chance matches, than an adjusted score which has
a null distribution independent of the length of the database
sequence. SWAT attempts to compute such an adjusted score in the form
of a "z-score", defined for a given database sequence by

z = (S - f(n)) / sqrt(g(n))

where S and n are the raw alignment score and length of the database
sequence, and f(n) and g(n) are the predicted mean score and variance
for sequences of length n. z-scores thus should have mean 0 and
standard deviation 1, independent of n. Like the raw score S, but
unlike the E-value, z should be relatively independent of database
size (although it may depend somewhat on the characteristics of
database sequences); moreover it has a more easily interpretable scale
than the raw score. At present, f and g are estimated from the
observed score distribution for the database search, by regressing the
observed scores and squared residuals, respectively, against log(n).
This functional form has some theoretical justification (see below)
and appears to work well for the score matrices and gap penalties in
common use, although we are still exploring whether other functions
may work better in some situations. (The regressions are actually
performed twice, the results of the first one being used to identify
high-scoring outliers (e.g. true homologies) which are then eliminated
prior to performing the second regression).

When the search has "local" behavior (i.e. the score distribution
conforms to Karlin-Altschul type formulae), then known theoretical
results imply that f(n) is in fact a linear function of log(n), g(n)
is constant, and z is distributed as the (unique) Gumbel distribution
with mean 0 and variance 1, independently of n. Consequently it is
easy to convert z-values to E-values in this case, and they are
monotonically related to each other. For searches that do not have
local behavior, it is not known what the appropriate theoretical forms
for f and g are, nor what the distribution of z is nor even whether
this distribution is independent of n. Nonetheless z-scores should
still be preferable to raw scores, and consequently the same
procedures for computing z and E are used by SWAT in this case. They
appear to work reasonably well at least for for PAM250 and gap
penalties -12 and -2, a particular combination having non-local
behavior that Pearson has found to be optimal for Smith-Waterman
searches. Our tests using random queries indicate that the E-values
obtained in this case tend to be somewhat conservative.

A second procedure for estimating statistical significance is also
included in the code but no longer used. It explictly assumes the
search has local behavior and is based on fitting an extreme-value
(Gumbel) distribution to the observed score distribution using maximum
likelihood estimation (inspired by, but simpler than, Mott's
procedure). (The idea of using the observed score distribution to
estimate significance appears to be due to Collins and Coulson). The
top part of the distribution (by default, the top 5% of the scores) is
ignored, and a censored Gumbel distribution is fitted to the remainder
to get estimates of lambda and K (in the notation of Karlin and
Altschul). (This is different from Mott's procedure, which uses an
outlier detection criterion that may miss many true positives. Also,
we assume a simpler form for the distribution that does not allow
lambda to vary for different database sequences; this makes it
possible to use an iterative algorithm for finding the maximum
likelihood parameter estimates that should converge much more
rapidly.)

For brief description of the implementation of the s-w algorithm, see
header to smith_wat.c.