
Blast Help

 

 

GENERAL SETTINGS

A region of the query sequence can be used for BLAST searching. Enter the range, in nucleotides or protein residues, in the "From" and "To" boxes provided under "Set Subsequence". For example, to limit matches to the region from nucleotide 24 to nucleotide 200 of a query sequence, enter From = 24 and To = 200. If one of the limits you enter is out of range, the intersection of the [From,To] and [1,length] intervals will be searched, where length is the length of the whole query sequence.
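The out-of-range rule can be illustrated with a minimal Python sketch. The function name clamp_subsequence is hypothetical and is not part of BLAST; it only shows how the [From,To] and [1,length] intervals intersect, using 1-based inclusive positions.

```python
def clamp_subsequence(seq, start, end):
    """Return the part of seq covered by [start, end] intersected with [1, len(seq)].

    Positions are 1-based and inclusive, as in the From/To boxes.
    """
    lo = max(start, 1)          # a From below 1 is raised to 1
    hi = min(end, len(seq))     # a To beyond the end is lowered to the length
    if lo > hi:                 # the two intervals do not overlap
        return ""
    return seq[lo - 1:hi]       # convert to Python's 0-based, half-open slice


query = "ACGTACGT"
print(clamp_subsequence(query, 3, 5))     # both limits in range
print(clamp_subsequence(query, 6, 200))   # To is out of range and is clamped
```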

 

OUTPUT OPTIONS

This option controls whether the output page contains links to GenBank (Yes) or not (No); the default is Yes. This option can be applied to all the programs.

Restricts the number of short descriptions of matching sequences reported to the number specified; default limit is 100 descriptions. This option can be applied to all the programs.

 

ADVANCED OPTIONS

The EXPECT threshold is the statistical significance threshold for reporting matches against database sequences. It is common practice to use the expectation value (or E value) as a measure of statistical significance, and it is in some ways the most useful of the scores that BLAST provides. The E value estimates the number of alignments one would expect to find with a score greater than or equal to that of the observed alignment in a search against a random database of the same composition, according to the stochastic model of Karlin and Altschul (1990). An E value greater than 1 therefore indicates that the alignment has probably occurred by chance, and that the query sequence has been aligned to a sequence in the database to which it is not related. E values less than 0.1 or 0.05 are typically taken to represent biological significance. If the E value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported; increasing the threshold reports less stringent matches. Fractional values are acceptable. The default value is 10, meaning that 10 matches are expected to be found merely by chance. This option can be applied to all the programs.
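The effect of the threshold can be pictured with a short Python sketch. The hit names and E values below are invented for illustration, and report_hits is a hypothetical helper, not a BLAST function; it simply drops every match whose E value is greater than the threshold.

```python
def report_hits(hits, expect=10.0):
    """Keep (name, e_value) pairs whose E value does not exceed the EXPECT threshold."""
    return [(name, e_value) for name, e_value in hits if e_value <= expect]


# Made-up hits, for illustration only.
hits = [("seqA", 1e-30), ("seqB", 0.04), ("seqC", 3.2), ("seqD", 25.0)]

print(report_hits(hits))                # default threshold of 10
print(report_hits(hits, expect=0.05))   # a more stringent threshold
```

With the default of 10 only the last hit is dropped, while a threshold of 0.05 keeps just the two strongest matches.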

Filter (low-complexity): Masking or Filtering is the removal of repeated or low complexity regions from a sequence in order to improve the sensitivity of sequence similarity searches performed with that sequence.

This option masks off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov & Lipman. SEG filters low complexity regions in amino acid sequences, while DUST filters these regions in nucleic acid sequences. Masked residues are represented as "X" in a protein alignment and as "N" in a nucleotide alignment.

Filtering can eliminate statistically significant but biologically uninteresting reports from the blast output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. 

Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN and SEG for the other programs.

It is not unusual for nothing at all to be masked by SEG, when applied to sequences in  SWISS-PROT, so filtering should not be expected to always yield an effect.

 

Masking: Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect.

Mutational events include not only substitutions but also insertions and deletions. A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, the introduction of a gap causes a fixed amount (the gap existence penalty) to be deducted from the alignment score. Extending the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment. Increasing the gap cost results in alignments with fewer gaps. The penalty for creating a gap should be large enough that gaps are introduced only where needed, and the penalty for extending a gap should take into account the likelihood that insertions and deletions occur over several residues at a time. Only a limited number of combinations are available for these parameters. Some suggested values are existence penalties of 7, 8 and 9 with an extension penalty of 2, and existence penalties of 10, 11 and 12 with an extension penalty of 1.
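As a rough illustration of how the two penalties combine, the Python sketch below assumes that a gap of k residues costs the existence penalty plus k times the extension penalty. That convention, and the function name gap_penalty, are assumptions made for this example; the defaults are taken from the suggested values above.

```python
def gap_penalty(length, existence=11, extension=1):
    """Cost deducted for one gap of `length` residues.

    Assumed convention: existence + extension * length.
    """
    if length <= 0:
        return 0
    return existence + extension * length


# One gap of three residues is much cheaper than three separate gaps of one
# residue each, which is why indels tend to be grouped into a single gap.
print(gap_penalty(3))        # a single gap of length 3
print(3 * gap_penalty(1))    # three gaps of length 1
```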

See cost to open a gap.

When evaluating a sequence alignment, one would like to know how meaningful it is. This requires a scoring matrix, or a table of values that describes the probability of a biologically meaningful amino-acid or nucleotide residue-pair occurring in an alignment. Typically, when two nucleotide sequences are being compared, all that is being scored is whether or not two bases at a given position are the same. All matches are given the same score (Reward for a match, typically +1 or +5), as are all mismatches (Penalty for mismatch, typically -1 or -4). Here, the default values are -3 for the Penalty for a mismatch, and 1 for the Reward for a match. As these parameters apply to comparisons between two nucleotide sequences, they are used only by the blastn program. The analogous parameters for comparisons between two protein sequences (for all the programs except blastn) are defined in the matrix parameter.
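A minimal Python sketch of this scoring scheme, using the default values given above (reward 1, penalty -3). The function name ungapped_score is hypothetical, and the example scores two already aligned sequences of equal length without gaps.

```python
def ungapped_score(a, b, reward=1, penalty=-3):
    """Score two aligned nucleotide sequences position by position."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return sum(reward if x == y else penalty for x, y in zip(a, b))


# Seven matches and one mismatch: 7 * 1 + 1 * (-3)
print(ungapped_score("ACGTACGT", "ACGTTCGT"))
```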

See Penalty for mismatch.

A key element in evaluating the quality of a pairwise sequence alignment is the substitution matrix, which assigns a score for aligning any possible pair of residues. This matrix contains values proportional to the probability that one amino acid mutates into another, for all pairs of amino acids. Such matrices are constructed by assembling a large and diverse sample of verified pairwise alignments of amino acids. If the sample is large enough to be statistically significant, the resulting matrices should reflect the true probabilities of mutations occurring through a period of evolution. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching with: PAM30, PAM70, BLOSUM45, BLOSUM62 and BLOSUM80. In general, the PAM120 matrix is considered a good scoring matrix for closely related sequences, while the PAM250 matrix is more appropriate for more distantly related sequences. For the BLOSUM matrices, the number associated with a matrix (such as BLOSUM62 or BLOSUM80) indicates the cutoff value for the percentage sequence identity that defines the clusters. Lower cutoff values allow more diverse sequences into the groups, and the corresponding matrices are therefore appropriate for examining more distant relationships. As these parameters apply to comparisons between two amino acid sequences, they are used by all the programs except blastn, that is to say, by blastp, blastx, tblastn and tblastx, and the default value is BLOSUM62. See Penalty for mismatch for a comparison with the scores used in nucleic acid sequence comparisons.

The BLAST algorithm runs in three steps. In the first step, given a query sequence of length L, BLAST derives a list of short sequences, or words, of length w, that make up the query. There are at most L - w + 1 such words. This word list is then expanded to include all high-scoring matching words, keeping only those whose score is greater than the neighborhood word score threshold T when scored using a scoring matrix such as PAM250 or BLOSUM62. For typical parameter values, this results in about 50 words per residue of the query sequence. The default word lengths are 11 for nucleotide sequences (blastn) and 3 for amino-acid sequences (others).
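The first part of this step, listing the words of length w in the query, can be sketched in a few lines of Python. The function name query_words is hypothetical, and the sketch does not perform the neighborhood expansion with the threshold T, which requires a scoring matrix.

```python
def query_words(query, w):
    """List every overlapping word of length w in the query (at most L - w + 1 words)."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


words = query_words("QIKDLLV", 3)   # L = 7, w = 3
print(words)
print(len(words))                   # 7 - 3 + 1 words
```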

The effective length of the database is the result of dividing the effective length of the search space by the effective length of the query. The option can be applied in all the programs, and the default value is 0, which means the real length of the database.

The effective length of the search space is the product of the effective lengths of the query sequence and the database. The option can be applied in all the programs, and the default value is 0, which means the real length of the search space.
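The relation between the two options can be written out as simple arithmetic. The Python sketch below uses hypothetical function names and invented lengths; resolve_length only mirrors the rule that a value of 0 means "use the real length".

```python
def search_space(query_len, db_len):
    """Effective search space: product of the query and database lengths."""
    return query_len * db_len


def db_length(space, query_len):
    """Effective database length: search space divided by the query length."""
    return space // query_len


def resolve_length(option_value, real_length):
    """An option value of 0 falls back to the real length."""
    return real_length if option_value == 0 else option_value


space = search_space(200, 1_000_000)    # invented lengths, for illustration
print(space)
print(db_length(space, 200))
print(resolve_length(0, 1_000_000))     # option left at its default of 0
```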

X drop-off value for BLAST extensions: an extension is abandoned once its score falls more than this amount below the best score found so far. This can apply to all the programs, and the default values are 20 for blastn and 7 for all the other programs.

X drop-off value for final gapped alignments: the same cutoff, applied when the final gapped alignments are computed. This can apply to all the programs, and the default values are 50 for blastn and 25 for all the other programs.

X drop-off value for gapped alignments: the same cutoff, applied to gapped extensions. This can apply to all the programs except blastn, that is to say, blastp, blastx, tblastn and tblastx, and the default value is 15.
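The idea of a drop-off can be sketched in Python as an ungapped extension to the right that stops once the running score has fallen more than X below the best score seen so far. This is a simplified illustration, not the BLAST implementation: the function name extend_right is hypothetical, the scores are raw match and mismatch scores rather than the units used by the form, and gapped extension is not shown.

```python
def extend_right(query, subject, q_start, s_start, x_drop=20, reward=1, penalty=-3):
    """Extend an ungapped alignment to the right from (q_start, s_start).

    Returns (best_score, best_length). The extension stops when the running
    score drops more than x_drop below the best score found so far.
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += reward if query[q_start + i] == subject[s_start + i] else penalty
        i += 1
        if score > best:
            best, best_len = score, i      # new best: remember where it ends
        elif best - score > x_drop:
            break                          # dropped too far: give up
    return best, best_len


# With a generous drop-off the extension gets past the single mismatch;
# with a small one it stops right after it.
print(extend_right("ACGTAACGT", "ACGTTACGT", 0, 0, x_drop=5))
print(extend_right("ACGTAACGT", "ACGTTACGT", 0, 0, x_drop=2))
```

A larger drop-off lets an extension survive longer runs of mismatches, at the cost of more computation.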

BLAST searches can be limited to the results of an Entrez query against the database chosen. This can be used to limit searches to subsets of the BLAST databases. To use this option, visit the Entrez page at NCBI and make a query. In the output page, select Display "GI list" and save the list as a text file on your computer. You can then upload this file from the BLAST analysis page.

 

UPLOAD:

You can either paste a set of sequences in any supported format, or upload a file from your computer with the sequences you want to analyze.

BLAST accepts a number of different types of input and automatically determines the format. To allow this feature there are certain conventions required with regard to the input of identifiers (e.g., accessions or gi's). These are described in the third section below. Accepted input types are:

  1. FASTA (PEARSON and LIPMAN, 1988): A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

    >gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
    QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCKMKILELPFELPF
    ASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTSVLMALELPF
    GMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHPFELPF
    LFLIKHNPTNTIVYFGRYWSP

    Blank lines are not allowed in the middle of FASTA input.

    Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue). The nucleic acid codes supported are:

    A --> adenosine           M --> A C (amino)
    C --> cytidine            S --> G C (strong)
    G --> guanine             W --> A T (weak)
    T --> thymidine           B --> G T C
    U --> uridine             D --> G A T
    R --> G A (purine)        H --> A C T
    Y --> T C (pyrimidine)    V --> G C A
    K --> G T (keto)          N --> A G C T (any)
                              -  gap of indeterminate length

    For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:

    A  alanine                         P  proline
    B  aspartate or asparagine         Q  glutamine
    C  cysteine                        R  arginine
    D  aspartate                       S  serine
    E  glutamate                       T  threonine
    F  phenylalanine                   U  selenocysteine
    G  glycine                         V  valine
    H  histidine                       W  tryptophan
    I  isoleucine                      Y  tyrosine
    K  lysine                          Z  glutamate or glutamine
    L  leucine                         X  any
    M  methionine                      *  translation stop
    N  asparagine                      -  gap of indeterminate length
    
  2. Bare Sequence: This may be just lines of sequence data, without the FASTA definition line, e.g.:

    QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
    KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
    VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
    FLFLIKHNPTNTIVYFGRYWSP

    It can also be sequence interspersed with numbers and/or spaces, such as the sequence portion of a GenBank/GenPept flatfile report:

      1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
     61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
    121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
    181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp

    Blank lines are not allowed in the middle of bare sequence input.

     

  3. Identifiers: Normally these are simply accession codes, version codes or gi's (e.g., p01013, AAA68881.1, 129295), but a bar-separated NCBI sequence identifier (e.g., gi|129295) will also be accepted. These NCBI sequence identifiers have a very specific syntax: the identifier may consist of only one token (i.e., word), so spaces between letters in the input will cause it to be treated as bare sequence (spaces before or after the identifier are allowed). Examples of illegal input are:

Example 1. ACCESSION P01013 

Example 2. AAA68881. 1 

Example 3. gi| 129295

In the first example, "ACCESSION" must be removed. In the second one, there is a space before the version number of the accession. Finally, in the third example, there is a space after the bar ("|").
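The input conventions above can be checked before submitting a query with a small Python sketch. Both function names are hypothetical, and the sketch does not reproduce how BLAST itself detects the input format: it only tests the single-token rule for identifiers and removes digits and spaces from a bare sequence copied from a GenBank/GenPept flatfile.

```python
def is_single_token(text):
    """True when the input is one word; surrounding spaces are allowed."""
    return len(text.split()) == 1


def clean_bare_sequence(text):
    """Remove digits and whitespace from a bare sequence and map it to upper case."""
    return "".join(ch for ch in text if not ch.isdigit() and not ch.isspace()).upper()


for candidate in ["AAA68881.1", " gi|129295 ", "ACCESSION P01013", "AAA68881. 1", "gi| 129295"]:
    print(repr(candidate), is_single_token(candidate))

flatfile = "  1 qikdllvsss tdldttlvlv\n 61 sfnvatlpae"
print(clean_bare_sequence(flatfile))
```

The three illegal examples all fail the single-token check because they contain a space inside the identifier.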



DGM UAB