
ClustalW Help



You can write any text to help you identify the results of the alignment.

You can choose the output format of the sequence alignment. The options are:

  1. Clustal or ALN: This is a self-explanatory alignment format.  The alignment is written out in blocks.  Identities are highlighted and (if you use a PAM 250 matrix) positions in the alignment where all of the residues are "similar" to each other (PAM 250 score of 8 or more) are indicated.

  2. GCG or MSF: Version 7 of the Wisconsin GCG package introduced a new multiple sequence format, MSF (Multiple Sequence Format).  It can be used as input to the GCG sequence editor or any of the GCG programs that make use of multiple alignments.  This format is only supported in version 7 of the GCG package or later.

  3. Phylip: This format can be used by the Phylip package of Joe Felsenstein (see the references/algorithms section for details of how to get it).  Phylip allows you to do a huge range of phylogenetic analyses (we just offer one method in this program) and is probably the most widely used set of programs for drawing trees. It also works on just about every computer you can think of, providing you have a decent Pascal compiler.

  4. PIR: This is the usual NBRF/PIR format with gaps indicated by hyphens ("-"). As we have stressed before, this format is EXACTLY compatible with the sequence input format.  Therefore you can read in these alignments again for profile alignments or for calculating phylogenetic trees.

  5. GDE

You can choose the order in which the sequences appear in the alignment. The options are:

  1. ALIGNED: the sequences are ordered according to the alignment scores, from most to least distant.

  2. INPUT: the sequences appear in the same order in which they were entered.



Can be 1 or 2 for proteins; 1 to 4 for DNA. Increase this to increase speed; decrease to improve sensitivity.

The number of diagonals around each "top" diagonal that are considered. Decrease for speed; increase for greater sensitivity.

The number of matching residues that must be found in order to introduce a gap. This should be larger than K-Tuple Size. This has little effect on speed or sensitivity.
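
As a rough illustration of why the k-tuple size trades sensitivity for speed, the sketch below counts the k-tuples (words of length k) that two sequences share. The function names `ktuples` and `shared_ktuples` are hypothetical, and this is only a simplified sketch of the idea, not the code the server runs.

```python
from collections import Counter

def ktuples(seq, k):
    """Return a Counter of all overlapping words of length k in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shared_ktuples(seq_a, seq_b, k):
    """Count the k-tuples common to both sequences (multiset intersection)."""
    common = ktuples(seq_a, k) & ktuples(seq_b, k)
    return sum(common.values())

a = "ACGTACGT"
b = "ACGTTCGT"
for k in (1, 2, 4):
    # Larger k means fewer, more specific matches to examine.
    print(k, shared_ktuples(a, b, k))
```

With a larger k there are fewer candidate matches to process, which is faster but can miss weak similarities.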



For protein comparisons, a weight matrix is used to differentially weight different pairs of aligned amino acids.  The default is the well-known Dayhoff PAM 250 matrix. We also offer a PAM 100 matrix and an identity matrix (all weights are the same for exact matches), or you can give the name of a file with your own matrix. You can also choose from these other series:

  1. Henikoff BLOSUM. These appear to be the best choice for similarity studies in databases (homologue searches).

  2. GONNET. These matrices are derived from the Dayhoff matrices, but they are more up to date and are based on larger data sets, so they appear to be more sensitive.

  3. IDENTITY MATRIX (ID). The score is 10 for two identical amino acids, and 0 in all other cases.
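
The identity matrix is the simplest of these to write down. The following minimal sketch scores two already aligned sequences with it; the function names are hypothetical, and skipping columns that contain a gap is an assumption made for this example (gaps are handled by the separate gap penalties).

```python
def identity_score(a, b):
    """Identity matrix: 10 for two identical residues, 0 otherwise."""
    return 10 if a.upper() == b.upper() else 0

def score_columns(row_a, row_b):
    """Sum identity scores over aligned columns.

    Assumption: columns where either sequence has a gap ("-") are skipped.
    """
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue
        total += identity_score(a, b)
    return total

print(score_columns("MKV-L", "MRV-L"))
```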

Reduce this to encourage gaps of all sizes; increase it to discourage them.  Terminal gaps are penalized the same as all others unless END GAPS is not selected.  BEWARE of making this too small (approx 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.

Penalty for the distance between gaps.



You will need to supply an alignment to use this option. The format of this alignment must be one of the following:

The method used is the NJ (Neighbor-Joining) method of Saitou and Nei. First, the distances (percentage of divergence) between all pairs of sequences in the multiple alignment are calculated; then, the NJ method is applied to the resulting distance matrix.
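
The first step, computing the divergence between all pairs of aligned sequences, can be sketched as follows. The names `divergence` and `distance_matrix` are hypothetical, distances are returned as fractions (multiply by 100 for a percentage), and ignoring columns where either sequence has a gap is an assumption of this sketch. The Neighbor-Joining step itself is not shown.

```python
from itertools import combinations

def divergence(row_a, row_b):
    """Fraction of compared columns at which two aligned sequences differ.

    Assumption: columns with a gap ("-") in either sequence are ignored.
    """
    compared = 0
    differing = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            differing += 1
    return differing / compared if compared else 0.0

def distance_matrix(alignment):
    """Map each unordered pair of sequence names to its divergence."""
    return {
        (p, q): divergence(alignment[p], alignment[q])
        for p, q in combinations(alignment, 2)
    }

alignment = {"s1": "ACGT-A", "s2": "ACGTTA", "s3": "ACCTTA"}
for pair, d in distance_matrix(alignment).items():
    print(pair, round(d, 3))
```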

You can choose one of the following tree formats with this option:

  1. Neighbor-Joining

  2. Phylip

  3. Distance

You will need a program capable of displaying trees, such as TreeView, in order to view them.

As sequences diverge, substitutions accumulate. It becomes increasingly likely that more than one substitution (as a result of a mutation) will have happened at a site where you observe just one difference now. This option allows you to use formulae developed by Motoo Kimura to correct for this effect. It has the effect of stretching long branches in trees while leaving short ones relatively untouched. The desired effect is to try and make distances proportional to time since divergence.
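
To make the stretching effect concrete, here is a small sketch using one widely cited form of Kimura's correction for protein distances, K = -ln(1 - D - D^2/5), where D is the observed fraction of differing sites. Whether the server applies exactly this formula (or a different one for DNA) is an assumption; the function name `kimura_protein_distance` is hypothetical.

```python
import math

def kimura_protein_distance(d):
    """Correct an observed divergence d (a fraction, 0 <= d < ~0.85).

    Uses K = -ln(1 - d - d^2 / 5); assumed here for illustration only.
    """
    inner = 1.0 - d - d * d / 5.0
    if inner <= 0.0:
        raise ValueError("observed divergence too large for this formula")
    return -math.log(inner)

for d in (0.05, 0.1, 0.3, 0.5):
    # The corrected distance grows faster than the observed one.
    print(d, round(kimura_protein_distance(d), 4))
```

Small distances barely change, while large ones are stretched noticeably, which matches the behaviour described above.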


UPLOAD: you can upload a file from your computer containing the sequences you want to align. All the sequences must be in the same file, and in one of these formats: NBRF/PIR, EMBL/SwissProt or FASTA (Pearson and Lipman, 1988). The sequences can be entered in upper or lower case. The symbols recognized for proteins are: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y, and for DNA/RNA: A, C, G, T and U. All other letters of the alphabet will be treated as X for proteins, or as N for DNA/RNA. Other symbols (spaces, numbers, ...) will be ignored except the hyphen "-", which can be used to specify a gap. This can be especially useful for two reasons: 1) you can fix the position of some gaps before doing the alignment; 2) the resulting alignment can be saved in NBRF format using hyphens for the gaps, so these alignments can be used as input to make phylogenetic trees.
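
The symbol rules above can be sketched as a small normalization function. The name `normalize_sequence` and its `is_protein` parameter are hypothetical; this only illustrates the stated rules and is not the server's own code.

```python
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEIC_SYMBOLS = set("ACGTU")

def normalize_sequence(raw, is_protein):
    """Apply the input rules: case-insensitive, unknown letters become
    X (protein) or N (DNA/RNA), hyphens are kept as gaps, and every
    other character is dropped."""
    allowed = PROTEIN_SYMBOLS if is_protein else NUCLEIC_SYMBOLS
    unknown = "X" if is_protein else "N"
    out = []
    for ch in raw.upper():
        if ch == "-":
            out.append("-")        # hyphen marks a gap
        elif ch in allowed:
            out.append(ch)         # recognized symbol
        elif "A" <= ch <= "Z":
            out.append(unknown)    # any other letter of the alphabet
        # spaces, digits and other symbols are ignored
    return "".join(out)

print(normalize_sequence("ac-gt 12bn", is_protein=False))
print(normalize_sequence("mkv*b-z", is_protein=True))
```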