|(1) Sequence comparison||(2) Nucleotide Diversity|
You can write any text to help you identify the results of the alignment.
You can choose either a full alignment (SLOW), which uses a stringent algorithm to build the phylogenetic tree, or a fast algorithm to create the alignment (FAST).
You can let the program detect the sequence type by leaving the default option (AUTOMATIC) or, for complex sequences, explicitly select protein (PROTEIN) or DNA (DNA). A sequence is treated as DNA when at least 85% of its letters are A, C, G, T or U.
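The 85% rule above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the program's actual code; the function name `guess_sequence_type` and the choice to count only alphabetic characters are assumptions.

```python
# Hypothetical helper illustrating the AUTOMATIC sequence-type rule.
DNA_LETTERS = set("ACGTU")

def guess_sequence_type(sequence, threshold=0.85):
    """Return "DNA" if at least `threshold` of the letters are A, C, G, T or U."""
    # Assumption: only alphabetic characters count as "letters".
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    dna_count = sum(1 for ch in letters if ch in DNA_LETTERS)
    return "DNA" if dna_count / len(letters) >= threshold else "PROTEIN"

# 18 of these 20 letters are nucleotide symbols (90%).
print(guess_sequence_type("ACGTACGTNNACGTACGTAC"))
# Only 5 of these 15 letters are A, C, G, T or U.
print(guess_sequence_type("MKVLAAGIVGLLLAQ"))
```

A short peptide that happens to be rich in A, C, G and T could be misclassified by such a rule, which is when selecting PROTEIN or DNA explicitly is useful.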
You can choose the format in which you want to obtain the sequence alignment. The options are:
Clustal or ALN: This is a self-explanatory alignment. The
alignment is written out in blocks. Identities are highlighted and
(if you use a PAM 250 matrix) positions in the alignment where all
of the residues are "similar" to each other (PAM 250 score of 8 or
more) are indicated.
GCG or MSF: In version 7 of the Wisconsin GCG package, a new
multiple sequence format was introduced. This is the MSF (Multiple
Sequence Format) format. It can be used as input to the GCG
sequence editor or any of the GCG programs that make use of multiple
alignments. THIS FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG
PACKAGE OR LATER.
Phylip: This format can be used by the Phylip package of
Joe Felsenstein (see the references/algorithms section for details
of how to get it). Phylip allows you to do a huge range of
phylogenetic analyses (we just offer one method in this program) and
is probably the most widely used set of programs for drawing trees.
It also works on just about every computer you can think of,
providing you have a decent Pascal compiler.
PIR: This is the usual NBRF/PIR format with gaps
indicated by hyphens ("-"). As we have stressed before, this format
is EXACTLY compatible with the sequence input format. Therefore you
can read in these alignments again for profile alignments or for
calculating phylogenetic trees.
You can decide in which order the sequences appear in the alignment. The options are:
ALIGNED: the sequences are ordered according to their score in the alignment, from the most to the least distant.
INPUT: the sequences appear in the same order in which the user entered them.
FAST PAIRWISE ALIGNMENT
The K-tuple size can be 1 or 2 for proteins; 1 to 4 for DNA.
Increase this to increase speed; decrease to improve sensitivity.
The number of diagonals around each "top" diagonal
that are considered. Decrease for speed; increase for greater
sensitivity.
Similarity scores may be expressed as raw scores (the number of identical residues minus a gap penalty for each gap) or as percentage scores.
The number of best diagonals in the imaginary dot-matrix plot that are considered. Decrease (must be greater than zero) to increase speed; increase to improve sensitivity.
The number of matching residues that must be found
in order to introduce a gap. This should be larger than K-Tuple
Size. This has little effect on speed or sensitivity.
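To make the idea of K-tuples and diagonals more concrete, here is a much simplified Python sketch. It only counts exact K-tuple matches on each diagonal of the dot-matrix plot and keeps the best ones. The function name `top_diagonals` and its parameters are hypothetical, and the window around each diagonal and the gap penalty are deliberately left out, so this is not the program's actual algorithm.

```python
from collections import Counter

def top_diagonals(seq_a, seq_b, ktuple=2, n_top=3):
    """Count k-tuple matches per diagonal and return the best-scoring diagonals."""
    # Index every k-tuple of seq_a by its start positions.
    positions = {}
    for i in range(len(seq_a) - ktuple + 1):
        positions.setdefault(seq_a[i:i + ktuple], []).append(i)

    # A match at (i, j) lies on diagonal i - j of the dot-matrix plot.
    hits = Counter()
    for j in range(len(seq_b) - ktuple + 1):
        for i in positions.get(seq_b[j:j + ktuple], []):
            hits[i - j] += 1

    # Keep only the n_top diagonals with the most matches.
    return hits.most_common(n_top)

print(top_diagonals("ACGTACGT", "TTACGTAC", ktuple=2, n_top=2))
```

A larger K-tuple produces fewer chance matches and a shorter position index, which is why increasing it speeds up the search at the cost of sensitivity.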
For protein comparisons, a weight matrix is
used to differentially weight different pairs of aligned amino
acids. The default is the well known Dayhoff PAM 250 matrix. We
also offer a PAM 100 matrix, an identity matrix (all weights are the
same for exact matches) or allow you to give the name of a file with
your own matrix. You can also choose one of these other series:
Henikoff BLOSUM. These seem to be the best matrices for similarity studies in databases (homologue searches).
GONNET. These matrices derive from the Dayhoff matrices, but they are more up to date and are based on larger data sets, so they seem to be more sensitive.
IDENTITY MATRIX (ID). The score is 10 for two identical amino acids and 0 in all other cases.
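As an illustration of how a weight matrix is applied, the identity matrix is simple enough to write out directly. The sketch below assumes two already aligned, gap-free sequences of equal length; the function names are hypothetical.

```python
def identity_score(residue_a, residue_b, match=10, mismatch=0):
    """Identity-matrix weight: 10 for identical residues, 0 otherwise."""
    return match if residue_a.upper() == residue_b.upper() else mismatch

def score_pair(seq_a, seq_b):
    """Sum the identity-matrix weights over two aligned, gap-free sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(identity_score(a, b) for a, b in zip(seq_a, seq_b))

# Three of the four positions are identical.
print(score_pair("MKVL", "MRVL"))
```

With PAM, BLOSUM or GONNET matrices the lookup would return a different weight for each pair of amino acids instead of a single match value.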
Reduce this to encourage gaps of all sizes;
increase it to discourage them. Terminal gaps are penalized the same
as all others, except when END GAPS is not selected. BEWARE of making this too small (approx 5 or so); if
the penalty is too small, the program may prefer to align each
sequence opposite one long gap.
Reduce this to encourage longer gaps; increase it to shorten them.
Penalty for the distance between gaps.
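The interplay between a penalty that affects gaps of all sizes and one that affects longer gaps can be sketched with the common "affine" gap cost. The formula and the parameter names `gap_open` and `gap_extend` are assumptions made for illustration; the exact formula used by the program is not given here.

```python
def gap_cost(length, gap_open, gap_extend):
    """Affine gap cost (assumed form): a fixed opening cost plus a per-position cost."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return gap_open + gap_extend * length

# With a high opening cost, one long gap is cheaper than several short ones.
print(gap_cost(6, gap_open=10.0, gap_extend=0.5))      # one gap of length 6
print(3 * gap_cost(2, gap_open=10.0, gap_extend=0.5))  # three gaps of length 2
```

Under this assumed form, lowering the opening cost makes every gap cheaper, while lowering the per-position cost mainly benefits long gaps.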
You need to supply an alignment to use this option. The alignment must be in one of the following formats:
NBRF / PIR
EMBL / SwissProt
GCG / MSF
The method used is the NJ (Neighbor-Joining) method of Saitou and Nei. First, the distances (percentage of divergence) between all pairs of sequences in the multiple alignment are calculated; then, the NJ method is applied to the resulting distance matrix.
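The first step, computing the percentage of divergence between every pair of aligned sequences, can be sketched as follows. The function names are hypothetical, and skipping columns where either sequence has a gap is an assumption; the Neighbor-Joining step itself is not shown.

```python
def percent_divergence(seq_a, seq_b):
    """Percentage of differing residues among columns where neither sequence has a gap."""
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gapped columns are skipped for this pair
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * differences / compared

def distance_matrix(aligned):
    """Square matrix of pairwise percent divergence for a list of aligned sequences."""
    n = len(aligned)
    return [[percent_divergence(aligned[i], aligned[j]) for j in range(n)]
            for i in range(n)]

alignment = ["ACGTACGT", "ACGTACGA", "ACG-ACTA"]
for row in distance_matrix(alignment):
    print(["%.1f" % d for d in row])
```

The resulting matrix is symmetric with zeros on the diagonal, which is the input a tree-building method such as NJ expects.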
You can choose one of the following tree formats with this option:
You will need a program capable of displaying this information, such as TreeView, in order to view these trees.
As sequences diverge,
substitutions accumulate. It becomes increasingly likely that more
than one substitution (as a result of a mutation) will have happened
at a site where you observe just one difference now. This option
allows you to use formulae developed by Motoo Kimura to correct for
this effect. It has the effect of stretching long branches in trees
while leaving short ones relatively untouched. The desired effect
is to try and make distances proportional to time since divergence.
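As an illustration of this kind of correction, the sketch below uses Kimura's empirical formula for protein distances, d = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing sites (not a percentage). Whether the program applies exactly this formula, or a different one for DNA, is an assumption here; the function name is hypothetical.

```python
import math

def kimura_protein_distance(p):
    """Kimura's empirical correction for proteins: d = -ln(1 - p - 0.2 * p**2)."""
    if p < 0.0:
        raise ValueError("p must be a non-negative fraction")
    inner = 1.0 - p - 0.2 * p * p
    if inner <= 0.0:
        # The formula is undefined for very divergent sequences (p above about 0.85).
        raise ValueError("observed divergence too large for this correction")
    return -math.log(inner)

# Short distances barely change; long distances are stretched.
for p in (0.05, 0.10, 0.40, 0.60):
    print(p, round(kimura_protein_distance(p), 4))
```

The corrected value is always at least as large as the observed one, and the ratio between them grows with p, which is the "stretching of long branches" described above.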
This option allows you to ignore all
alignment positions (columns) where there is a gap in any sequence.
This guarantees that "like" is compared with "like" in all distances
i.e. the same positions are used to calculate all distances. It
also means that the distances will be "metric". The disadvantage of
using this option is that you throw away much of the data if there
are many gaps. If the total number of gaps is small, it has little effect.
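Ignoring every column that contains a gap can be sketched like this. The function name `drop_gap_columns` is hypothetical, and the sketch assumes all sequences in the alignment have the same length.

```python
def drop_gap_columns(aligned):
    """Remove every alignment column in which at least one sequence has a gap."""
    if len({len(seq) for seq in aligned}) > 1:
        raise ValueError("aligned sequences must all have the same length")
    # zip(*aligned) iterates over the columns of the alignment.
    kept = [column for column in zip(*aligned) if "-" not in column]
    if not kept:
        return ["" for _ in aligned]
    # Transpose the kept columns back into one string per sequence.
    return ["".join(residues) for residues in zip(*kept)]

print(drop_gap_columns(["AC-GT", "ACTGT", "A-TGT"]))
```

In this small example two of the five columns are discarded, which shows how quickly data is lost when gaps are frequent.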
UPLOAD: you can upload a file from your computer containing the sequences you want to align. All the sequences must be in the same file, in one of these formats: NBRF/PIR, EMBL/SwissProt or FASTA (Pearson and Lipman, 1988). The sequences can be entered in upper or lower case. The symbols recognized for proteins are A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W and Y, and for DNA/RNA A, C, G, T and U. All other letters of the alphabet are treated as X for proteins or as N for DNA/RNA. Other symbols (spaces, numbers, ...) are ignored, except the hyphen "-", which can be used to specify a gap. This can be especially useful for two reasons: 1) you can fix the position of some gaps before doing the alignment; 2) the resulting alignment can be written in NBRF format, using hyphens for the gaps, so it can be used as input to build phylogenetic trees.
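The symbol handling described above can be sketched as a small normalization function. The function name is hypothetical, and converting everything to upper case is a choice made for this example; the program may store the sequences differently.

```python
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWY")
NUCLEOTIDE_SYMBOLS = set("ACGTU")

def normalize_sequence(raw, is_protein):
    """Apply the symbol rules: keep known letters and hyphens, map other letters to X or N."""
    allowed = PROTEIN_SYMBOLS if is_protein else NUCLEOTIDE_SYMBOLS
    unknown = "X" if is_protein else "N"
    cleaned = []
    for ch in raw.upper():
        if ch == "-":
            cleaned.append(ch)  # hyphens mark gaps and are kept
        elif "A" <= ch <= "Z":
            cleaned.append(ch if ch in allowed else unknown)
        # spaces, digits and any other symbols are ignored
    return "".join(cleaned)

print(normalize_sequence("acg t-1RY", is_protein=False))
print(normalize_sequence("mkb z*-", is_protein=True))
```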
and LIPMAN, 1988) FORMAT:
similar to FASTA format but immediately
not try to create files with this