Introduction to PAUP* 4.0
PAUP* 4.0 is the successor to PAUP 3.1, which was published in 1993 by David L. Swofford, currently at the School of Computational Science & Information Technology, Florida State University. The name PAUP stands for Phylogenetic Analysis Using Parsimony, because parsimony was the only optimality criterion employed at the time. The asterisk in the name PAUP* means "and other methods". PAUP* is now easily one of the most comprehensive phylogenetic analysis programs available, and we will spend a good deal of time learning how to use it.
PAUP* Home Page
The PAUP* Home Page is the best place to go for up-to-date information about program availability, known problems/workarounds, and help in the form of a FAQ and electronic forum. As of this writing, PAUP* is being sold by Sinauer Associates (price varies according to platform). While it is not a free program, you really do get a lot for your money compared to most other commercial software, as the next section is designed to illustrate.
What can PAUP* do?
PAUP* is capable of performing most of the types of phylogenetic analyses you might want to perform (as well as many you might not!). The following listing is not exhaustive, but is designed to give you an idea of what PAUP* can currently do:
- Algorithmic searching: Exhaustive, Branch-and-bound, Stepwise Addition, Neighbor-joining, Puzzling, UPGMA, Star Decomposition
- Heuristic searching: Nearest-neighbor Interchange (NNI), Subtree Pruning/Regrafting (SPR), Tree Bisection/Reconnection (TBR)
- Optimality criteria: parsimony, likelihood, minimum-evolution, least-squares
- Parsimony variants: Camin-Sokal, Wagner, Fitch, transversion, generalized (=weighted)
- Substitution models: JC69, F81, K80, F84, HKY85, GTR, logdet/paralinear
- Descriptive statistics: base frequencies, pairwise sequence comparisons
- Manipulating data scope: include/exclude characters, delete/restore taxa, partitions (characters and taxa)
- Statistical tests: KH test, homogeneity partition test, permutation tests, base frequency homogeneity test, likelihood ratio test of molecular clock
- Nodal support measures: jackknife, bootstrap
- Consensus methods: strict/semistrict/majority-rule/Adams consensus trees, agreement subtrees
- Trees: generation of random trees, tree-to-tree distances
- Other: Lake's invariants, plots of gamma distribution, likelihood surface check, ancestral state reconstruction, printing of trees
What can PAUP* not do?
Despite its completeness, there are a few things that PAUP* cannot do for you at the present time:
- PAUP* does not allow tree editing (like MacClade or TreeView)
- PAUP* is not able to do maximum likelihood analyses on amino acid sequences
- PAUP* does not provide codon models that allow you to take into account the codon structure of protein coding genes when analyzing nucleotide sequences (use PAML for this)
- PAUP* does not perform Bayesian analyses (we will use MrBayes later on for this)
- PAUP* (like almost all other phylogenetic analysis programs) assumes your sequences are already aligned; it will not align them for you, nor will it help you find sequences in GenBank or other databases.
In this and subsequent web pages, I will try to stick to the following typographical conventions:
- New terms will look like this
- Text that I want to emphasize will look like this
- Command names or portions of commands that you might type into a program such as PAUP* will look like this
- Keywords used in Nexus files will look like this
PAUP* is not really finished at this point. For the most part, this is not a problem since you can purchase and use it just like a finished product. The primary drawback of PAUP*'s unfinished status is that there is currently not a complete manual for the program. On the PAUP* Download Page you can find a PDF command summary and "Quick Start" tutorial; however, much of the explanatory portion of the manual is not present in any form. There are easy ways to obtain information from the program itself, however. Some of the tips listed below are concerned with getting the program to tell you what commands and command options are available.
Here are some tips to keep in mind while you use PAUP*. This list is not comprehensive; these are just some things that are not immediately apparent but which make your life easier once you know about them.
- A command line can be made visible on the Mac version, and is always apparent on all other versions of PAUP*. This may not sound like a tip, but having a command line allows you to explore many of the other tips described below.
- The help command provides a list of available commands. Often you can spot the command you need by looking at this list. Once you see a command name that looks promising, you can get a description of how to invoke the command like this:
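A minimal sketch, assuming the command of interest is hsearch (any command name from the help listing could be substituted):

```nexus
help hsearch;
```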
- The ? option works for all commands and provides a list of the options for that command as well as the current default settings for those options. This is extremely useful! For example, this command would list all the current likelihood settings:
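For the likelihood settings, which are controlled by the lset command, the query would presumably be:

```nexus
lset ?;
```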
- All PAUP* menu commands have command line equivalents. While the command line is not as fun or easy to use as the menu system, there are benefits to using the command line interface. For example, you can put all the commands for an analysis in the data file itself (see section on PAUP blocks, below), allowing you to have a complete record of what you did (often very useful when a reviewer asks you to be more specific about how you performed your analysis!). PAUP blocks are also useful for making sure certain settings are always invoked when you execute the data file.
- PAUP* uses the Nexus data file format. This is a fairly complex file format used by several programs that perform phylogenetic analyses (PAUP*, MacClade, TreeView, and Component, for example). It is described in more detail below, so here I will only point out that PAUP* can put your data in Nexus format automatically if your data are in one of several recognized formats already. This is done with the tonexus command as follows:
tonexus fromfile=mydata.txt format=text tofile=mydata.nex;
tonexus fromfile=mydata.msf format=gcg tofile=mydata.nex;
tonexus fromfile=mydata.dat format=phylip tofile=mydata.nex;
The first of these commands converts a data file (mydata.txt) in plain text format (each sequence on a separate line, with the name first followed by the sequence after one or more blank spaces) to Nexus format, storing it in a file named mydata.nex. The second line converts mydata.msf (GCG MSF format) into Nexus format, again storing the resulting file as mydata.nex. The third line converts a PHYLIP formatted data file (mydata.dat) to Nexus format.
Use the command tonexus ? to list other options, including other formats that can be converted.
- PAUP* allows you to easily include and exclude sites, making it possible to leave primer sites, introns, and dubiously aligned regions in the data file even though you do not wish to include them in analyses. You can also include or exclude entire classes of sites using the keywords all, gapped, missambig, constant, and uninf. For example, the command exclude gapped; would exclude all sites containing a gap for at least one taxon (sequence). Excluding only the 3rd codon position sites is just as easy: assuming that the first nucleotide site in each sequence corresponds to a 1st codon position, the following command would exclude all the 3rd position sites (the dot stands for the last nucleotide site in the sequence, and \3 means every third site, counting from the first site in the range):
exclude 3-.\3;
This is how you include again all the sites you have excluded:
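A one-line sketch, using the all keyword from the list of site classes given earlier:

```nexus
include all;
```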
- There are parallel commands for deleting and restoring OTUs. Don't be confused by the command names delete and restore: these act just like exclude and include except they act on OTUs and not characters (=sites). If the first, second and fourth taxa were named Thermus, Sulfolobus and Pyrococcus, you could tell PAUP* to ignore them in subsequent analyses using either of the two commands below:
delete 1 2 4;
delete Thermus Sulfolobus Pyrococcus;
This is how you would reinstate the taxa deleted above:
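Assuming every deleted taxon should come back, the all keyword is probably the shortest form; listing the taxa explicitly (restore 1 2 4;) should work as well:

```nexus
restore all;
```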
- For long runs, PAUP* reports progress once per minute in the form of a line written to the output buffer. To have PAUP* report once every two minutes, specify 120 seconds instead of the default of 60 seconds:
- By default, PAUP* does not save the output that is generated to a file. The output is stored in what is known as an output buffer. When this buffer becomes full, the oldest output is overwritten by newer output. Thus, one of the first things you should do when starting any serious analysis is to start a log file, using either the menu command or a command similar to this:
log file=myoutput.txt start replace;
- PAUP* almost always produces unrooted trees; however, the trees look rooted when PAUP* draws them! You can reroot the tree by specifying an outgroup OTU (or OTUs) either before or after analysis, but whether you root the tree or not doesn't change the fact that PAUP* searched on unrooted trees (the rooting is only used by PAUP* to draw the tree after analysis). Here's how to tell PAUP* to always draw trees with Giardia as the outgroup:
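A one-line sketch, assuming the data file contains a taxon named exactly Giardia:

```nexus
outgroup Giardia;
```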
The Nexus Data File Format
PAUP* uses a data file format known as Nexus. This file format is now shared among several programs. Nexus data files always begin with the characters #nexus but are otherwise organized into major units known as blocks. Some blocks are recognized by most of the programs using the Nexus file format, whereas other blocks are private blocks (recognized by only one program). A Nexus block has the following basic structure:
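A minimal outline, using a characters block as the example and leaving the commands inside the block out:

```nexus
#nexus

begin characters;
  ...
end;
```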
Note that the ellipsis (...) is never used in a Nexus data file; it is used here simply to indicate that some text has been omitted. The name of the Nexus block used as an example above is characters. Because Nexus data files are organized in named blocks, PAUP* and other programs are able to read blocks whose names they recognize and ignore blocks that are not recognized. This allows many different programs to use the same overall format without crashing when they encounter data they cannot interpret.
Blocks are in turn organized into semicolon-terminated commands. It is very important that you remember to terminate all commands with a semicolon. This is especially hard to remember for very long commands. PAUP* is pretty good about pointing out forgotten semicolons, but sometimes it doesn't realize you've left something out until some distance downstream, which can make the problem point difficult to find. Some common commands will be provided below in the description of the common blocks.
Comments can be placed in a Nexus file using square brackets. Comments can be placed anywhere, and they are used for many purposes. For example, you can effectively remove some of your data by commenting it out. You can also annotate your sequences using comments. For example, a comment like that below is useful for locating specific sites in your alignment:
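One possible layout, shown here with a made-up taxon name and sequence, is a pair of comment lines that act as a ruler above the first row of the matrix:

```nexus
[                10        20 ]
[        ....:....|....:....| ]
taxon_A  ACGTACGTACGTACGTACGT
```

Because everything inside the square brackets is ignored, the ruler has no effect on how the sequence below it is read.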
If you would like your comment printed out in the output when PAUP* executes the data file, just insert an exclamation point (!) as the first character inside the opening left square bracket:
[!This is the data file used for my dissertation]
Commonly-used Nexus blocks
Here is a list of common Nexus blocks and the most-common commands within these blocks. For a complete description of the Nexus file format, take a look at this paper:
Maddison, David R., Swofford, David L. and Maddison, Wayne P. 1997. NEXUS: an extensible file format for systematic information. Systematic Biology 46: 590-621.
The purpose of a Taxa block is to provide names for your taxa (i.e., sequences). You may not use a Taxa block very often, since you can also supply names for your taxa directly in the Data block (see below). Here is an example of a Taxa block.
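A sketch of such a block, with five taxon names chosen purely for illustration, might look like this:

```nexus
begin taxa;
  dimensions ntax=5;
  taxlabels Giardia Thermus Deinococcus Sulfolobus Haloferax;
end;
```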
Note that there are four commands in this example of a Taxa block. Can you find the terminating semicolon for each of them?
- the begin command giving the block's name
- the dimensions command giving the number of taxa
- the taxlabels command providing the actual taxon labels
- the end command, telling PAUP* that there are no more commands to process for this block
The Data block is the workhorse of Nexus blocks. This is where you place the actual sequence data, and, as mentioned above, this can also be where you define the names of your sequences. Here is an example of a Data block:
dimensions ntax=5 nchar=54;
format datatype=dna missing=? gap=-;
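The two commands above sit inside a begin data; ... end; pair, together with a matrix command. A smaller but complete Data block, using made-up taxon names and sequences (3 taxa, 12 sites), might look like this:

```nexus
begin data;
  dimensions ntax=3 nchar=12;
  format datatype=dna missing=? gap=-;
  matrix
    taxon_A  ACGTACGTAC-T
    taxon_B  ACGTTCGTACGT
    taxon_C  ACG?ACGTACGT
  ;
end;
```

The values given for ntax and nchar have to agree with the number of rows and the length of each sequence in the matrix.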
Some things to note in this example are:
- The dimensions command comes first in a Data block, and specifies the number of sequences (taxa; ntax) and number of sites (characters; nchar).
- The format command tells PAUP* what kind of data follow (dna, rna, protein, or standard), and provides the symbols used for missing data (?) and gaps (-). The standard data format is typically used for morphological data.
- The matrix command dominates the Data block, providing the sequences themselves (as well as the taxon names). Note the semicolon terminating the matrix command!!!
- You can use upper or lower case symbols for nucleotides
- You can place whitespace anywhere except inside a taxon name or keyword (e.g., data type = dna would cause problems because datatype should not have embedded whitespace).
- If you simply must have a space in one of your taxon names, either use an underscore character in place of the space (e.g., Ginkgo_biloba) or surround the taxon name in single quotes (e.g., 'Ginkgo biloba'). In either case, PAUP* will output the space in its output.
- One item missing from the format command in the example above but which is quite useful is something known as an equate list. The following format statement will cause all occurrences of T to be changed to C and all occurrences of G to be changed to A as the data are being read into PAUP*:
format datatype=dna missing=? gap=- equate="T=C G=A";
This is like telling PAUP* to do a search-and-replace operation on the sequences before reading them in, except that your original file remains intact. Be careful when using equate, because the replacement is case sensitive (i.e., equate="t=c g=a" would have had no effect if all the nucleotides were represented by upper case letters!).
- PAUP* recognizes all the standard ambiguity codes (e.g., R for purine, Y for pyrimidine, N for undetermined, etc.).
A Trees block has the following structure:
tree one = [&U] (1,2,(3,(4,5)));
tree two = [&U] (1,3,(5,(2,4)));
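Tree descriptions like these belong inside a begin trees; ... end; pair. A complete block, assuming five hypothetical taxon names in the translate command, might look like this:

```nexus
begin trees;
  translate
    1 taxon_A,
    2 taxon_B,
    3 taxon_C,
    4 taxon_D,
    5 taxon_E
  ;
  tree one = [&U] (1,2,(3,(4,5)));
  tree two = [&U] (1,3,(5,(2,4)));
end;
```

Every opening parenthesis in a tree description needs a matching closing parenthesis before the terminating semicolon.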
Some things to note in this example are:
- The translate command provides short alternatives to the taxon names, making the tree descriptions shorter (takes up fewer bytes of disk space).
- the translate command is not necessary however; it is ok to use the taxon names directly in the tree descriptions
- the tree command denotes the start of a tree description, which consists of a tree name (e.g., one and two are used here), followed by an equals sign and then the tree topology in the standard, parenthetical notation (often referred to as the Newick or New Hampshire format).
- The special comments consisting of an ampersand symbol followed by the letter U tell PAUP* to interpret the tree as being an unrooted tree.
- Files containing only the #nexus plus a trees block are called tree files
The only commands you need to know at this point from a sets block are the charset and the taxset commands.
begin sets;
charset trnL_intron = 562-4226;
taxset gnetales = Ephedra Gnetum Welwitschia;
end;
This sets block defines both a set of characters (in this case the sites comprising the trnL intron) and a set of taxa (consisting of the three genera in the seed plant order Gnetales: Ephedra, Gnetum and Welwitschia). We could have used the taxon numbers for the taxset definition (e.g., taxset gnetales = 1-3;) but using the actual names is clearer and less prone to error (just think of what might happen if you decided to reorder your sequences!). These definitions may be used in other blocks. A common use is in commands placed inside a paup block (see below) or typed directly at the PAUP* command prompt.
There is only one command I will introduce from the assumptions block (although there are a number of others that exist). The exset command (the word exset stands for exclusion set) is useful for creating a set of characters that are automatically excluded whenever the data file is executed. Given the following block:
exset* badsites = 1 5 47-.;
PAUP* would automatically exclude characters (i.e., sites) 1, 5, and 47 through the end of the sequence. It is the asterisk after the keyword exset that denotes this as the default exclusion set. If you left out the asterisk, PAUP* would define the exclusion set but would not automatically exclude these sites as the data file was being executed.
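Written out in full, and assuming nothing beyond the usual begin/end wrapper, the assumptions block described here would be:

```nexus
begin assumptions;
  exset* badsites = 1 5 47-.;
end;
```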
Paup blocks provide a way to give PAUP* commands from within a data file itself. Any command you can type at the command prompt or perform using menu commands you can place in the data file. This allows you to specify an entire analysis right in the data file. For any serious analysis, I always run PAUP* using a paup block. That way I know exactly what I did for a given analysis several days or weeks in the future. Paup blocks are also a handy way to perform certain commands every time the data file is executed. For example, you can set up your favorite likelihood substitution model, delete certain taxa or exclude certain sites from a paup block located just after your data block. Here is an example of a typical paup block:
log file=myoutput.txt start replace;
lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky;
hsearch swap=tbr addseq=random nreps=100 start=stepwise;
describe 1 / plot=phylogram;
savetrees file=mytrees.tre brlens;
Here is what each line does (but don't worry too much about this since we will be talking much more about individual commands later in lab):
- The log command starts a log file (the file will be called myoutput.txt and will be overwritten if it already exists)
- The outgroup command specifies that the resulting trees should be rooted between Ephedra and everything else (this just affects the appearance of the tree when drawn)
- The set command changes the optimality criterion from the default (parsimony) to maximum likelihood
- The lset command sets up PAUP* so that the HKY85 model will be used (number of substitution rates is 2, empirical base frequencies, rates are homogeneous across sites, estimate the transition/transversion ratio, and use the HKY model rather than the other, similar F84 model)
- The hsearch command causes PAUP* to conduct 100 heuristic searches (each beginning from a different, random starting tree); each search will start with a stepwise addition tree using random addition of taxa, and this starting tree will be rearranged using the tree bisection/reconnection branch swapping method
- The describe command produces a depiction of the tree (rooted at the specified outgroup) on the output (and in the log file, since we opened a log file earlier); the tree will be shown as a phylogram, which means branch lengths will appear proportional to the average number of nucleotide substitutions per site that were inferred for that branch.
- The savetrees command saves the best tree found during the search (this is quite important and easy to forget to do!). The brlens keyword tells PAUP* to save branch length information along with the tree topology.
- The log command stops the logging of output to the file myoutput.txt
- The quit command causes PAUP* to quit running; if you left out this command, PAUP* would remain running at this point, allowing you to issue other commands
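Putting the nine commands described in this list together gives a block along the following lines. The outgroup, set, log stop, and quit lines are reconstructed from the descriptions above rather than copied from an original listing, so treat them as a sketch:

```nexus
begin paup;
  log file=myoutput.txt start replace;
  outgroup Ephedra;
  set criterion=likelihood;
  lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky;
  hsearch swap=tbr addseq=random nreps=100 start=stepwise;
  describe 1 / plot=phylogram;
  savetrees file=mytrees.tre brlens;
  log stop;
  quit;
end;
```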
Note that because PAUP* ignores blocks whose names it does not recognize, you can easily "comment out" a paup block by simply adding a character to its name. For example, adding an underscore, so that the block starts with begin _paup; rather than begin paup;, is enough to cause PAUP* to completely ignore this paup block. This is handy because it allows you to create multiple paup blocks for different purposes and turn them off and on whenever you need them.
You can also "comment out" a portion of a paup block using the leave command. For example, in this paup block, PAUP* will be set up for doing a likelihood analysis but will not actually conduct the search; the leave command causes PAUP* to exit the block early:
log file=myoutput.txt start replace;
lset nst=2 basefreq=empir rates=equal tratio=estim variant=hky;
leave;
hsearch swap=tbr addseq=random nreps=100 start=stepwise;
describe 1 / plot=phylogram;
savetrees file=mytrees.tre brlens;
Today's lab exercise
First, a note about characters blocks versus data blocks: the characters block is essentially a new and improved version of the data block. Feel free to use either one, but be aware that programs such as PAUP* may eventually stop using the data block since the characters block accomplishes the same thing and has features missing in the data block. To convert a data block to a characters block, just change the block name and add the keyword newtaxa to the dimensions command just before the keyword ntax. This tells PAUP* that you will be defining the names of your taxa in the characters block itself (rather than in a preceding taxa block).
Questions that should be answered (or exercises that you should do on your own) appear in this style. There is no need to turn in your answers to these exercises. It is up to you to make sure you are comfortable with this material. Please ask questions if anything is unclear. While it is possible to do these exercises outside of the scheduled lab time, working through them in lab is better because we are here to help with questions that arise.
First, create a folder with a unique name (e.g., based on your own name) in the My Documents folder.
Copy the angio35.txt file from the data folder into your own newly-created folder. (If you are not in the computer lab, you can download the file by right-clicking here and using your browser's Save Target As... menu option)
Start PAUP* but be careful not to execute the angio35.txt file (it is not yet in Nexus format). Do open the file in edit mode (using File > Open... and clicking on the Edit radio button before selecting the name of the file to open) and note that it is composed of 35 DNA sequences. These are rbcL gene sequences from various green plants. The important thing to notice is that the format is quite simple: each line consists of a taxon name followed by at least one blank space, which is followed by the sequence for that taxon. Note that the blank space is important: taxon names cannot contain embedded spaces, because spaces are used to separate taxon names from the corresponding sequences.
Now type in the following command:
tonexus from=angio35.txt to=angio35.nex datatype=nucleotide format=text;
After the conversion, the file angio35.nex should be present. Open this Nexus file in edit mode to see what PAUP* did to convert the original file to Nexus format. Do not execute the file just yet because there are some additions we need to make before it is ready for analyzing.
Create an assumptions block containing a default exclusion set that excludes the following sites automatically whenever the data file is executed. This should be added to the bottom of the newly-created Nexus file (i.e., after the data). This may be most easily done using PAUP*'s built-in editor, although you may use any editor you choose (just remember to save the file as plain text).
exset * unused = 1-41 234-241 246 506-511 555 681-689 1393-1399 1797-1855 1856-1884 4754-4811;
These numbers represent nucleotide sites that either are missing a lot of data or are difficult to align. The name I gave to this exclusion set is unused, but you could name it anything you like. The asterisk tells PAUP* that you want this exset applied (i.e. you want these sites excluded) every time the file is executed.
Create a sets block comprising the following three charset commands:
- The first charset should be named 18S and include sites 1 through 1855
- The second charset should be named rbcL and include sites 1856 through 3283
- The third charset should be named atpB and include sites 3284 through 4811
This block should be placed after the assumptions block. Look at the description above of the sets block and try to do this part on your own.
Now let's execute the data file. Use File -> Open... from the main menu to execute your new angio35.nex file. If your assumptions block is correct, the output should include a statement saying that 219 characters have been excluded. If you set up your sets block correctly, you should be able to enter the commands exclude all; and include rbcL; and get no errors. In addition, PAUP* should tell you that 4592 characters were excluded (as a result of the exclude all command) and 1428 were re-included (as a result of the include rbcL command). For the rest of the exercise, we will be working with the data from all 3 genes, so re-include the 18S and atpB data:
include 18S atpB;
PAUP* should now say that there are a total of 4811 included characters.
The first item of business in starting an analysis in PAUP* is to begin logging the output to a file. The following command will begin saving all output to the file output.txt. Note that we have chosen to automatically replace the file if it already exists. If you are nervous about this (and would rather have PAUP* ask before overwriting an existing file), either leave off the replace keyword or substitute append, which tells PAUP* to simply add new output to the end of the file if it already exists.
log file=output.txt start replace;
Type set ? to get a listing of the general settings. PAUP* has four "settings" commands: set for general settings; pset for settings specifically related to parsimony; lset for settings specifically related to likelihood; and dset for settings specifically related to distance methods.
From the output of the set command, can you determine which optimality criterion PAUP* would use if we were to do a search at this point?
To perform a parsimony search, first try the alltrees command. This command asks PAUP* to calculate the optimality criterion for every possible tree
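The command takes no required options, so typing it at the prompt is enough:

```nexus
alltrees;
```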
Did PAUP* allow you to perform an exhaustive search for 35 taxa?
Now try heuristic searching. This approach does not attempt to look at all possible trees, but instead only looks at trees that are in the realm of possibility (which can still be a lot of trees!):
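A minimal sketch, assuming the current default search settings are acceptable (options such as swap= or nreps= could be added as in the paup block example earlier):

```nexus
hsearch;
```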
The search progress will be displayed in a dialog box. When the button says Close rather than Stop, take a look at the numbers summarizing this search. What is the parsimony score of the best tree found during the search? (Write down this score somewhere for later reference.) How many trees were examined (look at # Rearrangements tried)?
Now you probably want to take a look at the tree that PAUP* found and is now holding in memory. First, however, choose an outgroup taxon so that the (unrooted) tree will be drawn in a way that looks like it is rooted in a reasonable place, say between the gymnosperms (first 7 taxa) and angiosperms (remaining taxa):
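Assuming the seven gymnosperms really are taxa 1 through 7 in your file, a range of taxon numbers can be given to the outgroup command:

```nexus
outgroup 1-7;
```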
To make the tree appear to flow downward, which is more pleasing to the eye, tell PAUP* that you would like to use the tree order "right" (this is also commonly known as "ladderizing right"):
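The tree order is one of the general settings, so it would presumably be changed with the set command:

```nexus
set torder=right;
```

A tree-display command such as showtrees 1; issued afterwards should then draw the tree in this order.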
(Note that for presentations, the audience usually can see the top of the screen better than the bottom, so ladderizing right is wise if you want them to clearly see the base of the tree.)
Before doing anything else, we should save this tree in a file so that it will be available later, perhaps for viewing or printing in TreeView. Let's call the treefile pars.tre. The brlens keyword in the command below tells PAUP* that you want to save the branch lengths as well as just the tree topology (almost always a good option to include):
savetrees file=pars.tre brlens;
You may have noticed that PAUP* found 5 most-parsimonious trees. These 5 trees are all indistinguishable using the parsimony criterion. Let's now use the likelihood criterion (which you will learn about in upcoming lectures) to evaluate these 5 trees:
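Assuming the five parsimony trees are still in memory, switching the criterion and then scoring every tree would presumably look like this:

```nexus
set criterion=likelihood;
lscores all;
```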
These commands ask PAUP* to simply evaluate the likelihood score of the trees in memory. Note that because we arrived at these trees using parsimony, it is quite possible that none of these trees represents the maximum likelihood tree. That is, we may be able to find better trees under the likelihood criterion if we performed a search using the likelihood criterion. (We will not actually perform a likelihood search, as it would take at least half an hour using the default settings for this data set.) What is the likelihood score of the best tree? (As for parsimony, write this number down for later comparison.) Is the likelihood score the same for all 5 trees? Which tree is best? Important: PAUP* reports the negative of the natural logarithm of the likelihood score. This means that smaller numbers are better, as smaller numbers represent higher likelihoods.
Next, we will obtain a neighbor-joining tree. Neighbor-joining (NJ for short), recall, is one of the clustering methods: that is, it uses an optimality criterion (the minimum evolution criterion) at each step of the algorithm, but in the end produces a tree without actually examining many trees (thus only approximating a minimum evolution search, and not qualifying as an actual optimality criterion method):
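A minimal sketch, assuming the default distance settings are left unchanged:

```nexus
nj;
```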
Let's see how the NJ tree compares to the tree found by parsimony:
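One way to do this, assuming the NJ tree is now the only tree in memory, is to score it under both criteria and compare the results with the numbers you wrote down earlier:

```nexus
pscores all;
lscores all;
```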
According to the parsimony criterion, is the NJ tree better than any of the trees found by parsimony? According to the likelihood criterion, is the NJ tree better than the best tree you have found thus far? Can you say definitively whether the NJ tree is better (according to the likelihood criterion) than the maximum likelihood tree?
You may have noticed that PAUP* does not let you copy text from the output window. It will, however, make a copy of the text currently displayed in the output window and put this in an editor window. Choose Edit -> Edit Display Buffer from the main menu. You can now cut/copy/paste text from this window to other applications.
That's all for today. The only thing left to do is to close the log file you opened and quit PAUP*:
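Both steps use commands already described in the paup block example:

```nexus
log stop;
quit;
```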