Lab 26 - PAUP* IV - Confidence

Zool 575 Introduction to Biosystematics, (Sikes) Winter 2006

This lab, and many of the subsequent labs, will not be graded. I will be available during lab to answer questions. You can complete this lab at your leisure but will need to use the computers in BI 182 in order to use PAUP*.

This is a straightforward all-PAUP* lab so it should be trouble-free and simple to complete. You will learn the commands to create and view consensus trees and to conduct a bootstrap search. You will also be expected to start using the PAUP* command reference manual (Cmd_ref_v2.pdf) to learn how to explore the capabilities of PAUP*.

1. Open PAUP* and execute the primate-mtDNA.nex file

The primate-mtDNA.nex file is in the PAUP* folder inside the SAMPLE NEXUS FILES folder. First open the file in edit mode (in the open-file dialog box, look for a little button that opens the file in edit mode rather than executing it directly) and look at the data. ALWAYS look at your datafiles to see what you are working with. When satisfied, execute the file.

You can use your own dataset instead if you prefer but this lab is written to work best with the primate datafile.

The primate dataset is too large to run quickly, so you will exclude most of the data. Do this by typing the command:

exclude 200-898;

This tells PAUP* to ignore characters 200-898, leaving only the first 199 characters to be analyzed.

2. Use PAUP* to find all Most Parsimonious Trees & display consensus trees of them

Type

hsearch

and PAUP* will do a quick parsimony search on your dataset. When it is done, note how many trees were found, and then do another search. Go to the command reference manual (Cmd_ref_v2.pdf, inside the paupwin32 folder in the Program Files folder) and find the description of the hsearch command. Determine how you can ask PAUP* to use Random as the order of the taxon addition sequence for the starting tree, then type the commands to get PAUP* to do a more rigorous hsearch using random addition sequence for 100 replicates.

Did the more rigorous search find the same or a different number of trees?

View the first tree by typing:

describetrees 1/plot=phylogram

What are the Consistency Index (CI) and Retention Index (RI) for this tree?

Note that there are multiple most parsimonious trees. From the last lab you may recall that there was only one Maximum Likelihood tree. You may also wonder whether ML, like NJ, is being too decisive in reporting only 1 tree when parsimony says there are a number of equally good trees. Why is parsimony bound to find more equally good trees than ML (recall from lecture the difference in the type of number that each method uses as its optimality criterion)?
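The reason is easy to see with a toy example: a parsimony score is an integer (a count of steps), so different trees often tie exactly, whereas a likelihood is a real number, so exact ties among distinct trees are essentially never observed. A minimal sketch with invented scores (these are not values from the primate dataset):

```python
# Invented scores for five candidate trees, for illustration only.
parsimony_steps = [530, 528, 528, 531, 528]          # integers: exact ties are common
log_likelihoods = [-2471.31, -2470.99, -2471.30, -2472.11, -2470.98]

best_steps = min(parsimony_steps)                    # parsimony minimizes step count
mp_best = [i for i, s in enumerate(parsimony_steps) if s == best_steps]

best_lnl = max(log_likelihoods)                      # ML maximizes log-likelihood
ml_best = [i for i, s in enumerate(log_likelihoods) if s == best_lnl]

print(len(mp_best))  # 3 -- three trees tie on the discrete step count
print(len(ml_best))  # 1 -- a single tree has the best real-valued score
```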

To see a strict consensus tree of these trees, type:

contree

To see a 50% majority-rule consensus, type:

contree all/majrule;

To see an 80% majority-rule consensus, type:

contree all/majrule percent=80;

Now go to the command reference manual (Cmd_ref_v2.pdf, inside the paupwin32 folder in the Program Files folder) and find the description of the contree command. Look for how you would ask PAUP* to display a majority-rule consensus tree with compatible subgroups; this will be an option under the contree command. [NOTE: you can get the options for any command in PAUP* by typing the command name, e.g. contree, followed by a ?, but this help feature does not provide the written explanations of the options that the PDF manual does.] Once you have determined the option, add it to the end of the previous command, change to percent=50, and ask PAUP* to display a majority-rule tree with compatible groups. What grouping is shown as being present in fewer than 50% of the trees?

To understand what PAUP* is showing you, try this: type

showtree all

and PAUP* will display all 7 trees. Look at each one and count how many have (Homo + Pan) as sisters, how many have (Homo + Gorilla) as sisters, and how many have (Gorilla + Pan) as sisters. Now divide each count by 7; the largest of the resulting fractions should equal the value PAUP* displayed on the majority-rule consensus tree with compatible subgroups.
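The arithmetic PAUP* is doing here can be mimicked by hand. The sketch below counts how often each sister pairing appears across 7 trees and converts the counts to percentages; the 4/2/1 split is hypothetical, so substitute the counts from your own showtree output:

```python
from collections import Counter

# Each "tree" is reduced to the sister pairing it contains among Homo,
# Pan, and Gorilla. The 4/2/1 split is invented for illustration.
trees = ["Homo+Pan"] * 4 + ["Homo+Gorilla"] * 2 + ["Gorilla+Pan"] * 1

counts = Counter(trees)
percent = {pair: 100 * n / len(trees) for pair, n in counts.items()}
for pair, pct in sorted(percent.items(), key=lambda kv: -kv[1]):
    print(f"{pair}: {pct:.1f}%")

# The most frequent pairing is the one a majority-rule consensus tree
# with compatible subgroups would label on that branch.
```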

3. Use PAUP* to bootstrap the dataset

To conduct a bootstrap with the default settings, type:

bootstrap

But note what the default settings are - these are important! What type of branch swapping is used? How are the starting trees obtained? How many bootstrap reps are performed?

Now convince yourself that increasing the number of bootstrap replicates increases the precision of the branch support values. Do this by conducting the following analyses, noting the values each time. The first few results should show variation among the values for each branch, but as the number of replicates increases the values will stop changing:

bootstrap nreps = 200

bootstrap nreps = 500

bootstrap nreps = 1000

bootstrap nreps = 10000

bootstrap nreps = 20000

Did the values stabilize? This is how one should proceed when bootstrapping (if there is sufficient computer power and time). People often do a simple 100-replicate search and stop because that is what everyone else does, but all datasets are unique and some will not yield precise support values with only 100 replicates. It is best to run successively larger numbers of replicates until the values stabilize. That said, you will note that the values are all very close; precision only really matters for branches near whatever support "cutoff" you deem significant (often a minimum of 70%, although recent studies have suggested that 85% is closer to our goal of a 95% probability of being correct).
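The mechanics behind each replicate are simple to sketch: the bootstrap builds a pseudoreplicate data matrix by drawing alignment columns with replacement and then reanalyzes it. A minimal illustration (the 10-column matrix size is invented; PAUP* resamples whatever characters are currently included):

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the example is reproducible
n_chars = 10    # pretend the included matrix has 10 columns (invented size)

# One bootstrap pseudoreplicate: draw n_chars column indices with replacement.
replicate = [random.randrange(n_chars) for _ in range(n_chars)]
weights = Counter(replicate)  # how many times each column was drawn
print(dict(weights))

# Some columns are typically drawn more than once and others not at all;
# this resampling is the source of variation among replicates, and more
# replicates average it out, which is why support values stabilize.
```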

It would be nice to compare the MP bootstrap tree to a likelihood bootstrap tree, but even if we fix the parameters and limit the dataset to only 199 characters, doing 100 ML bootstrap replicates can take about an hour. You can see why people consider this a major drawback of Maximum Likelihood: imagine if you had a much larger dataset. Doing 100 bootstrap replicates could take months! [See the Appendix at the bottom of the page for the ML bootstrap tree.]

Instead you can see the effects of different signals in the data. Look again at the datafile - note that below the data matrix there is this block:

begin assumptions;
charset coding = 2-457 660-896;
charset noncoding = 1 458-659 897-898;
charset 1stpos = 2-457\3 660-896\3;
charset 2ndpos = 3-457\3 661-896\3;
charset 3rdpos = 4-457\3 662-.\3;

exset coding = noncoding;
exset noncoding = coding;
end;

Study this block in case you want to set one up for your own data (of course, you will need to know which regions are coding and noncoding and where the 1st codon position is).

These charsets (character sets) allow you to explore the signal in different parts of your data: you can exclude and include entire charsets and analyze them separately.
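The backslash notation in those charset definitions is a step size: 2-457\3 means every 3rd site from 2 through 457, i.e. the 1st codon positions of the first coding region. A sketch of how those selections translate, using only the first range of each charset:

```python
# NEXUS "2-457\3" selects sites 2, 5, 8, ..., 455 (1-based, inclusive,
# step 3). Python's range() is half-open, so the stop value becomes 458
# to keep site 457 inside the window.
first_pos = list(range(2, 458, 3))   # 2-457\3  ->  2, 5, ..., 455
second_pos = list(range(3, 458, 3))  # 3-457\3  ->  3, 6, ..., 456
third_pos = list(range(4, 458, 3))   # 4-457\3  ->  4, 7, ..., 457

print(first_pos[:4])  # [2, 5, 8, 11]
print(third_pos[-1])  # 457

# The three codon-position sets tile the coding region without overlap.
assert not (set(first_pos) & set(second_pos))
assert not (set(second_pos) & set(third_pos))
```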

To begin, try excluding all characters:

exclude all

then include just the 1st positions:

include 1stpos

and perform a standard parsimony bootstrap with the default settings.

Now exclude all again and then include just the 2nd positions:

exclude all
include 2ndpos
bootstrap

Do the bootstrap values change significantly? Recall that 1st and 2nd codon positions change much more slowly than 3rd codon positions. Also recall that for distantly related OTUs the 3rd codon position sites could be saturated and of little use. But the converse is also true: for closely related OTUs the 3rd codon positions may hold more information than the 1st or 2nd positions. See if this is the case here:

exclude all
include 3rdpos
bootstrap

Are the bootstrap values generally higher or lower when you restrict the dataset to only the 3rd positions than when you restrict it to only the 1st or 2nd positions? How about when all the data are included?

If at any point you forget what characters are included or excluded you can type

cstatus

and PAUP* will tell you how many characters are ready for analysis. Do this and note that PAUP* also lists the number of parsimony-informative characters. This is an important number! Because molecular data contain many constant characters, you cannot tell how much real information you have by counting all the sites; you can only tell by counting the variable characters (and some variable characters are uninformative too!).
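The distinction cstatus is drawing can be reproduced by hand: a site is parsimony-informative only when at least two character states are each shared by at least two taxa. A small sketch over invented columns (one state per taxon, gaps ignored):

```python
from collections import Counter

def classify(column):
    """Classify one alignment column for parsimony (gaps and ? ignored)."""
    counts = Counter(base for base in column if base not in "-?")
    if len(counts) <= 1:
        return "constant"
    # Parsimony-informative: at least two states, each in at least two taxa.
    if sum(1 for n in counts.values() if n >= 2) >= 2:
        return "parsimony-informative"
    return "variable but uninformative"

print(classify("AAAA"))  # constant
print(classify("AAAT"))  # variable but uninformative (a lone autapomorphy)
print(classify("AATT"))  # parsimony-informative
```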

APPENDIX - ML bootstrap tree using the best-fitting model (HKY+I+G). Note the time used (50 minutes); also note that the great ape clade has no resolution and is itself only weakly monophyletic. This is probably because there was not enough information in the 199 characters that we left in the data. I tested this idea by including all the characters and letting an ML bootstrap run overnight (2 hours to complete); see the results below:

199 characters

100 bootstrap replicates completed
Time used = 00:50:09 (CPU time = 00:34:24.6)

Bootstrap 50% majority-rule consensus tree


898 Characters (Compare this to 100 bootstrap reps using all characters and Parsimony)

As suspected, with all 898 sites included the bootstrap support is much higher than with only 199 sites. Note, however, that the model used, HKY+I+G, was selected using only the 199 sites, so this model and its parameter values may not be ideal for the full dataset. To find the best-fitting model for the entire dataset we would have to run it through ModelTest.

100 bootstrap replicates completed
Time used = 02:05:50 (CPU time = 01:39:59.8)

Bootstrap 50% majority-rule consensus tree