Lab 23 - Model Selection - Bayes Factors

Zool 575 Introduction to Biosystematics, (Sikes) Winter 2006

This lab, and many of the subsequent labs, will not be graded. You can complete this lab at your leisure but will need to use the computers in BI 182 in order to use MrBayes (it is a free download and you can install it on any Unix, PC, or Mac computer on which you have permission to do so. The program is available here: http://mrbayes.csit.fsu.edu/index.php).

Introduction to MrBayes & Bayes Factors

Because we will not cover the theory behind Bayesian analysis until later in the course you will only learn the very basics of analysis today so that you can start work on your projects.

An overview of the procedure is as follows (but see further below for the details on how to do this):

1. Download the latest version of MrBayes (v3.1.2) in your "my documents" folder. Make two folders into which you will store all the output of two different analyses using MrBayes. Name the first "M418 no G" name the second "M418 w G"

2. You will put a datafile in each of the folders you made - each with instructions to use a different model (Mkv vs. Mkv+G).

3. You will run each analysis, get the results (a number of files will be made during each analysis - put all of these into the correct folder) and determine which of the two models fits the data better. In this case the difference is simply "Is a Gamma correction for Among Character Rate Heterogeneity needed for these data?

How to run MrBayes on the PCs in the lab

1. Obtain MrBayes and place your datafile

The datafile needs to be in the same folder / directory as the program MrBayes

You can get the datafile here

You can get MrBayes at http://mrbayes.csit.fsu.edu/index.php

Ask your TA if you need help with downloading and installing MrBayes.

Save the datafile to the MY DOCUMENTS folder by right-clicking on the link above and choosing "save as" and create a folder for it with the name M418 no G

This folder with hold the output of the analysis after it is done - you will move the files to this folder for storage. The original datafile should always be in this folder also, however, to do the analysis you will need to copy the datafile out of this folder and into the MrBayes folder.

2. Verify it is downloaded

Leave Internet Explorer and navigate to the MY DOCUMENTS folder.

You should see the folder you named - open this folder and verify that the data file M418.nex is there.

You don't need to do anything here except verify that the file downloaded OK and to learn where it is (this will be useful later).

3. Use PAUP* to open in Edit mode the data file

Open the data file you downloaded in PAUP* in edit mode and study it to see what it contains - you will see a dataset of morphological characters for orders of insects.

Below the data matrix you will see a MrBayes block that should look like this:

BEGIN MRBAYES;

log start filename=M418.log.txt replace;


lset nst=1 coding=variable;



showmodel;
mcmc ngen=100000 samplefreq=100 printfreq=1000 nchains=4 savebrlens=yes;
END;

This MrBayes block contains instructions that the program MrBayes will follow when you execute this file inside MrBayes. Note the comments on these instructions below:

log start

This tells MrBayes to start a logfile that will contain all the output to the screen - thereby saving a record of your analysis.

lset nst=1 coding=variable

This tells MrBayes the process model you will invoke. These instructions specify the morphology model (one substitution type, nst=1) and tells the program that there are no constant characters, which it would expect if these were DNA data (coding=variable).

mcmc ngen=100000 samplefreq=100 printfreq=1000 nchains=4 savebrlens=yes;

This tells MrBayes to start the MCMC chain which will search through tree/parameter space for trees and parameters that are most probable given the data and the assumptions of the model (and the prior probabilities). There are some important parts to this MCMC command:

The ngen command tells MrBayes how long it will run. The current setting of 100000 is FAR too short for a real analysis - I set it this short so it will run quickly for your lab exercise. When you do this analysis on your own you need to increase ngen to 2000000 (2 million steps).

The samplefreq command tells MrBayes how often the MCMC chain should be sampled. This is typically set at 100 but more rigorous analyses use 1000 - the number specifies how many MCMC steps will be taken between samples of trees. Thus, if we sample 1 tree for every 100 steps and the chain runs for 100000 steps then we should have saved 500 trees when done. (Again, this is far too few - much better to run the chain 2 million steps and sample 1 every 1000 trees so that you end up with 2000 trees at the end).

4. Close data file and copy to MrBayes folder

Now that you have looked inside the datafile and have some idea of what is in it, you need to transfer a copy of the datafile to the MrBayes folder. You can do this by using the right mouse button to 'copy' and then 'paste' the datafile into the MrBayes folder (which should be in the folder you downloaded it into).

5. Open MrBayes and execute the datafile

Double-click on the MrBayes program icon. It should launch and wait for further instructions. Type the following:

exe M418.nex

which should execute your datafile - the MrBayes display will show that it is reading the datafile and performing the commands therein. A log file will be created and the MCMC chain will run. You will see log-likelihood scores appear on the screen. Note that these scores improve (get larger) as the run proceeds.

Also note the value for the average standard deviation of the split frequencies. Once this value gets below around 0.5 the chain is probably 'burned in' which means that all samples AFTER that point are valid estimates of the parameters. (see next section below on 'burnin' ).

6. Summarizing your results

When the run has completed MrBayes will ask you if you want to continue - type 'n' for no and then type the following command:

sump burnin=500

This command asks MrBayes to summarize the parameters of the run and to ignore the first 500 samples (the "burnin"). Since you ran only 100,000 steps and sampled once every 100 steps you therefore have 1000 samples. (typically you would run at least 2 million steps and save once every 1000 samples for at least 2000 samples).

The results that are displayed after you enter this command will include a small table that looks like this:

Estimated marginal likelihoods for runs sampled in files
"M418.NX.run1.p" and "M418.NX.run2.p":
(Use the harmonic mean for Bayes factor comparisons of models)

Run Arithmetic mean Harmonic mean
--------------------------------------
1 -1277.31 -1301.67
2 -1277.25 -1303.39
--------------------------------------
TOTAL -1277.28 -1302.86
--------------------------------------

The 'total' harmonic mean of the marginal likelihood is what you'll use to calculate the Bayes Factor. Write this value down and then you're done with the first analysis. Note: your value might be different from the one in the table above- write your own value down!

To see the tree use the command

sumt burnin=500

but we're not interested in the tree right now, all we want to do is see which model is better for this analysis: Mkv or Mkv+G. Normally you would use both of these commands at the end of a MrBayes analysis - sump and sumt. Then you're done. QUIT MRBAYES.

7. Move the output files

Move the output files (these should all start with the text 'M418') into the folder you made with the name M418 no G. The datafile should be in there as well. This method of organizing your results, including the data and the log file, is vital so you have documentation of what you did in case you ever need to go back to replicate or examine a prior analysis.

8. Repeat the above steps for the other model

To do this you must:

1) make a copy of the datafile M418.nex so that you can edit it. Change the name of the copy to M418G.nex

2) edit the copy of the datafile using the PAUP text editor as in step 3:

3) change the MrBayes block so that it will implement the Mkv+G model as follows:

lset nst=1 rates=gamma coding=variable;

4) change the log filename so it will create a log that matches the datafile name

log start filename=M418G.log.txt replace;

5) save the datafile with the changes then move the datafile to the MrBayes folder, launch MrBayes and execute your new datafile with following command:

exe M418G.nex

6) Follow the instructions above for how to summarize your results when the run is done, write down the harmonic mean of the marignal likelihood and move all the output files to their correct folder - M418 w G

9. Bayes Factors

Now you should have two values - one for each model.

Compare them by subtracting their absolute values (to get a positive answer) and double this value. This is twice the difference of the log-likelihoods (harmonic means of the marginal log-likelihoods, actually).

If the value is 0 to 2 the models are essentially the same and the more complex model is not needed (in this case the G for rates=gamma is not necessary).

If the value is 2 to 6 the model with the better score is slightly better than the other.

If the value is 6 to 10 the model with the better score is better than the other.

If the value is >10 the model with the better score is much better than the other.

NOTE: this lab was instructional only - recall that your MCMC runs were not long enough. The shortness of your runs could have resulted in poor estimates of the likelihoods and incorrect answers from the Bayes Factor comparison. To use Bayes Factors you must do longer runs!

Also NOTE: If these models are very similar (Bayes Factor 3 or less) you can double-check the value of Gamma by looking in the sump output for the Mkv+G run and looking at the value of the parameter 'alpha'.

Recall that this value determines the shape of the Gamma distribution. A small alpha (<1) is associated with greater Among Site Rate Variation and suggests that using a model with Gamma is wise. A large alpha (>5) is associated with less AMSRV and suggests that Gamma is not needed. So if the Bayes Factor is small then the alpha should be large. Double-check this.

When done you should be able to show your TA

1. The 2 harmonic means of the marginal likelihoods for each model

2. The Bayes Factor comparing these models

3. The alpha value of the Gamma distribution

4. Which model fits the data best?

Important Note for analyzing your own morphological data with MrBayes:

MrBayes uses slightly different syntax than PAUP so a morphological data file must be slightly edited to execute properly. It is simple to do:

Look for a text string above the data matrix that looks like this:

SYMBOLS = " 0 1 2 3 4 "

Change this by putting the entire string into square brackets and add the following command before it so it looks instead like this:

datatype=standard [SYMBOLS = " 0 1 2 3 4 "]

The 'datatype=standard' command tells MrBayes that the data are morphological data. If you want to execute the file in PAUP you will have to reverse your editing by removing the square brackets from the Symbols command and placing them around the datatype command. A pain, yes, I know!