BADGER

Bayesian Analysis to Describe Genomic Evolution by Rearrangement

Version 1.02 beta, June 11, 2004

Copyright © 2004 by Bret Larget & Don Simon


An example


As an example of how to use BADGER, let us consider the Campanulaceae dataset from Moret, B., L. Wang, T. Warnow, S. Wyman (2001), "New approaches for reconstructing phylogenies from gene order", Bioinformatics, Vol. 17 Suppl. 1 2001: S165-S173. The data consists of arrangements of 105 genes for 13 taxa, including tobacco which serves as the outgroup. The input file for this example in BADGER format is camp.csv. Note that the file uses integers for gene names although arbitrary strings could be used.

We begin by doing a short run. We use the following run control (.rc) file, camp.rc:


data-file=camp.csv
initial-tree-type=good
outgroup=13
cycles=10000
sample-interval=10
window-interval=100
print-all-trees=false

This file uses mostly default values of the run controls, in particular, using the procedure for estimating the hyper-parameters.

We run badger with the command:

badger < camp.rc

This first run of 10,000 cycles took 12 seconds on our computer, a Dell GX270 with a 2.8 GHz processor. Examining the output file, run1.lpd using gnuplot, we see that the chain appears to plateau fairly quickly, after sampling 100 trees. Checking the run1.out file, we see that the tree update methods accept longer and smaller trees with about equal probability, and we also note that this short run found a minimum tree of total length 65.

Since all seems well, we will next do five longer runs of 1,000,000 cycles each. As we have a computer cluster available, we can do the runs simultaneously, although they could be done serially as well (in five times the amount of time, of course.). We prepare a base .rc file which will be incorporated into the five actual .rc files. The base .rc file is camp.base.rc:


data-file=camp.csv
initial-tree-type=good
outgroup=13
cycles=500000
sample-interval=10
window-interval=1000
print-all-trees=false

We use the genrc program to create the five .rc files:

genrc -r camp.rc -f camp.run1 -i camp.base.rc -n 5

This creates the files camp.rc.0, camp.rc.1, ... , camp.rc.4 each with a different seed for the random number generator. These .rc files will create output files using the prefixes camp.run1.0, camp.run1.1, ... On our cluster, we use Condor (Condor © Project (http://www.condorproject.org) for clustering, so we next create the following camp.cmd file:


Executable = /usr/local/bin/badger

Error = camp.err.$(Process)
Input = camp.rc.$(Process)
Output = camp.out.$(Process)
Log = camp.log

Queue 5

Also, the BADGER makefile is changed (before compiling) by replacing the line

CC = g++

with

CC = condor_compile g++

We execute five runs using the command

condor_submit camp.cmd

Without a cluster, we would need to execute the individual commands:

badger < camp.rc.0
badger < camp.rc.1
badger < camp.rc.2
badger < camp.rc.3
badger < camp.rc.4

After the completion of these runs (which took between 8 and 10 minutes each), we again examine the .lpd files using gnuplot. We note that all five files show a plateau, and decide to consider the first 1,000 sample trees as burn-in which will be discarded. Checking the .out files, we see that the tree update algorithms accepted longer trees at about the same rate as shorter trees, and we note that on these longer runs, BADGER found trees of size 64 in all five runs. This indicates that the first short run may have found the most parsimonious tree.

We use the summarize program to summarize the results, discarding the first 1,000 sampled trees of each .top file:

summarize -s 1000 camp.run1.0.top > camp.sum.0
summarize -s 1000 camp.run1.1.top > camp.sum.1
summarize -s 1000 camp.run1.2.top > camp.sum.2
summarize -s 1000 camp.run1.3.top > camp.sum.3
summarize -s 1000 camp.run1.4.top > camp.sum.4

The clade transition tables seem fine in the .sum files. Next, we use the chart program to compare the lists of common clades found in the five runs:

chart camp.sum.[0-4] > camp.chart

The result is:



{1-13}    1.000 1.000 1.000 1.000 1.000 
{1-12}    1.000 1.000 1.000 1.000 1.000 
{1-11}    0.190 0.245 0.261 0.313 0.320 **
{1-9,12}  0.200 0.263 0.284 0.328 0.283 **
{1-9}     1.000 1.000 1.000 1.000 1.000 
{1-7}     0.087 0.038 0.143 0.055 0.025 **
{1-5,7-9} 0.053 0.252 0.080 0.328 0.372 ******
{1-5,8-9} 0.035 0.215 0.116 0.145 0.221 ***
{1-4,8-9} 0.453 0.765 0.525 0.742 0.846 *******
{1-4}     0.126 0.167 0.148 0.168 0.200 *
{1-3,5-9} 0.089 0.040 0.065 0.040 0.025 *
{1-3,5-7} 0.053 0.024 0.055 0.027 0.015 
{1-3,8-9} 0.127 0.165 0.130 0.164 0.171 
{1-3}     0.161 0.183 0.174 0.186 0.196 
{1,4-9}   0.090 0.036 0.064 0.037 0.026 *
{1,4-7}   0.053 0.023 0.055 0.028 0.015 
{1,4,8-9} 0.127 0.169 0.130 0.162 0.172 
{1,4}     0.165 0.185 0.173 0.185 0.191 
{1,5-9}   0.055 0.024 0.040 0.024 0.016 
{1,5-7}   0.092 0.040 0.083 0.044 0.026 *
{1,8-9}   0.163 0.184 0.158 0.183 0.186 
{2-9}     0.094 0.040 0.066 0.041 0.026 *
{2-7}     0.056 0.025 0.055 0.028 0.015 
{2-4,8-9} 0.129 0.172 0.134 0.163 0.175 
{2-4}     0.167 0.189 0.179 0.187 0.198 
{2-3,5-9} 0.057 0.024 0.041 0.025 0.014 
{2-3,5-7} 0.090 0.040 0.079 0.044 0.025 *
{2-3,8-9} 0.162 0.183 0.159 0.180 0.179 
{2-3}     0.994 0.998 0.997 0.994 0.997 
{4-9}     0.055 0.023 0.040 0.023 0.016 
{4-7}     0.087 0.036 0.080 0.042 0.026 *
{4,8-9}   0.167 0.184 0.157 0.178 0.187 
{5-9}     0.092 0.040 0.065 0.040 0.026 *
{5-7}     0.935 0.696 0.817 0.616 0.561 *******
{5,7}     0.178 0.469 0.241 0.477 0.522 ******
{6-7}     0.799 0.368 0.746 0.431 0.322 *********
{8-9}     1.000 1.000 1.000 1.000 1.000 
{10-12}   0.610 0.492 0.455 0.359 0.397 *****
{10-11}   1.000 1.000 1.000 1.000 1.000 


The results are good, but there is still some doubt about how the taxa 5, 6, and 7 should be grouped. We do a much longer set of five runs of 5,000,000 cycles each. We repeat the steps above, changing the number of cycles to be 5,000,000 and the sample interval to be 100 in the camp.base.rc file. The runs take around 1 hour and 40 minutes each. We skip the first 1000 trees when summarizing. The new chart is:



{1-13}    1.000 1.000 1.000 1.000 1.000 
{1-12}    1.000 1.000 1.000 1.000 1.000 
{1-11}    0.256 0.314 0.257 0.250 0.276 *
{1-9,12}  0.263 0.239 0.247 0.252 0.245 
{1-9}     1.000 1.000 1.000 1.000 1.000 
{1-7}     0.029 0.000 0.020 0.017 0.000 
{1-5,7-9} 0.361 0.421 0.421 0.351 0.399 *
{1-5,8-9} 0.172 0.219 0.184 0.168 0.187 *
{1-4,8-9} 0.851 0.986 0.920 0.905 0.942 **
{1-4}     0.184 0.197 0.189 0.192 0.207 
{1-3,5-9} 0.024 0.000 0.011 0.015 0.000 
{1-3,5-7} 0.015 0.000 0.000 0.000 0.000 
{1-3,8-9} 0.176 0.193 0.186 0.183 0.188 
{1-3}     0.190 0.200 0.196 0.197 0.200 
{1,4-9}   0.024 0.000 0.012 0.015 0.000 
{1,4-7}   0.015 0.000 0.000 0.000 0.000 
{1,4,8-9} 0.178 0.198 0.184 0.181 0.185 
{1,4}     0.193 0.198 0.191 0.194 0.199 
{1,5-9}   0.015 0.000 0.000 0.000 0.000 
{1,5-7}   0.025 0.000 0.013 0.016 0.011 
{1,8-9}   0.190 0.200 0.198 0.193 0.192 
{2-9}     0.025 0.000 0.012 0.015 0.000 
{2-7}     0.014 0.000 0.000 0.010 0.000 
{2-4,8-9} 0.181 0.202 0.192 0.188 0.188 
{2-4}     0.193 0.202 0.200 0.199 0.203 
{2-3,5-9} 0.014 0.000 0.000 0.000 0.000 
{2-3,5-7} 0.024 0.000 0.013 0.016 0.000 
{2-3,8-9} 0.185 0.196 0.193 0.190 0.192 
{2-3}     0.996 0.997 0.996 0.996 0.996 
{4-9}     0.014 0.000 0.000 0.000 0.000 
{4-7}     0.025 0.000 0.014 0.015 0.000 
{4,8-9}   0.188 0.199 0.194 0.192 0.190 
{5-9}     0.025 0.000 0.012 0.016 0.000 
{5-7}     0.595 0.540 0.524 0.596 0.552 *
{5,7}     0.598 0.759 0.697 0.670 0.706 ***
{6-7}     0.273 0.060 0.174 0.215 0.155 ****
{8-9}     1.000 1.000 1.000 1.000 1.000 
{10-12}   0.482 0.447 0.496 0.499 0.479 *
{10-11}   1.000 1.000 1.000 1.000 1.000 


There are still some differences between the runs, but the differences are small. This seems to be a reasonable result and we stop here. We also note that the much longer runs found no trees smaller than total size 64.


Back to the table of contents.


This page was most recently updated on June 29, 2004.

badger@badger.duq.edu