BADGER
Bayesian Analysis to Describe Genomic Evolution by Rearrangement
Version 1.02 beta, June 11, 2004
Copyright © 2004 by Bret Larget & Don Simon
There is no guarantee that the algorithms here will correctly sample
from the desired posterior distributions in a finite run. While
theory says that the long-run frequencies will converge to the desired
posterior probabilities, inferences from insufficiently long runs can
be biased if the chain has not moved sufficiently far from the initial
state and converged to near stationarity. Obtaining consistent
results from several different starting points with different random
seeds is a minimum criterion for confidence that the numerical
results provided by BADGER are close to their true values.
We generally follow a procedure like the following when analyzing new data sets.
- Complete one short run using all the tree proposal algorithms that
are appropriate for the number of taxa in the data, running long
enough to reach approximate stationarity.
- Graph the .lpd file. (The free software Gnuplot, or packages
such as R or MATLAB®, are good choices.)
- If the graph does not finish in a plateau, the run was too short.
- Note the approximate level of the final plateau.
- Look at the .out file.
- Verify that the counts of accepted proposals for larger and for smaller trees
are approximately equal for each update method used.
- Complete several (at least five) runs from random starts for several
thousand cycles. The genrc program is useful
for generating run control files for the different runs. These
runs will also show how much clock time per cycle the runs take.
- Plot all .lpd files to make sure the plateaus are in about the same place.
Determine a common number of cycles to discard for all runs
so that each run is well into its plateau.
- Compare the minimum total number of inversions found for each run. These are
stored on the last line of the .min files. Ideally,
they should be the same for all runs.
- Run summarize on the different .top files (discarding many initial sample points)
to check if the same tree topologies and clades are being selected at roughly the same
posterior probabilities.
- Examine the .sum files after running summarize
to see if named clades are reasonably defined. Use chart
to create a comparison
chart of the common clades across the .top files.
Resummarize if necessary. These short runs are more for timing
purposes than for gathering results. If the results are
inconsistent between
the different runs (as indicated by differences in the frequencies
of common clades in the chart), which is likely, longer runs may be necessary.
- Look at transition tables between subtree topologies to see whether mixing is
adequate and similar across the different runs.
- If at this point you find that the runs are converging to very different places,
you may wish to start from a non-random tree found by another method.
It may also be that the algorithms in BADGER are inadequate for your data.
- If all seems in order, it is time to do several (at least five) longer runs
to save for inference. These runs may be on the order of millions
to hundreds of millions of cycles depending on the size of the data, the time to
do the runs, and computational resource limits.
The sample rate should be as small as your storage allows;
the ultimate size of the files you produce should be the limiting factor.
The files may be compressed later or discarded after the essential summarization has occurred.
The runs should be long enough to achieve the accuracy you desire.
You can judge time requirements from the previous exploratory runs.
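The plateau comparison described above can be scripted rather than judged only by eye. The sketch below assumes each .lpd file holds two whitespace-separated columns per line, a cycle number and a log posterior density; this column layout is an assumption about the file format, so verify it against your own BADGER output before relying on the parser.

```python
# Sketch: check that independent BADGER runs plateau at similar log-posterior
# levels and pick a common burn-in. ASSUMPTION: each .lpd line has two
# whitespace-separated fields, cycle number and log posterior density.

def read_lpd(path):
    """Parse an .lpd file into a list of (cycle, log_posterior) pairs."""
    trace = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                trace.append((int(parts[0]), float(parts[1])))
    return trace

def plateau_level(trace):
    """Estimate the plateau as the mean log posterior over the last half."""
    tail = [lp for _, lp in trace[len(trace) // 2:]]
    return sum(tail) / len(tail)

def common_burn_in(traces, tolerance=5.0):
    """First cycle by which every run is within `tolerance` of its plateau."""
    burn = 0
    for trace in traces:
        level = plateau_level(trace)
        for cycle, lp in trace:
            if abs(lp - level) <= tolerance:
                burn = max(burn, cycle)
                break
    return burn

if __name__ == "__main__":
    # Synthetic traces standing in for several runs' .lpd files.
    traces = [
        [(c, -500.0 + min(c, 200) * 2.0) for c in range(0, 1000, 10)],
        [(c, -480.0 + min(c, 150) * 1.9) for c in range(0, 1000, 10)],
    ]
    print("plateau levels:", [plateau_level(t) for t in traces])
    print("common burn-in cycle:", common_burn_in(traces))
```

Runs whose plateau levels differ by much more than the within-run fluctuation are a warning sign that at least one chain has not reached stationarity.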
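The .min comparison can also be automated. The sketch below assumes that the minimum total inversion count is the final whitespace-separated field on the last non-empty line of each .min file; that placement is an assumption, so check it against an actual .min file first.

```python
# Sketch: compare the minimum total inversion count found by each run.
# ASSUMPTION: the count is the last whitespace-separated field on the
# last non-empty line of each .min file.

def min_inversions(lines):
    """Return the inversion count from the last non-empty line."""
    for line in reversed(lines):
        fields = line.split()
        if fields:
            return int(fields[-1])
    raise ValueError("no data lines found")

def runs_agree(per_run_lines):
    """Report whether every run found the same minimum inversion count."""
    counts = [min_inversions(lines) for lines in per_run_lines]
    return len(set(counts)) == 1, counts

if __name__ == "__main__":
    # Stand-ins for the contents of two runs' .min files.
    run1 = ["earlier lines ignored", "cycle 9000 total 47"]
    run2 = ["earlier lines ignored", "cycle 9000 total 47"]
    same, counts = runs_agree([run1, run2])
    print("minimum inversion counts:", counts, "agree:", same)
```

Disagreement here suggests that at least one run never visited the region of tree space containing the most parsimonious rearrangement scenarios.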
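The cross-run check on clade posterior probabilities can be sketched as follows. Parsing of the summarize output is left out here; the mappings from clade to sampled frequency are hypothetical stand-ins for whatever you extract from the .sum or .top files, and the 0.05 tolerance is an arbitrary illustrative choice.

```python
# Sketch: flag clades whose estimated posterior probability differs
# markedly between independent runs. The clade -> frequency mappings
# are hypothetical; build them from your summarize output.

def clade_table(runs):
    """Collect every clade seen in any run with its per-run frequencies."""
    clades = set()
    for freqs in runs:
        clades.update(freqs)
    return {clade: [freqs.get(clade, 0.0) for freqs in runs]
            for clade in clades}

def inconsistent_clades(runs, tolerance=0.05):
    """Clades whose estimated posterior spans more than `tolerance` across runs."""
    return {clade: freqs
            for clade, freqs in clade_table(runs).items()
            if max(freqs) - min(freqs) > tolerance}

if __name__ == "__main__":
    run_a = {frozenset({"A", "B"}): 0.92, frozenset({"C", "D"}): 0.40}
    run_b = {frozenset({"A", "B"}): 0.90, frozenset({"C", "D"}): 0.65}
    for clade, freqs in inconsistent_clades([run_a, run_b]).items():
        print(sorted(clade), freqs)
```

Differences somewhat larger than Monte Carlo error are expected in short runs; persistent large differences after long runs indicate a convergence problem.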
This page was most recently updated on June 29, 2004.
badger@badger.duq.edu