<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>doc</title>
        <description>Documentation for BEAST X v10</description>
        <link>http://github.com/beast-dev/</link>
        <atom:link href="http://github.com/beast-dev/feed.xml" rel="self" type="application/rss+xml"/>
        <pubDate>Sun, 08 Mar 2026 16:40:58 +0000</pubDate>
        <lastBuildDate>Sun, 08 Mar 2026 16:40:58 +0000</lastBuildDate>
        <generator>Jekyll v3.10.0</generator>
        
        <item>
            <title>Some notes on pattern compression and speeding up BEAST analysis</title>
            <description>&lt;h3 id=&quot;pattern-compression-for-fun-and-profit&quot;&gt;Pattern compression for fun and profit&lt;/h3&gt;

&lt;p&gt;by Andrew Rambaut &amp;amp; Philippe Lemey&lt;/p&gt;

&lt;p&gt;To calculate the likelihood of a tree given a sequence alignment, the log likelihood of each site (column in the alignment) is evaluated independently and then summed across all sites in the alignment. The likelihood of any sites that have exactly the same nucleotides for each taxon will be identical (assuming they have been assigned the same substitution model and rate). So, in practice, BEAST ‘compresses’ the alignment into a set of unique patterns of nucleotides (referred to as just ‘patterns’) together with a count of how many times each occurs. Thus, the log likelihood of the alignment is the sum of each pattern’s log likelihood times its count. This trick provides considerable speed improvements over naively calculating the likelihood of each site independently and is used by most (all?) phylogenetic software.&lt;/p&gt;
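&lt;p&gt;As a minimal illustration of the idea (a Python toy, not BEAST’s actual code; the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pattern_log_likelihood&lt;/code&gt; function is a hypothetical stand-in for the per-pattern pruning calculation), pattern compression amounts to collapsing the alignment’s columns into unique patterns with counts and then summing count-weighted per-pattern log likelihoods:&lt;/p&gt;

```python
from collections import Counter

def compress_patterns(alignment):
    """Collapse alignment columns into unique site patterns with counts.

    alignment: list of equal-length sequences, one per taxon.
    Returns a Counter mapping each column pattern (a string with one
    character per taxon) to the number of sites showing that pattern.
    """
    columns = zip(*alignment)  # iterate over sites (columns)
    return Counter("".join(column) for column in columns)

def alignment_log_likelihood(alignment, pattern_log_likelihood):
    """Sum per-pattern log likelihoods, weighted by pattern counts."""
    patterns = compress_patterns(alignment)
    return sum(count * pattern_log_likelihood(pattern)
               for pattern, count in patterns.items())

# Toy example: 3 taxa, 6 sites, only 3 unique patterns
alignment = ["AACGGA", "AACGGA", "AATGGA"]
print(compress_patterns(alignment))  # Counter({'AAA': 3, 'GGG': 2, 'CCT': 1})
```

&lt;p&gt;The expensive per-pattern calculation is then done once per unique pattern, however many sites share it.&lt;/p&gt;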

&lt;p&gt;In the latest version (v10.5.0), BEAUti now shows the number of unique site patterns for a loaded alignment as well as the original length of the alignment:&lt;/p&gt;

&lt;figure style=&quot;width: 320px;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/Patterns_BEAUti.png&quot; alt=&quot;&quot; style=&quot;max-width: 320px&quot; /&gt;&lt;figcaption&gt;Figure 1 | Pattern count in BEAUti data partition table.&lt;/figcaption&gt;&lt;/figure&gt;

&lt;p&gt;This example, using the file &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WNV.fasta&lt;/code&gt; from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;examples/Data&lt;/code&gt; folder, shows that this alignment of West Nile Virus genomes is compressed from over 11,000 nucleotides to only 727 unique site patterns - a compression of over 15 times, with a commensurate improvement in likelihood calculation speed. The less variable the sequences in the alignment are, the greater the level of compression will be. This is because almost all of the compression comes from the ‘constant’ sites that have the same nucleotide for every taxon. Other sites, which do contain nucleotide changes, are unlikely to share exactly the same pattern with another site and so cannot be compressed.&lt;/p&gt;

&lt;p&gt;Here are the 15 most common site patterns for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WNV.fasta&lt;/code&gt; alignment:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;strong&gt;count&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;pattern&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;2821&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;2635&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;2025&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;1963&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANAAAAAAAAAAAAAAA&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;56&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGNGGGGGGGGGGGGGGG&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;49&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCNCCCCCCCCCCCCCCC&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;35&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;30&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;26&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;24&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTNTTTTTTTTTTTTTTT&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;21&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAANNNNAAAAAAAAAAAAAAA&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;11&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTNNNTTTTTTTTTTTTTTTT&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;10&lt;/td&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCNNNNCCCCCCCCCCCCCCC&lt;/code&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;The &lt;a href=&quot;wnv_pattern_table.html&quot;&gt;full table can be seen here&lt;/a&gt;. The first observation (as expected) is that the constant sites are the 4 most common patterns by a large margin (they make up 9,444 of the 11,029 sites in the alignment). The second is that most of the other high-frequency patterns are ones that are constant except for ambiguities (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt;s) in some of the taxa. This is a feature of next-generation sequencing, where areas of low read coverage are filled in with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt;s. Ambiguities are treated as unknown nucleotides in BEAST and are efficiently dealt with in the likelihood calculations done by &lt;a href=&quot;beagle&quot;&gt;BEAGLE&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This second observation offers an opportunity for further compression of the patterns. If we assume that patterns that are constant except for ambiguities are, in fact, constant, then we can simply add their counts to those of the respective constant patterns. For example, in the table above the 5th pattern comprises only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;s and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt;s so might be considered a constant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; pattern. The 10th and 13th patterns are also constant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt;s but differ in their pattern of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt;s, so would be compressed in the same way. Doing this for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WNV.fasta&lt;/code&gt; alignment reduces the 727 unique site patterns to 604 ‘ambiguous constant’ site patterns – a reduction of 17%. When run in BEAST with some standard models/settings, the run-time drops from 11.2 minutes to 9.9 minutes – a reduction of 12%.&lt;/p&gt;
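&lt;p&gt;As an illustrative Python toy (a sketch under stated assumptions, not BEAST’s implementation), the ambiguous-constant step folds any pattern whose non-ambiguous characters are all one nucleotide, and whose fraction of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;N&lt;/code&gt;s does not exceed a threshold, into the corresponding constant pattern:&lt;/p&gt;

```python
def merge_ambiguous_constants(pattern_counts, num_taxa, threshold=0.25):
    """Fold 'constant except for Ns' patterns into constant patterns.

    pattern_counts: dict mapping pattern string to site count.
    threshold: maximum fraction of ambiguous characters ('N') that a
    pattern may contain while still being treated as constant.
    """
    merged = {}
    for pattern, count in pattern_counts.items():
        bases = set(pattern) - {"N"}
        n_fraction = pattern.count("N") / num_taxa
        if len(bases) == 1 and threshold >= n_fraction:
            # Constant apart from Ns: count it as the constant pattern
            pattern = bases.pop() * num_taxa
        merged[pattern] = merged.get(pattern, 0) + count
    return merged

counts = {"AAAA": 10, "NAAA": 3, "ANNA": 2, "ACGT": 1}
print(merge_ambiguous_constants(counts, num_taxa=4, threshold=0.25))
# {'AAAA': 13, 'ANNA': 2, 'ACGT': 1}
```

&lt;p&gt;With the default threshold of 0.25, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NAAA&lt;/code&gt; pattern above is folded into the constant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;A&lt;/code&gt; pattern, while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ANNA&lt;/code&gt; (half ambiguous) is left alone.&lt;/p&gt;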

&lt;p&gt;For other data sets, a much larger saving can be achieved. Here are a few examples:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;strong&gt;Virus&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Dataset&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Sequences&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Sites&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Unique patterns&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Compressed ambiguous&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Factor&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;SARS-CoV-2&lt;/td&gt;
      &lt;td&gt;B.1.1.7&lt;/td&gt;
      &lt;td&gt;976&lt;/td&gt;
      &lt;td&gt;29409&lt;/td&gt;
      &lt;td&gt;2079&lt;/td&gt;
      &lt;td&gt;918&lt;/td&gt;
      &lt;td&gt;2.3x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;SARS-CoV-2&lt;/td&gt;
      &lt;td&gt;omicron_BA1&lt;/td&gt;
      &lt;td&gt;1000&lt;/td&gt;
      &lt;td&gt;29409&lt;/td&gt;
      &lt;td&gt;1528&lt;/td&gt;
      &lt;td&gt;485&lt;/td&gt;
      &lt;td&gt;3.2x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Ebolavirus&lt;/td&gt;
      &lt;td&gt;Makona_1610&lt;/td&gt;
      &lt;td&gt;1610&lt;/td&gt;
      &lt;td&gt;18992&lt;/td&gt;
      &lt;td&gt;7926&lt;/td&gt;
      &lt;td&gt;2267&lt;/td&gt;
      &lt;td&gt;3.5x&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Mpox virus&lt;/td&gt;
      &lt;td&gt;cladei_reservoir&lt;/td&gt;
      &lt;td&gt;60&lt;/td&gt;
      &lt;td&gt;196858&lt;/td&gt;
      &lt;td&gt;3701&lt;/td&gt;
      &lt;td&gt;386&lt;/td&gt;
      &lt;td&gt;9.6x&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;In general, we would expect likelihood evaluation speed to improve by similar factors (with some allowance for overheads).&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;installing&quot;&gt;BEAST X v10.5.0&lt;/a&gt; a new command line option allows this feature to be switched on:&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;nt&quot;&gt;-pattern_compression&lt;/span&gt; ambiguous_constant
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;It is important to note that this feature is an approximation to the full likelihood calculation. In testing it appears to be a good approximation, with only small differences in the estimated values of the parameters. A second option, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ambiguous_threshold&lt;/code&gt;, specifies the maximum proportion of nucleotides in a pattern that can be ambiguous for the pattern still to be considered constant. This can help reduce the effect of the approximation whilst still allowing considerable savings in run-time. For example, compressing all constant sites irrespective of the proportion of ambiguous nucleotides produces a drop in the log likelihood of the tree (Figure 2, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;all ambiguous constant&lt;/code&gt;). Reducing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ambiguous_threshold&lt;/code&gt; to 0.25 (the default value if not specified) returns a tree likelihood similar to that of the normal compression approach (Figure 2, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique patterns&lt;/code&gt;).&lt;/p&gt;

&lt;figure style=&quot;width: 320px;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/Patterns_TreeLikelihood.png&quot; alt=&quot;&quot; style=&quot;max-width: 320px&quot; /&gt;&lt;figcaption&gt;Figure 2 | Log likelihood of the tree for different compression thresholds. &apos;Uncompressed&apos; is no compression, &apos;unique patterns&apos; has an ambiguity threshold of zero, &apos;all ambiguous constant&apos; has a threshold of 1.0.&lt;/figcaption&gt;&lt;/figure&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;strong&gt;Compression&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Threshold&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Pattern count&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;Run time&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-pattern_compression&lt;/code&gt;&lt;/th&gt;
      &lt;th&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-ambiguous_threshold&lt;/code&gt;&lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt; &lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;off&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;n/a&lt;/td&gt;
      &lt;td&gt;11,029&lt;/td&gt;
      &lt;td&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;unique&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;0.0&lt;/td&gt;
      &lt;td&gt;727&lt;/td&gt;
      &lt;td&gt;11.2 minutes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ambiguous_constant&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;0.25&lt;/td&gt;
      &lt;td&gt;659&lt;/td&gt;
      &lt;td&gt;8.1 minutes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ambiguous_constant&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;0.5&lt;/td&gt;
      &lt;td&gt;608&lt;/td&gt;
      &lt;td&gt;7.8 minutes&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ambiguous_constant&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;1.0&lt;/td&gt;
      &lt;td&gt;604&lt;/td&gt;
      &lt;td&gt;7.3 minutes&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

</description>
            <pubDate>Wed, 17 Jul 2024 00:00:00 +0000</pubDate>
            <link>http://github.com/beast-dev/2024-07-17_pattern_compression.html</link>
            <guid isPermaLink="true">http://github.com/beast-dev/2024-07-17_pattern_compression.html</guid>
            
            <category>news</category>
            
            
        </item>
        
        <item>
            <title>BEAST X (v10.5.0-beta5) released</title>
            <description>&lt;h3 id=&quot;we-are-pleased-to-announce-the-release-of-beast-x-v1050-beta5&quot;&gt;We are pleased to announce the release of BEAST X (v10.5.0-beta5)&lt;/h3&gt;

&lt;div class=&quot;bs-callout bs-callout-&quot;&gt;
    &lt;div style=&quot;width: 100%; display: table;&quot;&gt;
        &lt;div style=&quot;display: table-row&quot;&gt;
            &lt;div style=&quot;width: 1%; display: table-cell; text-align: right&quot;&gt;
                
                    &lt;img src=&quot;/images/icons/beastx-icon.png&quot; style=&quot;max-height: 64px; margin: 0px 10px 0px 10px;&quot; /&gt;
                
            &lt;/div&gt;
            &lt;div style=&quot;width: 70%; display: table-cell; vertical-align: middle;&quot;&gt;
                &lt;div style=&quot;font-size: 150%; padding-top: 40px; margin-top: -45px&quot;&gt;&lt;/div&gt;
                &lt;div style=&quot;font-size: 80%; font-weight: normal; font-style: italic;&quot;&gt;&lt;/div&gt;
                &lt;div style=&quot;vertical-align: middle;&quot;&gt;BEAST X is the new name for the BEAST v1 project, and its first release is &lt;code&gt;v10.5.0&lt;/code&gt;, which supersedes &lt;code&gt;v1.10.4&lt;/code&gt; in the old versioning system. From now on we will use the full major, minor, bugfix style of semantic versioning. Thus, this version is not BEAST 10 but &lt;code&gt;BEAST X v10.5&lt;/code&gt; (the 10th major, 5th minor release of the original BEAST project).&lt;/div&gt;
            &lt;/div&gt;
        &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href=&quot;installing&quot;&gt;Download BEAST X binaries for Mac, Windows and UNIX/Linux&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;alert alert-info&quot; role=&quot;alert&quot;&gt;&lt;i class=&quot;fa fa-info-circle&quot;&gt;&lt;/i&gt; &lt;b&gt;Note:&lt;/b&gt; This is a beta version which is suitable for general use but which may still have issues and bugs.&lt;br /&gt;Please report any issues to &lt;a href=&quot;https://github.com/beast-dev/beast-mcmc/issues&quot;&gt;the BEAST GitHub issue list&lt;/a&gt;&lt;/div&gt;

</description>
            <pubDate>Tue, 09 Jul 2024 00:00:00 +0000</pubDate>
            <link>http://github.com/beast-dev/2024-07-09_BEAST_X_released.html</link>
            <guid isPermaLink="true">http://github.com/beast-dev/2024-07-09_BEAST_X_released.html</guid>
            
            <category>news</category>
            
            
        </item>
        
        <item>
            <title>Developing for BEAST</title>
            <description>&lt;h3 id=&quot;a-less-than-brief-non-comprehensive-introduction-to-beast-development&quot;&gt;A less-than-brief, non-comprehensive, introduction to BEAST development&lt;/h3&gt;
&lt;p&gt;BEAST is scientific software for statistical analyses.
This means there are many things you will eventually need to know.
And, this cannot be stressed enough: you don’t need to know all of this right now!&lt;/p&gt;

&lt;p&gt;The purpose of this document is to provide the reader with a foothold into the many pieces that one needs to know to work with BEAST at a developer level.
It is not a comprehensive introduction to many of those topics, which require treatises in their own right.
But hopefully it will help the reader figure out what to google, or where to look in the code base, to resolve errors.&lt;/p&gt;

&lt;h2 id=&quot;dont-panic&quot;&gt;Don’t panic&lt;/h2&gt;
&lt;p&gt;No one who has contributed to BEAST started out being good at it or knowing how to do all of it.
BEAST is &lt;em&gt;established&lt;/em&gt; software; it has been used for phylogenetic analyses since the early 2000s.
You’re standing on many shoulders, which is at times really great (and at times really frustrating).
You’re also working on a &lt;em&gt;massive&lt;/em&gt; code base.
Don’t expect to know how everything works, though it’s good for your character to at least have a general idea of how phylogenetic models work.&lt;/p&gt;

&lt;h2 id=&quot;intellij-has-your-back&quot;&gt;IntelliJ has your back&lt;/h2&gt;
&lt;p&gt;Don’t write BEAST java code without &lt;a href=&quot;https://www.jetbrains.com/idea/&quot;&gt;IntelliJ&lt;/a&gt;.
Just don’t.
It will make your life much, much easier, and is worth every minute it takes to get acquainted with it.&lt;/p&gt;

&lt;!-- TODO: a longer, better piece about using IntelliJ for BEAST development that expands on this stub --&gt;
&lt;p&gt;Setting up IntelliJ to work with BEAST may take a bit of time (all worth it!).
What follows are some basic tips to help with that.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;You will need to tell IntelliJ how to &lt;a href=&quot;https://www.jetbrains.com/help/idea/adding-build-file-to-project.html&quot;&gt;build BEAST with ant&lt;/a&gt; in order to incorporate your changes into a working version of BEAST that you can run.&lt;/li&gt;
  &lt;li&gt;You can run BEAST through IntelliJ passively or in debug mode. Either way, you need to set up a Run/Debug configuration.
    &lt;ul&gt;
      &lt;li&gt;In “Add New Configuration” select “Application”.&lt;/li&gt;
      &lt;li&gt;The main class (entered in the corresponding field of the configuration) is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dr.app.beast.BeastMain&lt;/code&gt;.&lt;/li&gt;
      &lt;li&gt;Things that would go in the command line after you call BEAST, like the path to the XML, whether to overwrite, and such, go in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Program arguments&lt;/code&gt; space.&lt;/li&gt;
      &lt;li&gt;You will want to make sure BEAGLE is accessible. One way this can be done is by choosing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Add VM Options&lt;/code&gt; and in that space specifying &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-Djava.library.path=/usr/local/lib&lt;/code&gt; (or wherever you’ve installed BEAGLE to, if you don’t have root access).&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;For debugging purposes you will probably want to actually get into the code while it’s being run with &lt;a href=&quot;https://www.jetbrains.com/help/idea/starting-the-debugger-session.html&quot;&gt;breakpoints&lt;/a&gt;. This should work if you’ve appropriately set up a Run/Debug configuration as above. Then you can put breakpoints on any line where you want to halt execution and run in debug mode. Note that execution stops &lt;em&gt;before&lt;/em&gt; the line with the breakpoint (so you can put one on a return statement line), and that the line has to &lt;em&gt;do something&lt;/em&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int someInt;&lt;/code&gt; won’t work, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int someInt = 0;&lt;/code&gt; will).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;make-sure-you-work-on-the-right-branch&quot;&gt;Make sure you work on the right branch&lt;/h2&gt;
&lt;!-- TODO: consider a bigger, better intro to git --&gt;
&lt;p&gt;Working with BEAST, like with any large software project, means working with version control.
For BEAST, that means &lt;a href=&quot;https://en.wikipedia.org/wiki/Git&quot;&gt;git&lt;/a&gt; via &lt;a href=&quot;https://github.com/&quot;&gt;GitHub&lt;/a&gt;.
BEAST lives &lt;a href=&quot;https://github.com/beast-dev/beast-mcmc/&quot;&gt;here&lt;/a&gt;, owned by the &lt;a href=&quot;https://github.com/beast-dev/&quot;&gt;beast-dev&lt;/a&gt; group.&lt;/p&gt;

&lt;p&gt;An introduction to git is out of the scope of this overview.
But do be sure to get access privileges to BEAST (become part of beast-dev), and be sure to work on the right branch.
Branches are how big projects can work on multiple things simultaneously, and keep more stable versions of a code base side by side with actively-developed versions.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;As a general rule, do not work on the master branch. This is for a more stable, less prone to breaking version of BEAST. Touch this &lt;em&gt;after&lt;/em&gt; you know what you’re doing.&lt;/li&gt;
  &lt;li&gt;You may want to make a new branch to work on a particular feature. Choose the branch to make this new branch from carefully. If you want access to other recent work, it’s best to make it from the branch where that is happening. If you want it to merge into master easily and soon, make it from master.&lt;/li&gt;
  &lt;li&gt;Much ongoing work on HMC sampling is being done on the hmc-clock branch.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;xml-java-and-object-oriented-programming&quot;&gt;XML, java, and object-oriented programming&lt;/h2&gt;
&lt;p&gt;If you’re going to actively develop BEAST, you will eventually need to work with two different languages.&lt;/p&gt;

&lt;p&gt;The core of BEAST is written in java, which is an object-oriented programming (OOP) language.
Much of the computational burden of BEAST is outsourced to BEAGLE, which is written in C++ (another OOP language).
Neither of these may be quite as cool as they once were, but that’s not to say learning either of them is a waste.
Computer language skills are like real language skills: the more you learn, the easier it is to pick up new ones.
Which is to say that learning java can help you learn other (possibly more marketable) languages, like python (which is also an OOP language).&lt;/p&gt;

&lt;p&gt;BEAST analyses are specified via XML files.
The typical person using BEAST will generate their XML entirely in BEAUti, much as described in the &lt;a href=&quot;https://beast.community/first_tutorial&quot;&gt;tutorials on the BEAST website&lt;/a&gt;.
Then BEAST will instantiate java objects from that XML and use them to run an analysis.
For such users, the XML file can be more or less an afterthought.
But BEAST is an iceberg.
Much of its functionality lives below the surface that is accessible via BEAUti, and certainly all the new work you do will begin this way as well.
That means you should get comfortable manually editing XML files.
A good learning experience to get started on that is to generate XMLs using BEAUti and then look into the XMLs to see how your data and settings are encoded.&lt;/p&gt;

&lt;p&gt;One thing that java and XMLs have in common is that they are all about &lt;em&gt;objects&lt;/em&gt;.
Object-oriented programming takes some getting used to, but it means that some lessons from XML-writing port to writing java code, and vice versa.&lt;/p&gt;

&lt;h2 id=&quot;writing-xmls&quot;&gt;Writing XMLs&lt;/h2&gt;

&lt;p&gt;Let’s get this out of the way first: XML is not really supposed to be human-readable or human-edited.
But, it’s what we’ve got.&lt;/p&gt;

&lt;p&gt;There are some tutorials that explain how to create custom models in XMLs, like &lt;a href=&quot;https://beast.community/custom_substitution_models&quot;&gt;this one&lt;/a&gt; and &lt;a href=&quot;https://beast.community/markov_modulated&quot;&gt;this one&lt;/a&gt; and many of the Advanced Tutorials.
Tutorials like this can be helpful for getting a general sense of what’s going on and how to work with XML, or at least build a bit of intuition that will come in handy when you start running into issues.
But there are a few things worth considering specifically, which are largely linked to the object-centered nature of an XML.&lt;/p&gt;

&lt;p&gt;Note that there’s a reference of the sorts of things you can put in an XML block &lt;a href=&quot;https://beast.community/xml_reference&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;general-tips&quot;&gt;General tips&lt;/h3&gt;
&lt;p&gt;The following are some general-purpose tips that will serve you well from the moment you see your first XML all the way through your ten thousandth.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Use a good text editor.&lt;/strong&gt;
Writing XMLs without a good text editor is far more painful than it needs to be.
You’ll be greatly assisted by things like a search/replace tool with good regex (regular expression) functionality, syntax highlighting, auto-closing of XML tags, and more.
You can use IntelliJ for this, or many other editors (for example the perpetually-popular BBEdit and VScode).
Don’t use things like TextEdit, or Notepad+.
Just don’t.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Don’t start from scratch.&lt;/strong&gt;
Very few BEAST XML files get made from scratch.
Most start from pre-existing raw material, and many of those started as a BEAUti-generated XML.
BEAUti may not have everything you want in an analysis, but it can give you a functional starting point, and it is always good to start from a file that works.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Recycle (steal) XML blocks.&lt;/strong&gt;
If you don’t know how to format an XML block for something, find an XML file that has that block.
Copy it over, and modify it until it works for you.
This is significantly faster and easier than trying to figure out how to format it based purely on reading through the parser’s code.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;key-attributes-of-an-xml-object&quot;&gt;Key attributes of an XML object&lt;/h3&gt;
&lt;p&gt;All XML objects open and close with start and end tags, hence the error you may well encounter at some point: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;The element type &quot;someThing&quot; must be terminated by the matching end-tag &quot;&amp;lt;/someThing&amp;gt;&quot;.&lt;/code&gt;
That often looks like this,&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;someThing&amp;gt;
    code goes here
&amp;lt;/someThing&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But some things can be declared in one-liners, which look like this:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&amp;lt;someThing thing=&quot;someValue&quot; option=&quot;whatever&quot;/&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;where the trailing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/&amp;gt;&lt;/code&gt; closes the element.&lt;/p&gt;

&lt;p&gt;Many things can go inside these XML blocks, exactly what depends entirely on the class.
If you want to know what a class should have, you can always check the rules in its parser (an easy way to search for that in IntelliJ is to search for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;someThing&quot;&lt;/code&gt;, quotes and all, which should take you to the parser that defines its name on that basis).&lt;/p&gt;

&lt;p&gt;There are two things to be careful about that have burned many a person and cost many hours of time.&lt;/p&gt;

&lt;h4 id=&quot;parsers-only-throw-errors-about-things-that-you-dont-specify&quot;&gt;Parsers only throw errors about things that you &lt;em&gt;don’t&lt;/em&gt; specify.&lt;/h4&gt;
&lt;p&gt;You could throw just about anything into a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;treeDataLikelihood&lt;/code&gt; object and BEAST won’t care.
A totally unrelated likelihood?
Sure.
An MCMC block?
Okay.
It will simply, and quietly, ignore anything it doesn’t know how to use.
So, always check your spelling for arguments and make sure that the parser actually does something with whatever it is you’re trying to feed in.&lt;/p&gt;

&lt;h4 id=&quot;an-unparsed-object-throws-no-straightforward-errors&quot;&gt;An unparsed object throws no (straightforward) errors&lt;/h4&gt;
&lt;p&gt;This is largely a corollary to the above point.
XML objects get parsed by parsers.
Parsers get loaded by being included in the appropriate files (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;development_parsers.properties&lt;/code&gt; being the one for development work, more stable things go elsewhere).
If a parser is not loaded, it is not called on the XML object it &lt;em&gt;would&lt;/em&gt; parse.
That object is then not parsed.
No error is thrown saying “hey I don’t know what to do about this XML block.”
Instead, if you’re lucky, you’ll get warnings caused by the non-existence of the object you &lt;em&gt;thought&lt;/em&gt; you had created.&lt;/p&gt;

&lt;p&gt;So when making new parsers for new classes (as discussed more below), be very careful to get that parser into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;development_parsers.properties&lt;/code&gt;.&lt;/p&gt;
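&lt;p&gt;Registering a parser is just one extra line in that properties file. A sketch (the class name below is made up for illustration, and the exact path may differ between branches):&lt;/p&gt;

```properties
# beast-mcmc/src/dr/app/beast/development_parsers.properties (path approximate)
# one fully-qualified parser class per line; this class name is hypothetical
dr.evomodelxml.substmodel.MyNewModelParser
```

&lt;p&gt;If that line is missing, BEAST will silently skip the XML blocks your parser would have handled, exactly as described above.&lt;/p&gt;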

&lt;h3 id=&quot;key-attributes-of-a-beast-xml&quot;&gt;Key attributes of a BEAST XML&lt;/h3&gt;
&lt;p&gt;A BEAST XML is a definition of a statistical model.
We are defining the &lt;em&gt;parameters&lt;/em&gt; of the model and how they go together.
We must define the prior, the likelihood, and the joint (posterior) target distribution.
We must also specify how parameters are to be sampled in MCMC (via the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;operators&amp;gt;&lt;/code&gt; block).
Failing to specify any of these things has consequences, some of them potentially disastrous if they go uncaught.
By way of example:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;A parameter can go unsampled if an operator on it is never specified.&lt;/li&gt;
  &lt;li&gt;The wrong joint posterior distribution may be targeted if a parameter’s prior is not included in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;prior&amp;gt;&lt;/code&gt; block or its effect on the likelihood is not accounted for in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;likelihood&amp;gt;&lt;/code&gt; block.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The resulting posterior will be incorrect and possibly improper.
Those are bad, so be careful!&lt;/p&gt;

&lt;p&gt;The classic (somewhat weirdly-ordered) workflow of a BEAST XML is more or less:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Define a parameter at its first use.&lt;/li&gt;
  &lt;li&gt;Put an operator on it in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;operators&amp;gt;&lt;/code&gt; block.&lt;/li&gt;
  &lt;li&gt;In the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;MCMC&amp;gt;&lt;/code&gt; block, put a prior on it in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;priors&amp;gt;&lt;/code&gt; (sub)block.&lt;/li&gt;
  &lt;li&gt;Put it in some log file in a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;log&amp;gt;&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;logTree&amp;gt;&lt;/code&gt; (sub)block (of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;MCMC&amp;gt;&lt;/code&gt; block).&lt;/li&gt;
&lt;/ol&gt;
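&lt;p&gt;As a rough sketch of that workflow (element names here are abbreviated and written from memory of BEAUti-generated XMLs; a real XML written by BEAUti is the authoritative reference):&lt;/p&gt;

```xml
&amp;lt;!-- 1. define the parameter at its first use --&amp;gt;
&amp;lt;hkyModel id="hky"&amp;gt;
  &amp;lt;kappa&amp;gt;
    &amp;lt;parameter id="hky.kappa" value="2.0" lower="0.0"/&amp;gt;
  &amp;lt;/kappa&amp;gt;
  ...
&amp;lt;/hkyModel&amp;gt;

&amp;lt;!-- 2. put an operator on it --&amp;gt;
&amp;lt;operators id="operators"&amp;gt;
  &amp;lt;scaleOperator scaleFactor="0.75" weight="1"&amp;gt;
    &amp;lt;parameter idref="hky.kappa"/&amp;gt;
  &amp;lt;/scaleOperator&amp;gt;
&amp;lt;/operators&amp;gt;

&amp;lt;!-- 3. give it a prior in the mcmc block, and 4. log it --&amp;gt;
&amp;lt;mcmc id="mcmc" chainLength="1000000"&amp;gt;
  &amp;lt;posterior id="posterior"&amp;gt;
    &amp;lt;prior id="prior"&amp;gt;
      &amp;lt;logNormalPrior mean="1.0" stdev="1.25"&amp;gt;
        &amp;lt;parameter idref="hky.kappa"/&amp;gt;
      &amp;lt;/logNormalPrior&amp;gt;
    &amp;lt;/prior&amp;gt;
    &amp;lt;likelihood id="likelihood"&amp;gt;...&amp;lt;/likelihood&amp;gt;
  &amp;lt;/posterior&amp;gt;
  &amp;lt;log logEvery="1000" fileName="run.log"&amp;gt;
    &amp;lt;parameter idref="hky.kappa"/&amp;gt;
  &amp;lt;/log&amp;gt;
&amp;lt;/mcmc&amp;gt;
```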

&lt;h3 id=&quot;there-will-be-bugs&quot;&gt;There will be bugs&lt;/h3&gt;
&lt;p&gt;XML editing is bound to result in errors.
Just like for any other coding job, you will find yourself spending time fixing these bugs.
(As well as time putting them in unintentionally, of course.)
BEAST does try to help you out with this: its error messages often refer to the lines or the specific XML objects where problems were encountered.
Heed those messages, and start your debugging search there.&lt;/p&gt;

&lt;p&gt;When you’re writing your own code for BEAST, you can also gain information from the error messages.
BEAST can tell you what line in what java class ended up provoking a fatal error, and the call stack that led there.
This is very helpful.
And even when writing your own classes, XML bugs remain a leading source of issues when things go wrong.
Never take your XML for granted.&lt;/p&gt;

&lt;h2 id=&quot;designing-and-writing-beast-code&quot;&gt;Designing and writing BEAST code&lt;/h2&gt;
&lt;p&gt;Because of the java-XML dichotomy, to get something made in BEAST that is usable, you will generally need to write both a class (to do what you want) and a parser.&lt;/p&gt;

&lt;h3 id=&quot;parsers&quot;&gt;Parsers&lt;/h3&gt;
&lt;p&gt;A parser in BEAST is a function that parses a specific kind of XML object, like a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;treeDataLikelihood&amp;gt;&lt;/code&gt; or a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;newick&amp;gt;&lt;/code&gt;.
When designing a parser, important questions to consider include:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;What, if any, fixed options will a user need to specify or set?&lt;/li&gt;
  &lt;li&gt;More importantly, what other parts of a BEAST model will it need access to?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Marc Suchard School of BEAST Development says that you should always start by thinking about what you want your parser to look like.
There are some very good arguments in favor:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;You won’t get very far writing anything if your hypothetical user (who is you in a few hours or weeks) can’t specify a way to access the code you’re going to write.&lt;/li&gt;
  &lt;li&gt;This is one less thing to debug later, helping separate out the many layers where issues can otherwise occur.&lt;/li&gt;
  &lt;li&gt;If you have a working parser, you can start using an example XML and get inside the code with IntelliJ to trace and squash bugs earlier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, knowing what you need to pass in requires knowing a lot about what you can pass in and how things work. You may not know what is accessible for your needs.
Until you’re sure what works best for you, it may be a good idea to not let either the parser or the actual class at hand get too far ahead of the other.
Minimally, don’t assume you’re done with developing your parser just because it can take in all the objects you think it needs.
The same applies the other way: don’t assume your java class is done just because IntelliJ isn’t throwing any more errors.
You may go through a few rounds of developing both of these before you know exactly what you need in them.&lt;/p&gt;

&lt;h3 id=&quot;classes&quot;&gt;Classes&lt;/h3&gt;
&lt;p&gt;To actually do something new in BEAST, you’ll probably be writing a java class.
Eventually you might need interfaces, which are like entirely abstract classes: they cannot be instantiated themselves, but they define a lot about how an implementing class must behave.
A java class defines an object that will be made at some point when BEAST is called on the XML.
That object can be just about anything and do just about anything: BEAST has classes for parameters, trees, parameter transformations, multiple sequence alignments, the classical phylogenetic likelihood, HMC operators, and much, much more.&lt;/p&gt;

&lt;p&gt;Classes are so general that the advice here is going to have to be pretty general.
The basic structure of a class is that you declare it, define a constructor that sets up a new object, and then write whatever code the object needs to do its job.
IntelliJ is your friend because it is good at knowing what sorts of functions your class will need.&lt;/p&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extends&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;implements&lt;/code&gt; keywords are ways to relate your class to other classes.
A class can &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extend&lt;/code&gt; only one other class (which may itself be abstract), while it can &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;implement&lt;/code&gt; any number of interfaces.
Both of these are useful tools for outsourcing a lot of work to stuff others have written.
In cases like this, you may start to see &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;super&lt;/code&gt; get tossed around, which is a way to tell java to let this class’s superclass (parent class) handle something.
Often in BEAST you’ll see this for constructors.&lt;/p&gt;
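&lt;p&gt;A minimal, self-contained sketch of these keywords (toy names, not actual BEAST classes):&lt;/p&gt;

```java
// Illustrative only: these are toy stand-ins, not real BEAST classes.
interface Citable {
    String getCitation();
}

class AbstractModel {
    private final String name;

    AbstractModel(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// extend one class, implement as many interfaces as you like
class StrictClockModel extends AbstractModel implements Citable {
    private final double rate;

    StrictClockModel(double rate) {
        super("strictClock"); // let the superclass handle the shared setup
        this.rate = rate;
    }

    double getRate() {
        return rate;
    }

    public String getCitation() {
        return "Someone et al. (some year)";
    }
}

public class ExtendsDemo {
    public static void main(String[] args) {
        StrictClockModel clock = new StrictClockModel(0.001);
        System.out.println(clock.getName() + " rate: " + clock.getRate());
    }
}
```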

&lt;p&gt;Classes can (and usually will) have member objects.
Member objects can be important parts of a class, like the many things that go into a tree model.
Or they can be convenient ways to offload actually doing work into something else.
This is why many classes have some form of likelihood object as a member: it’s much easier (and saves a lot of duplicated code) to say “hey likelihood, what are you now?” than to write out the likelihood again.&lt;/p&gt;

&lt;h4 id=&quot;case-study-hky&quot;&gt;Case study: HKY&lt;/h4&gt;
&lt;p&gt;As a brief case study, let’s take a very fast look at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HKY.java&lt;/code&gt; (specifically the version on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hmc-clock&lt;/code&gt; branch).
This &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extends&lt;/code&gt; the class &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BaseSubstitutionModel&lt;/code&gt; while it &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;implements&lt;/code&gt; several classes, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Citable&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ParameterReplaceableSubstitutionModel&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DifferentiableSubstitutionModel&lt;/code&gt;.
Thankfully, most of these do what their names suggest.
One thing we might ask ourselves is whether we really need to know what all of these are.
The answer in many cases will be no, which is quite convenient.&lt;/p&gt;

&lt;p&gt;The class has only one member variable, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kappa&lt;/code&gt;.
This seems suspicious: where are the stationary frequencies?
Let’s check the constructor.
“Which constructor?” you might ask, as there are two.
That’s easy enough to see here, though.
The first takes in kappa as a double (a fixed value), makes a parameter out of it, and calls the other one.
So the second one is the “real” constructor, and it takes in a kappa parameter and a frequency model.
The real constructor then calls the super class’s constructor.
If we glance at &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BaseSubstitutionModel&lt;/code&gt;, we see that &lt;em&gt;it&lt;/em&gt; holds onto the stationary frequencies (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;frequencyModel&lt;/code&gt;).
Mystery solved!&lt;/p&gt;
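&lt;p&gt;The delegation between the two constructors is a common Java idiom worth recognizing. A stripped-down sketch (not the real &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HKY.java&lt;/code&gt;, whose constructors take more arguments):&lt;/p&gt;

```java
// Toy sketch of the two-constructor pattern (names simplified, not real BEAST code).
class Parameter {
    private final double value;

    Parameter(double value) {
        this.value = value;
    }

    double getValue() {
        return value;
    }
}

public class ToyHKY {
    private final Parameter kappa;

    // convenience constructor: wrap the fixed value, then delegate
    public ToyHKY(double kappaValue) {
        this(new Parameter(kappaValue));
    }

    // the "real" constructor
    public ToyHKY(Parameter kappa) {
        this.kappa = kappa;
        // the real class also calls super(...) here, handing the
        // frequency model off to BaseSubstitutionModel
    }

    public double getKappa() {
        return kappa.getValue();
    }

    public static void main(String[] args) {
        System.out.println(new ToyHKY(2.0).getKappa());
    }
}
```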

&lt;p&gt;Most of the rest of the class does four things:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Handles the basics of HKY. Computes things about rate matrices, makes sure things get updated when kappa changes, and allows tracking the transition-transversion rate ratio if you really want to.&lt;/li&gt;
  &lt;li&gt;Tells BEAST how to cite the model, living up to its promise that it &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;implements Citable&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;A function called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;public HKY factory(List&amp;lt;Parameter&amp;gt; oldParameters, List&amp;lt;Parameter&amp;gt; newParameters)&lt;/code&gt; which, if you poke into it, will turn out to be living up to its promise that it &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;implements ParameterReplaceableSubstitutionModel&lt;/code&gt;. That interface can be used to support branch-specific model parameters.&lt;/li&gt;
  &lt;li&gt;A lot of functions involved in gradient computation, which only matter when you need to know about that.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There is also a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;public static void main(String[] args)&lt;/code&gt; function that serves as a test.&lt;/p&gt;

&lt;p&gt;And this class also highlights the existence of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;enum&lt;/code&gt;, a lightweight type that you can declare and use inside a class to handle a fixed set of cases without writing tons of if/else statements.&lt;/p&gt;
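&lt;p&gt;A self-contained example of that pattern (the enum here is made up for illustration, not taken from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HKY.java&lt;/code&gt;):&lt;/p&gt;

```java
// Sketch: an enum dispatching between a fixed set of cases.
public class EnumDemo {

    enum Nucleotide {
        A, C, G, T;

        boolean isPurine() {
            // a switch on the enum replaces a chain of if/else comparisons
            switch (this) {
                case A:
                case G:
                    return true;
                default:
                    return false;
            }
        }
    }

    public static void main(String[] args) {
        for (Nucleotide n : Nucleotide.values()) {
            System.out.println(n + " purine: " + n.isPurine());
        }
    }
}
```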

&lt;h3 id=&quot;beware-the-model-graph&quot;&gt;Beware the model graph&lt;/h3&gt;
&lt;p&gt;BEAST assembles the components of the statistical model from the XML into a graph.
This allows the various parts of the model to know when something has changed, and therefore whether anything needs to be recomputed.
For example, if you change the stationary frequencies of a GTR model, you expect that BEAST will know that:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The prior density has changed.&lt;/li&gt;
  &lt;li&gt;A rate matrix has changed, which means the phylogenetic likelihood needs to be recomputed, which means the likelihood has changed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are a number of components of BEAST classes that have to do with this.
These exist in things that are &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Model&lt;/code&gt;s or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Parameter&lt;/code&gt;s and classes that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;implement&lt;/code&gt; things like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ModelListener&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VariableListener&lt;/code&gt;.
Through the &lt;del&gt;magic&lt;/del&gt;dark voodoo of good software design, you will often be able to do things without going anywhere near any of this.
But sometimes you’ll get a dreaded error, like a likelihood not being the same after a move is rejected, and you’ll have to pay attention.
The following is a non-exhaustive list of functions to pay attention to.
They may be implemented wrong, unimplemented, or unimportant:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fireModelChanged&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;handleModelChangedEvent&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fireVariableChanged&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;handleVariableChangedEvent&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;storeState&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;restoreState&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;acceptState&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re reading this because you need it, good luck and godspeed.&lt;/p&gt;
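&lt;p&gt;To make the store/restore contract concrete, here is a toy sketch (not actual BEAST code) of what those last functions are for: before a move, the current state is stored; if the move is rejected, the stored state must be restored exactly.&lt;/p&gt;

```java
// Toy sketch of the store/restore contract (not real BEAST code).
public class StoreRestoreDemo {
    private double value;
    private double storedValue;

    public StoreRestoreDemo(double value) {
        this.value = value;
    }

    public void storeState() {
        storedValue = value; // save the current state before a move
    }

    public void restoreState() {
        value = storedValue; // put back exactly what was saved
    }

    public void propose(double newValue) {
        storeState();
        value = newValue;
    }

    public void reject() {
        restoreState();
    }

    public double getValue() {
        return value;
    }

    public static void main(String[] args) {
        StoreRestoreDemo x = new StoreRestoreDemo(1.0);
        x.propose(2.0);
        x.reject();
        // if restoreState were broken, the value here would silently stay 2.0
        System.out.println(x.getValue());
    }
}
```

&lt;p&gt;Bugs in this pattern are exactly what produce the “likelihood not the same after a rejected move” errors mentioned above.&lt;/p&gt;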

&lt;h3 id=&quot;write-tests&quot;&gt;Write tests!&lt;/h3&gt;
&lt;p&gt;When you write something new, you should think very seriously about writing a test to make sure it works.
Maybe a few tests.
It doesn’t matter how braindead simple you think your code is.
Tests are awesome, and tests are your friend.&lt;/p&gt;

&lt;p&gt;At its most basic level, a test executes code and checks the result against a hard-coded (prespecified) value.
Then it either passes and says everything worked, or complains if the values are different (or too different; sometimes you need to allow for some numerical tolerance).
But tests can serve a number of purposes when you’re developing new code.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;They are debugging tools, to help you find and squash issues. You can write aspirational tests to make sure the code will handle everything you want it to.&lt;/li&gt;
  &lt;li&gt;A test file can serve as a repository of edge-cases, since edge cases often provoke heretofore unseen issues. When something new and weird happens, add it to the pile.&lt;/li&gt;
  &lt;li&gt;Tests prevent future muck-ups. If someone (possibly you in a few weeks or years) alters code and that in turn changes how your code executes, your test will complain about that. This will alert the author of the changes that those changes are having unexpected consequences elsewhere, so they need to make some tweaks.&lt;/li&gt;
  &lt;li&gt;Tests keep &lt;em&gt;you&lt;/em&gt; safe from the worry of messing up existing functionality. (This is just the last point seen from the perspective of the person making changes.) If you’re going to make changes to code, and it isn’t being tested, add tests to check that you don’t break any existing functionality.&lt;/li&gt;
&lt;/ul&gt;
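&lt;p&gt;The basic check-against-a-hard-coded-value pattern described above, with a numerical tolerance, can be sketched like this (the function and reference value here are just an example):&lt;/p&gt;

```java
// Sketch of the basic test pattern: compute something, compare against a
// hard-coded expected value, and allow a small numerical tolerance.
public class ToleranceTestDemo {
    static final double TOLERANCE = 1e-10;

    static double logDensityStandardNormal(double x) {
        return -0.5 * x * x - 0.5 * Math.log(2.0 * Math.PI);
    }

    public static void main(String[] args) {
        double expected = -0.9189385332046727; // hard-coded reference: log density at x = 0
        double actual = logDensityStandardNormal(0.0);
        if (Math.abs(actual - expected) > TOLERANCE) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
        System.out.println("test passed");
    }
}
```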

&lt;p&gt;In BEAST, we have two kinds of tests: XML tests and java tests.&lt;/p&gt;

&lt;h4 id=&quot;tests-in-xml&quot;&gt;Tests in XML&lt;/h4&gt;
&lt;p&gt;XML tests live in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beast-mcmc/ci/&lt;/code&gt;, mostly in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beast-mcmc/ci/TestXML/&lt;/code&gt; (some that need to load information live in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beast-mcmc/ci/TestXMLwithLoadState/&lt;/code&gt;).
These tests are all, as you might expect, written as XMLs.
You set up XML objects, then you check that they produce the expected result.
This can test many things, from simple distributions, to complicated likelihoods, to MCMC (the latter can be done without loading a saved state, with judicious usage of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fireParameterChanged&lt;/code&gt; XML blocks).
These are the easiest kind of tests to write, and they have the additional benefit of making sure things get parsed acceptably too.&lt;/p&gt;

&lt;p&gt;Many of these tests hinge on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;report&amp;gt;&lt;/code&gt; XML block, which calls the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getReport()&lt;/code&gt; method of a class that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;implements Reportable&lt;/code&gt;.
Reports are a great way for you to see what a class you’re writing is doing, and can compute and print basically anything to screen that you might want.
When you want to use a report to check against a prespecified value, you can use regex to parse the report and extract what you want.
This will all make more sense when you look at examples and try to start making it work.&lt;/p&gt;
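&lt;p&gt;Extracting a value from a report with regex might look like this sketch (the report string and method here are invented for illustration; real test XMLs wire this up through XML elements rather than standalone java):&lt;/p&gt;

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pulling a number out of a report string with a regex,
// the way a test might check a getReport() value.
public class ReportRegexDemo {
    static double extractLikelihood(String report) {
        Pattern p = Pattern.compile("log likelihood = (-?\\d+\\.\\d+)");
        Matcher m = p.matcher(report);
        if (!m.find()) {
            throw new IllegalArgumentException("no likelihood found in report");
        }
        return Double.parseDouble(m.group(1));
    }

    public static void main(String[] args) {
        String report = "treeDataLikelihood: log likelihood = -1234.5678";
        System.out.println(extractLikelihood(report));
    }
}
```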

&lt;p&gt;You can run any of these tests by calling BEAST on the XML.&lt;/p&gt;

&lt;p&gt;Note that “ci” stands for Continuous Integration because these tests are run as part of &lt;a href=&quot;https://docs.github.com/en/actions/automating-builds-and-tests/about-continuous-integration&quot;&gt;GitHub’s continuous integration&lt;/a&gt;.
The upshot is that whenever anything is pushed to the BEAST repository on GitHub, the tests are run (this is specified &lt;a href=&quot;https://github.com/beast-dev/beast-mcmc/blob/master/.github/workflows/ci.yml&quot;&gt;here&lt;/a&gt;).
This means that if you break something you’ll find out pretty quickly.
That’s good!
As bad as an “oops you broke BEAST” message feels, it’s much better to know now than after that has caused analyses to be wrong.&lt;/p&gt;

&lt;h4 id=&quot;tests-in-java&quot;&gt;Tests in java&lt;/h4&gt;
&lt;p&gt;Sometimes, you can’t quite fit what you want or need to test into an XML.
This can happen when you’re working with code that lives deeper in the code base.
But even some things that you use in XMLs are hard to test that way (a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;transform&lt;/code&gt; block doesn’t know what variables go in it, making it hard to test directly in XML).
In these cases, you may need the more powerful and flexible internal unit testing done entirely in java.
These tests live in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beast-mcmc/src/test/dr/&lt;/code&gt; and work by creating java objects directly, then checking that functions operating on those objects produce the expected results.&lt;/p&gt;

&lt;p&gt;You can run these unit tests directly in IntelliJ, which will tell you which (if any) of the tests in a unit test file pass and fail.&lt;/p&gt;

&lt;p&gt;These tests are executed when BEAST is built by ant (as noted in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beast-mcmc/build.xml&lt;/code&gt;).
So if your changes to some code break something else that has a unit test, you’ll know as soon as you try to compile the code to run it.&lt;/p&gt;

&lt;p&gt;Note that the file structure of where these tests live mimics &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;beast-mcmc/src/dr&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;a-brief-overview-of-some-guiding-principles&quot;&gt;A brief overview of some guiding principles&lt;/h3&gt;
&lt;p&gt;There are many tips for good software design; here are a few.&lt;/p&gt;

&lt;h4 id=&quot;dont-reinvent-the-wheel&quot;&gt;Don’t reinvent the wheel!&lt;/h4&gt;
&lt;p&gt;BEAST is old enough that many things you might want to do have already been done.
Before going out and writing something from scratch, see if it exists first!&lt;/p&gt;

&lt;h4 id=&quot;recycle-dont-rewrite&quot;&gt;Recycle, don’t rewrite&lt;/h4&gt;
&lt;p&gt;Many times you will find what you want to do is a small modification of something else.
You could add your code directly to the existing class.
But if you instead make a new class that &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;extends&lt;/code&gt; the pre-existing class, you may have an easier time.
And you certainly won’t have to worry about messing up the existing functionality of that class.&lt;/p&gt;

&lt;h4 id=&quot;kiss&quot;&gt;KISS&lt;/h4&gt;
&lt;p&gt;KISS is a great &lt;a href=&quot;https://nl.wikipedia.org/wiki/Kiss_(band)&quot;&gt;band&lt;/a&gt; but it also stands for &lt;a href=&quot;https://en.wikipedia.org/wiki/KISS_principle&quot;&gt;keep it simple, stupid&lt;/a&gt;.
Don’t make things more complex than they need to be.&lt;/p&gt;

&lt;h4 id=&quot;keep-an-eye-on-generality&quot;&gt;Keep an eye on generality&lt;/h4&gt;
&lt;p&gt;Sometimes it’s easier to solve not just the problem at hand but a general class of problems.
Sometimes you know you’ve got an extension coming down the line.
Sometimes you just want to future-proof your work.
For any and all of these reasons, it’s good to do things generally when possible (but &lt;em&gt;done&lt;/em&gt; and working code is better than hypothetical code).&lt;/p&gt;

&lt;h4 id=&quot;test-early-and-test-often&quot;&gt;Test early and test often&lt;/h4&gt;
&lt;p&gt;Tests are a developer’s best friend.
Write them.
Use them.
Love them.&lt;/p&gt;

</description>
            <pubDate>Tue, 08 Aug 2023 00:00:00 +0000</pubDate>
            <link>http://github.com/beast-dev/beast_development_introduction.html</link>
            <guid isPermaLink="true">http://github.com/beast-dev/beast_development_introduction.html</guid>
            
            <category>article</category>
            
            
        </item>
        
        <item>
            <title>Ebola Virus Local Clock Analysis</title>
            <description>
&lt;p&gt;In 2014 there was an outbreak of Ebola virus disease in the Democratic Republic of Congo. 
When the first genome sequences were published (Maganga &lt;em&gt;et al.&lt;/em&gt; 2014) it was noticed that the amount of divergence from the earliest EBOV genomes from the 1970s was considerably less than for the West African epidemic genomes which were from the same year. 
This suggested that the DRC lineage had exhibited a substantially lower rate of evolution (Lam &lt;em&gt;et al.&lt;/em&gt; 2015). 
Lam &lt;em&gt;et al.&lt;/em&gt; speculated that this may be due to it being in a different host species with different evolutionary forces at work.&lt;/p&gt;

&lt;p&gt;However, during the West African outbreak, a number of examples of long-term latency were observed where someone who had recovered from EVD months later transmitted the virus to another individual – usually a sexual partner (Blackley &lt;em&gt;et al.&lt;/em&gt; 2016). 
In the most extreme example, there was a 15 month interval between the acute infection and the onward transmission (Diallo &lt;em&gt;et al.&lt;/em&gt; 2016). 
It was noticed that these cases were often associated with a short branch length suggesting a reduced rate of evolution or a form of latency with reduced replication for much of the period. 
This suggests that a similar process could be at work for EBOV in the non-human animal hosts over longer timescales.&lt;/p&gt;

&lt;p&gt;Sequences from the last 3 DRC outbreaks (in 2017, summer 2018 and the currently ongoing one in the North East of the country) also exhibit this apparently reduced branch length. 
&lt;a href=&quot;http://virological.org/t/drc-2018-viral-genome-characterization/230&quot;&gt;See this post for a tree produced by the INRB and USAMRIID that shows this effect&lt;/a&gt; and also &lt;a href=&quot;https://doi.org/10.1016/S1473-3099(19)30118-5&quot;&gt;Mbala-Kingebeni &lt;em&gt;et al.&lt;/em&gt; 2019b&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To explore EBOV rate variation in non-human hosts, we assembled a data set of genomes that spans the known history of the virus. Most EBOV genomes have been sampled from human cases, so we have included one genome per outbreak, preferring those with precise dates of sampling. 
A list of sequences used is given in Table 1 along with their Genbank accession numbers and, where available, a reference to the published work describing them.&lt;/p&gt;

&lt;div class=&quot;small-text&quot;&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;&lt;strong&gt;accession&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;country&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;name&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;date&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;outbreak&lt;/strong&gt;&lt;/th&gt;
        &lt;th&gt;&lt;strong&gt;reference&lt;/strong&gt;&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/KR063671&quot;&gt;KR063671&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;DRC&lt;/td&gt;
        &lt;td&gt;Yambuku-Mayinga&lt;/td&gt;
        &lt;td&gt;1976-10-01&lt;/td&gt;
        &lt;td&gt;Yambuku/1976&lt;/td&gt;
        &lt;td&gt; &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/KC242791&quot;&gt;KC242791&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;DRC&lt;/td&gt;
        &lt;td&gt;Bonduni&lt;/td&gt;
        &lt;td&gt;1977-06&lt;/td&gt;
        &lt;td&gt;Bonduni/1977&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/pubmed/23255795&quot;&gt;Carroll et al. 2013&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/KC242792&quot;&gt;KC242792&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;GAB&lt;/td&gt;
        &lt;td&gt;Gabon&lt;/td&gt;
        &lt;td&gt;1994-12-27&lt;/td&gt;
        &lt;td&gt;Minkebe/1994&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/pubmed/23255795&quot;&gt;Carroll et al. 2013&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/KU182905&quot;&gt;KU182905&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;DRC&lt;/td&gt;
        &lt;td&gt;Kikwit-9510621&lt;/td&gt;
        &lt;td&gt;1995-05-04&lt;/td&gt;
        &lt;td&gt;Kikwit/1995&lt;/td&gt;
        &lt;td&gt; &lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/KC242793&quot;&gt;KC242793&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;GAB&lt;/td&gt;
        &lt;td&gt;1Eko&lt;/td&gt;
        &lt;td&gt;1996-02&lt;/td&gt;
        &lt;td&gt;Mayibout/1996&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/pubmed/23255795&quot;&gt;Carroll et al. 2013&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/KC242798&quot;&gt;KC242798&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;GAB&lt;/td&gt;
        &lt;td&gt;1Ikot&lt;/td&gt;
        &lt;td&gt;1996-10-27&lt;/td&gt;
        &lt;td&gt;Booue/1996&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/pubmed/23255795&quot;&gt;Carroll et al. 2013&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/KC242800&quot;&gt;KC242800&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;GAB&lt;/td&gt;
        &lt;td&gt;Ilembe&lt;/td&gt;
        &lt;td&gt;2002-02-23&lt;/td&gt;
        &lt;td&gt;Mekambo/2001&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/pubmed/23255795&quot;&gt;Carroll et al. 2013&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/KF113529&quot;&gt;KF113529&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;COG&lt;/td&gt;
        &lt;td&gt;Kelle_2&lt;/td&gt;
        &lt;td&gt;2003-10&lt;/td&gt;
        &lt;td&gt;Mbomo/2003&lt;/td&gt;
        &lt;td&gt;Chiu et al. 2013&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/HQ613403&quot;&gt;HQ613403&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;DRC&lt;/td&gt;
        &lt;td&gt;M-M&lt;/td&gt;
        &lt;td&gt;2007-08-31&lt;/td&gt;
        &lt;td&gt;Luebo/2007&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;http://doi.org/10.1093/infdis/jir364&quot;&gt;Grard et al. 2011&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/HQ613402&quot;&gt;HQ613402&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;DRC&lt;/td&gt;
        &lt;td&gt;034-KS&lt;/td&gt;
        &lt;td&gt;2008-12-31&lt;/td&gt;
        &lt;td&gt;Luebo/2008&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;http://doi.org/10.1093/infdis/jir364&quot;&gt;Grard et al. 2011&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/KJ660347&quot;&gt;KJ660347&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;GIN&lt;/td&gt;
        &lt;td&gt;Makona-Gueckedou-C07&lt;/td&gt;
        &lt;td&gt;2014-03-20&lt;/td&gt;
        &lt;td&gt;West_Africa/2013&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/pubmed/24738640&quot;&gt;Baize et al. 2014&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/KP271018&quot;&gt;KP271018&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;DRC&lt;/td&gt;
        &lt;td&gt;Lomela-Lokolia16&lt;/td&gt;
        &lt;td&gt;2014-08-20&lt;/td&gt;
        &lt;td&gt;Boende-Lokolia/2014&lt;/td&gt;
        &lt;td&gt;Naccache et al. 2014&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/MH613311&quot;&gt;MH613311&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;DRC&lt;/td&gt;
        &lt;td&gt;Muembe.1&lt;/td&gt;
        &lt;td&gt;2017-05-07&lt;/td&gt;
        &lt;td&gt;Likati/2017&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;https://doi.org/10.1093/infdis/jiz107&quot;&gt;Nsio et al. 2018&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/MH733477&quot;&gt;MH733477&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;DRC&lt;/td&gt;
        &lt;td&gt;Tumba-BIK009&lt;/td&gt;
        &lt;td&gt;2018-05-10&lt;/td&gt;
        &lt;td&gt;Bikoro-Mbandaka/2018&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;https://doi.org/10.1016/S1473-3099(19)30124-0&quot;&gt;Mbala-Kingebeni et al. 2019a&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;&lt;a href=&quot;https://www.ncbi.nlm.nih.gov/nuccore/MK007330&quot;&gt;MK007330&lt;/a&gt;&lt;/td&gt;
        &lt;td&gt;DRC&lt;/td&gt;
        &lt;td&gt;Ituri-18FHV090&lt;/td&gt;
        &lt;td&gt;2018-07-28&lt;/td&gt;
        &lt;td&gt;Kivu/2018&lt;/td&gt;
        &lt;td&gt;&lt;a href=&quot;https://doi.org/10.1016/S1473-3099(19)30118-5&quot;&gt;Mbala-Kingebeni et al. 2019b&lt;/a&gt;&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;

&lt;/div&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt; A list of the genomes used in this post and their references.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Building a maximum likelihood tree of these genomes shows the apparent slow-down in the recent lineages (Figure 1; yellow dots). 
A root-to-tip regression (the line is fitted only to the green dots) shows how far below the expected line these are (this is similar to Figure 5 in &lt;a href=&quot;https://doi.org/10.1016/S1473-3099(19)30118-5&quot;&gt;Mbala-Kingebeni et al. 2019b&lt;/a&gt;).&lt;/p&gt;
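&lt;p&gt;For anyone wanting to reproduce this kind of plot, a root-to-tip regression is just an ordinary least-squares fit of divergence against sampling date; a minimal sketch (using made-up divergence values purely for illustration, not the data behind Figure 1):&lt;/p&gt;

```python
# Sketch of a root-to-tip regression: ordinary least squares of root-to-tip
# divergence against sampling date. The slope estimates the substitution rate;
# the x-intercept estimates the root date. The values below are illustrative
# only; they are not the genomes in Table 1.

def fit_line(xs, ys):
    """Least-squares fit returning (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

dates = [1976.5, 1995.4, 2007.7, 2014.2, 2018.4]       # decimal years
divergence = [0.0004, 0.0155, 0.0253, 0.0306, 0.0340]  # subs/site from root

rate, intercept = fit_line(dates, divergence)
root_date = -intercept / rate   # where the fitted line crosses zero divergence
print(f"rate = {rate:.2e} subs/site/year, root around {root_date:.0f}")
```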

&lt;iframe src=&quot;https://rambaut.github.io/figtree.js/ebov.html&quot; style=&quot;width: 1500px; height: 450px; border: 0px&quot;&gt;&lt;/iframe&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 1.&lt;/strong&gt; A tree and root-to-tip plot for the 15 Ebola virus genomes in Table 1.
This is an interactive figure: click the points to include/exclude them from the regression. 
The yellow tips are not included in the regression.
Click on a branch of the tree to re-root the tree at that position.
&lt;a href=&quot;https://github.com/rambaut/figtree.js/&quot;&gt;The source code for this figure is available here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;methods&quot;&gt;Methods&lt;/h2&gt;

&lt;p&gt;To characterise this effect, we have used relaxed-clock models in BEAST to allow different rates of evolution for different parts of the tree.&lt;/p&gt;

&lt;figure style=&quot; width: 320;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/EBOV_Reference_Set_15_iqtree_highlighted.png&quot; alt=&quot;&quot; style=&quot;max-width: 320&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 2.&lt;/strong&gt; A maximum likelihood tree of the 15 EBOV genomes with the ‘slow’ clades highlighted.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Two lineages are identified as having lower than expected divergence by &lt;a href=&quot;https://doi.org/10.1016/S1473-3099(19)30118-5&quot;&gt;Mbala-Kingebeni et al. 2019b&lt;/a&gt; (see Figure 2):&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;One comprising the outbreak in 2017 in Likati and the on-going 2018-2019 outbreak in North Kivu Province – represented by the genomes &lt;em&gt;Muembe.1&lt;/em&gt; and &lt;em&gt;Ituri-18FHV090&lt;/em&gt;.&lt;/li&gt;
  &lt;li&gt;The other comprising the outbreak in 2014 in Lokolia and the 2018 outbreak in Équateur Province – represented by the genomes &lt;em&gt;Lomela-Lokolia16&lt;/em&gt; and &lt;em&gt;Tumba-BIK009&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, we used the local clock model, which allows us to specify which parts of the tree have different rates (although it does not specify which parts are fast and which are slow).
This allows us to assign a different rate of evolution to each of the two lineages described above (including the ‘stem’ branch leading to each clade).&lt;/p&gt;

&lt;p&gt;This model was used in &lt;a href=&quot;https://doi.org/10.1016/S1473-3099(19)30118-5&quot;&gt;Mbala-Kingebeni et al. (2019b)&lt;/a&gt; where these two lineages are labelled &lt;em&gt;clade a&lt;/em&gt; (“EBOV/Tum” &amp;amp; “EBOV/Lom”) and &lt;em&gt;clade c&lt;/em&gt; (“EBOV/Muy” &amp;amp; “EBOV/Itu”), respectively (Figure S8 of the Supplementary information). 
This paper shows that both these clades have a lower rate of evolution overall (Figure S8B).&lt;/p&gt;

&lt;p&gt;As a comparison we also ran the analysis with a strict molecular clock (which assumes a single rate over the whole tree) and a log-normal uncorrelated relaxed clock (which allows each branch to have a different rate, independently drawn from a log-normal distribution). 
We also ran a strict molecular clock analysis excluding the recent DRC outbreak genomes.&lt;/p&gt;

&lt;p&gt;For all of these analyses we constrained the tree topology so that all of the viruses sampled after the 1970s were monophyletic, to maintain a consistent rooting. 
This was the rooting suggested by a much earlier analysis (Dudas and Rambaut 2014).&lt;/p&gt;

&lt;p&gt;Analysis was done by partitioning the genomes into 1st, 2nd &amp;amp; 3rd codon positions for the concatenated protein-coding regions, plus a 4th partition comprising the concatenated intergenic regions. 
Each partition was given an HKY model with gamma-distributed rate variation among sites, and the parameters for each partition were unlinked.&lt;/p&gt;

&lt;div class=&quot;alert alert-success&quot; role=&quot;alert&quot;&gt;&lt;i class=&quot;fa fa-download fa-lg&quot;&gt;&lt;/i&gt; &lt;a href=&quot;/files/EBOV_local_clock_XMLs.zip&quot;&gt;XML files for all the analyses are available here.&lt;/a&gt;
&lt;/div&gt;

&lt;h2 id=&quot;results&quot;&gt;Results&lt;/h2&gt;

&lt;p&gt;For the local clock model (Figure 3), you can see the two lineages that have been allowed a different rate; both have a slower rate than the rest of the tree (i.e. the branches are coloured by rate, with blue meaning lower than average). 
This and all the subsequent trees are drawn on the same timescale to allow a comparison of the depth of the trees.&lt;/p&gt;

&lt;figure style=&quot; width: 100%;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/EBOV_Reference_Set_15_LC1.MCC.tree.png&quot; alt=&quot;&quot; style=&quot;max-width: 100%&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 3.&lt;/strong&gt; A local clock tree where two lineages, identified &lt;em&gt;a priori&lt;/em&gt;, are allowed to evolve at a different rate. The branches are coloured by rate with blue meaning lower than average, red higher. Green bars represent 95% credible intervals for the date of the node.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Secondly, for the relaxed clock tree (Figure 4), you can see that there is variation in rate across the tree (the clades of interest do, however, have the slowest rates). 
Note however that the age of the root of the tree is further back in time and the HPD bar spans nearly 4 decades. 
Essentially the lognormal distribution is struggling to adequately describe the variation in rates given the extreme outliers seen in Figure 3.&lt;/p&gt;

&lt;figure style=&quot; width: 100%;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/EBOV_Reference_Set_15_UCLN.MCC.tree.png&quot; alt=&quot;&quot; style=&quot;max-width: 100%&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 4.&lt;/strong&gt; The uncorrelated lognormal (UCLN) relaxed clock model. The rate for each branch is inferred independently with no &lt;em&gt;a priori&lt;/em&gt; structure imposed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, as a comparison, here is the strict molecular clock tree with the same rate over the whole tree (Figure 5). 
Once again, the root of the tree is much older than the local clock model and the relative branch lengths are very different.&lt;/p&gt;

&lt;figure style=&quot; width: 100%;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/EBOV_Reference_Set_15_SC.MCC.tree.png&quot; alt=&quot;&quot; style=&quot;max-width: 100%&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 5.&lt;/strong&gt; The strict molecular clock model with a single rate describing the whole tree.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;refining-the-model&quot;&gt;Refining the model&lt;/h2&gt;

&lt;p&gt;Looking at the relaxed clock tree in Figure 4, we notice that for the two clades of interest, the tip branches seem to have a higher rate than the stem branches (they are less blue and they are shorter than in the local clock model). 
This suggests another possibility — that it is not the whole clade that has a lower rate of evolution but just the branch leading to the common ancestor of the pair. 
This makes more sense if this is being produced by a process of latency (i.e., a switch between active replication and no replication). 
This would mean that, parsimoniously, there were just these two branches where the virus was latent for some period of time.
We would assume that an internal node in the tree represents active replication and epidemiological spread and thus the virus being in the non-latent state. 
Thus it is unlikely that a whole clade and stem exhibits latency (unless the propensity to latency increased on the stem lineage).&lt;/p&gt;

&lt;figure style=&quot; width: 320;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/EBOV_Reference_Set_15_iqtree_highlighted_stem.png&quot; alt=&quot;&quot; style=&quot;max-width: 320&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 6.&lt;/strong&gt; The two stem branches given a different rate of evolution in the refined local clock model (the rest of the tree is assumed to have the same rate, including the tip branches of the two clades identified in Figure 2).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To examine this we can set up a new local clock model in which just the internal stem branches are given a different rate of evolution, with the tip branches of the two clades having the same rate as the rest of the tree (Figure 6).&lt;/p&gt;

&lt;figure style=&quot; width: 100%;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/EBOV_Reference_Set_15_LC2.MCC.tree.png&quot; alt=&quot;&quot; style=&quot;max-width: 100%&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 7.&lt;/strong&gt; Stem branch only local clock model. Only the stem branches above the two clades of interest are allowed different rates of evolution.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In comparison with the clade-specific local clock (Figure 3), the most recent common ancestors of the Muembe.1/18FHV090 pair and Lokolia/Bikoro pair are much more recent. Other than that, the trees are very similar. The rates of evolution on the two stem lineages are even slower (more blue). We compare the actual values of these rates in Figure 9.&lt;/p&gt;

&lt;p&gt;Looking at the average rate of evolution over the whole tree (Figure 8) shows that the slow-down in the two lineages affects the strict clock to a much greater degree than the relaxed clocks.&lt;/p&gt;

&lt;figure style=&quot; width: 450px;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/Mean_rate_LC1_LC2_SC_UCLN.png&quot; alt=&quot;&quot; style=&quot;max-width: 450px&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 8.&lt;/strong&gt; Box-and-whisker plot of the mean rates of evolution across all four models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But if we look at the local clock models and compare the rates for the Likati/North Kivu and Lokolia/Équateur clades and the respective stem branches (Figure 9), we see the slow rates (the stem-only model gives an even slower rate for this one branch, supporting the idea that this is the branch that experienced some ‘latency’). Interestingly, the rates for the two stem branches are even lower than those of the clades and very similar to each other (whereas the rates for the clades differ because they include a mixture of fast and slow branches for different amounts of time).&lt;/p&gt;

&lt;figure style=&quot; width: 450px;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/Local_rates_LC1_LC2.png&quot; alt=&quot;&quot; style=&quot;max-width: 450px&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 9.&lt;/strong&gt; Box-and-whisker plot of the estimated rates for the two local clock model variants. The rate labelled ‘Tree’ is the rate for the rest of the tree (excluding the local clocks), then the rates for the Likati/North Kivu and Lokolia/Équateur lineages when the whole clade is included and then only the stem lineages.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally we ran the strict clock model on a data set where we omitted the four most recent DRC genomes that are involved in the apparent slow-down in rates (the last 4 sequences in Table 1). We compared this rate with the two local clock models for the rate of evolution estimated for the parts of the tree that are not included in the local clocks (the red branches in Figures 3 and 7).&lt;/p&gt;

&lt;figure style=&quot; width: 450px;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/Primary_rate_LC1_LC2_SC_SC11.png&quot; alt=&quot;&quot; style=&quot;max-width: 450px&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 10.&lt;/strong&gt; Box-and-whisker plot of the estimated rate for the tree (excluding the local clock rates for these models) in comparison to the strict clock rate and the rate for a strict clock on a data set that excludes the 4 recent DRC genomes (i.e., excluding the lineages that are exhibiting slow downs).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id=&quot;model-selection&quot;&gt;Model selection&lt;/h2&gt;

&lt;p&gt;BEAST implements a number of related approaches for comparing competing models (&lt;a href=&quot;/model_selection_1&quot;&gt;see here for some detailed instructions on applying these&lt;/a&gt;). These compute a marginal likelihood estimate - essentially a goodness-of-fit which takes into account the complexity of the models. The ratio of these provides a Bayes factor - a measure of the relative ‘plausibility’ of the models given the data.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;strong&gt;model&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;log MLE&lt;/strong&gt;&lt;/th&gt;
      &lt;th&gt;&lt;strong&gt;log Bayes factor&lt;/strong&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;strict clock&lt;/td&gt;
      &lt;td&gt;-33661.96&lt;/td&gt;
      &lt;td&gt;—&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;clade local clock&lt;/td&gt;
      &lt;td&gt;-33566.74&lt;/td&gt;
      &lt;td&gt;95.22 (vs. strict clock)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;stem local clock&lt;/td&gt;
      &lt;td&gt;-33563.86&lt;/td&gt;
      &lt;td&gt;2.88 (vs. clade local clock)&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;UCLN relaxed clock&lt;/td&gt;
      &lt;td&gt;-33539.62&lt;/td&gt;
      &lt;td&gt;24.24 (vs. stem local clock)&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt; Marginal likelihood estimates (MLE) and Bayes factors for the models discussed here. The 3rd column gives the difference between the log MLE of each model and that of the model in the row above, which is the log Bayes factor comparing the two.&lt;/p&gt;
&lt;/blockquote&gt;
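&lt;p&gt;The Bayes factor arithmetic in Table 2 is just differences of log MLEs; exponentiating a log Bayes factor recovers the Bayes factor itself:&lt;/p&gt;

```python
import math

# Log marginal likelihood estimates (MLEs) from Table 2.
log_mle = {
    "strict clock":       -33661.96,
    "clade local clock":  -33566.74,
    "stem local clock":   -33563.86,
    "UCLN relaxed clock": -33539.62,
}

# The log Bayes factor of model A over model B is log MLE(A) minus log MLE(B).
clade_vs_strict = log_mle["clade local clock"] - log_mle["strict clock"]
ucln_vs_stem = log_mle["UCLN relaxed clock"] - log_mle["stem local clock"]

print(f"clade local vs strict: log BF = {clade_vs_strict:.2f}")
print(f"UCLN vs stem local:    log BF = {ucln_vs_stem:.2f} "
      f"(BF = {math.exp(ucln_vs_stem):.3g})")
```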

&lt;p&gt;The uncorrelated lognormal relaxed clock (UCLN) is the best fitting model (by a good margin) as it clearly accommodates some of the ‘slow-down’ in the two stem branches but also other variation in rate across the tree (Figure 4). However, the time and rate estimates are very variable.&lt;/p&gt;

&lt;p&gt;The good fit of the UCLN model suggests there is random variation in rate across the tree as well as the specific ‘latency’ slow downs. So we constructed a model that is a mix of the stem local clock and the UCLN — this essentially states that the two stem branches have their own rate and the rates for the rest of the tree are drawn from the lognormal relaxed clock (Figure 11).&lt;/p&gt;

&lt;figure style=&quot; width: 100%;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/EBOV_Reference_Set_15_LC2+UCLN.MCC.png&quot; alt=&quot;&quot; style=&quot;max-width: 100%&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 11.&lt;/strong&gt; MCC tree constructed for the mix of the stem local clock model and the UCLN relaxed clock.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Overall, this tree is quite similar to the straight UCLN one (Figure 4) but with much tighter credible (HPD) intervals on the node ages, suggesting a better model fit (less of a struggle to fit competing patterns of rate variation). Indeed the log MLE for this model is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-33536.59&lt;/code&gt; giving a log Bayes factor of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3.03&lt;/code&gt; (more than 20-fold) over the UCLN model. The rates are comparable (Figure 12) but, as expected, the addition of the relaxed clock gives more variation in these.&lt;/p&gt;

&lt;figure style=&quot; width: 450px;&quot;&gt;&lt;img class=&quot;docimage&quot; src=&quot;images/news/Local_rates_LC2_LC2+UCLN.png&quot; alt=&quot;&quot; style=&quot;max-width: 450px&quot; /&gt;&lt;/figure&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Figure 12.&lt;/strong&gt; The rates for the tree and the two stem branches under the stem local clock model (left) and the stem local clock + relaxed clock model (right).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;final-points&quot;&gt;Final points&lt;/h3&gt;

&lt;p&gt;Although we forced the rooting of the tree to be the same for each model, it is likely that the strict clock model and the relaxed clock model would give a different rooting (and possibly rates) if the constraint was removed.&lt;/p&gt;

&lt;p&gt;Finally, we are developing an explicit model of latency which will act as a molecular clock model, infer the branches that have evidence of latency and estimate parameters of the process. More on this soon.&lt;/p&gt;

&lt;h3 id=&quot;references&quot;&gt;References&lt;/h3&gt;

&lt;blockquote&gt;
  &lt;p&gt;Baize, S. et al., 2014. Emergence of Zaire Ebola Virus Disease in Guinea. The New England journal of medicine, 371(15), pp.1418–1425.&lt;/p&gt;

  &lt;p&gt;Carroll, S.A. et al., 2013. Molecular Evolution of Viruses of the Family Filoviridae Based on 97 Whole-Genome Sequences. Journal of virology, 87(5), pp.2608–2616.&lt;/p&gt;

  &lt;p&gt;Diallo, B. et al., 2016. Resurgence of Ebola Virus Disease in Guinea Linked to a Survivor With Virus Persistence in Seminal Fluid for More Than 500 Days. Clinical infectious diseases: an official publication of the Infectious Diseases Society of America, 63(10), pp.1353–1356.&lt;/p&gt;

  &lt;p&gt;Grard, G. et al., 2011. Emergence of divergent Zaire ebola virus strains in Democratic Republic of the Congo in 2007 and 2008. The Journal of infectious diseases, 204 Suppl 3, pp.S776–84.&lt;/p&gt;

  &lt;p&gt;Lam, T.T.-Y. et al., 2015. Puzzling origins of the Ebola outbreak in the Democratic Republic of the Congo, 2014. Journal of virology, pp.JVI.01226–15. &lt;a href=&quot;https://doi.org/10.1128/JVI.01226-15&quot;&gt;https://doi.org/10.1128/JVI.01226-15&lt;/a&gt;&lt;/p&gt;

  &lt;p&gt;Maganga, G.D. et al., 2014. Ebola Virus Disease in the Democratic Republic of Congo. The New England journal of medicine, 371(22), pp.2083–2091.&lt;/p&gt;

  &lt;p&gt;Mbala-Kingebeni, P., Pratt, C.B., et al., 2019. 2018 Ebola virus disease outbreak in Équateur Province, Democratic Republic of the Congo: a retrospective genomic characterisation. The Lancet infectious diseases. &lt;a href=&quot;http://dx.doi.org/10.1016/S1473-3099(19)30124-0&quot;&gt;http://dx.doi.org/10.1016/S1473-3099(19)30124-0&lt;/a&gt;.&lt;/p&gt;

  &lt;p&gt;Mbala-Kingebeni, P., Aziza, A., et al., 2019. Medical countermeasures during the 2018 Ebola virus disease outbreak in the North Kivu and Ituri Provinces of the Democratic Republic of the Congo: a rapid genomic assessment. The Lancet infectious diseases. &lt;a href=&quot;http://dx.doi.org/10.1016/S1473-3099(19)30118-5&quot;&gt;http://dx.doi.org/10.1016/S1473-3099(19)30118-5&lt;/a&gt;.&lt;/p&gt;

  &lt;p&gt;Nsio, J. et al., 2019. 2017 Outbreak of Ebola Virus Disease in Northern Democratic Republic of Congo. The Journal of infectious diseases. &lt;a href=&quot;https://doi.org/10.1093/infdis/jiz107&quot;&gt;https://doi.org/10.1093/infdis/jiz107&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
</description>
            <pubDate>Thu, 16 May 2019 00:00:00 +0000</pubDate>
            <link>http://github.com/beast-dev/ebov_local_clocks.html</link>
            <guid isPermaLink="true">http://github.com/beast-dev/ebov_local_clocks.html</guid>
            
            <category>article</category>
            
            
        </item>
        
        <item>
            <title>Measuring BEAST performance</title>
            <description>&lt;p&gt;When running BEAST it reports the time taken to calculate a certain number of states (e.g., minutes/million states). It is obviously tempting to compare this time between runs as a measure of performance.
However, unless you are testing the performance of the &lt;em&gt;same XML file&lt;/em&gt; on different hardware or for different parallelization options, this will never be a reliable measure and may lead you astray.&lt;/p&gt;

&lt;p&gt;The MCMC algorithm in BEAST picks operators (transition kernels or ‘moves’) from the list of potential operators proportional
to their given &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight&lt;/code&gt;. 
Some operators change a single parameter value, some change multiple parameters and others will alter the tree. 
BEAST tries to only recalculate the likelihood of the new state for the bits of the state that have changed. 
Thus some operators will only produce a modest amount of recomputation (e.g., changing a bit of the tree may only require the likelihood at a few nodes to be recalculated) whereas others will require a lot of computation (e.g., changing the evolutionary rate will require the recalculation of absolutely everything). 
Thus if the computationally heavy operators are given more &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight&lt;/code&gt; then the average time per operation over the course of the chain will go up. 
But this is not necessarily a bad thing.&lt;/p&gt;
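&lt;p&gt;The weight-proportional operator selection described above can be sketched in a few lines (the operator names and weights here are illustrative, not BEAST’s actual defaults):&lt;/p&gt;

```python
import random

# Sketch of weight-proportional operator selection, as described above.
# The operator names and weights here are illustrative, not BEAST defaults.
operators = {
    "scale(kappa)": 1.0,
    "scale(constant.popSize)": 3.0,
    "subtreeSlide(treeModel)": 15.0,
    "uniform(nodeHeights(treeModel))": 30.0,
}

def pick_operator(ops, rng):
    """Choose an operator with probability proportional to its weight."""
    names = list(ops)
    return rng.choices(names, weights=[ops[n] for n in names], k=1)[0]

rng = random.Random(42)
counts = {name: 0 for name in operators}
for _ in range(100_000):
    counts[pick_operator(operators, rng)] += 1
# Heavily weighted operators (here, the tree moves) dominate the move count,
# which is why their per-move cost dominates the total runtime.
```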

&lt;div class=&quot;alert alert-info&quot; role=&quot;alert&quot;&gt;&lt;i class=&quot;fa fa-info-circle&quot;&gt;&lt;/i&gt; &lt;b&gt;Note:&lt;/b&gt; This posting is primarily about improving the statistical performance of BEAST irrespective of the hardware being used. For a discussion of improving the computational performance on various types of hardware, &lt;a href=&quot;performance&quot;&gt;see this page&lt;/a&gt;.&lt;/div&gt;

&lt;h3 id=&quot;efficient-sampling-and-esss&quot;&gt;Efficient sampling and ESSs&lt;/h3&gt;

&lt;p&gt;The ultimate aim of an MCMC analysis is to get as many effectively independent samples from the posterior as possible (as measured by effective sample size, ESS). 
Ideally, we would aim to get the same ESS for all parameters in the model but we are often less interested in some parameters than others and we could allow a lower ESS for those. 
A high ESS is more important the more we are interested in the tails of the distribution of a parameter. 
So some parameters, in particular those that are part of the substitution model, such as the transition-transversion ratio &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kappa&lt;/code&gt;, are down-weighted. 
Changing these is computationally expensive, requiring a complete recalculation of the likelihood for the partition, but we are rarely interested in the value. 
We simply want to marginalize our other parameters over their distributions. 
So we can accept a lower ESS for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kappa&lt;/code&gt; as the cost of focusing on other parameters.&lt;/p&gt;
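&lt;p&gt;To make the ESS concrete, here is a crude version of the standard autocorrelation-based estimator (the number of samples divided by the integrated autocorrelation time); this is a sketch of the idea rather than the exact algorithm Tracer uses:&lt;/p&gt;

```python
import random

def ess(samples):
    """Crude effective sample size: n / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first non-positive autocorrelation."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    tau = 1.0  # integrated autocorrelation time
    for lag in range(1, n):
        cov = sum((samples[i] - mean) * (samples[i + lag] - mean)
                  for i in range(n - lag)) / n
        rho = cov / var
        if rho > 0.0:
            tau += 2.0 * rho
        else:
            break
    return n / tau

random.seed(1)
iid = [random.gauss(0, 1) for _ in range(2000)]   # independent draws
chain = [0.0]                                     # autocorrelated chain
for _ in range(1999):
    chain.append(0.9 * chain[-1] + random.gauss(0, 1))

# The autocorrelated chain yields far fewer effective samples than the
# same number of independent draws.
print(round(ess(iid)), round(ess(chain)))
```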

&lt;div class=&quot;alert alert-info&quot; role=&quot;alert&quot;&gt;&lt;i class=&quot;fa fa-info-circle&quot;&gt;&lt;/i&gt; &lt;b&gt;Note:&lt;/b&gt; In most cases, substitution model parameters easily achieve high ESS values, which is why they are typically updated less often than for example clock and coalescent model parameters.&lt;/div&gt;

&lt;p&gt;To demonstrate this we can look at an example BEAST run. 
This is a data set of 62 carnivore mitochondrial genome coding sequences giving a total of about 5,500 unique site patterns. 
The model was an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;HKY+gamma&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;strict molecular clock&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;constant size coalescent&lt;/code&gt; (&lt;a href=&quot;/files/carnivores.HKYG.SC.CPC.classic.xml.zip&quot;&gt;the XML file is available here&lt;/a&gt;).
The data was run on &lt;a href=&quot;installing&quot;&gt;BEAST v1.10.4&lt;/a&gt; on a Dell server for 10M steps, for a total run time of &lt;strong&gt;4.08 hours&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Data: Carnivores mtDNA 62 taxa, 10869bp, 5565 unique site patterns&lt;/p&gt;

  &lt;p&gt;Model: HKY+G, Strict clock, Constant size coalescent&lt;/p&gt;

  &lt;p&gt;Machine: Dell Precision 3.10GHz Intel Xeon CPU E5-2687&lt;/p&gt;

  &lt;p&gt;XML file: &lt;a href=&quot;/files/carnivores.HKYG.SC.CPC.classic.xml.zip&quot;&gt;carnivores.HKYG.SC.CPC.classic.xml&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can see the effect of different operators by looking at the operator table reported at the end of the run:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Table 1&lt;/strong&gt;&lt;/p&gt;
  &lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Operator                                          Tuning  Count      Time     Time/Op  Pr(accept) 
scale(kappa)                                      0.913   92847      548576   5.91     0.2322      
frequencies                                       0.01    92528      547586   5.92     0.2355      
scale(alpha)                                      0.939   92835      548560   5.91     0.2317      
scale(nodeHeights(treeModel))                     0.927   277983     1651622  5.94     0.2329      
subtreeSlide(treeModel)                           0.013   2778288    2415749  0.87     0.2315      
Narrow Exchange(treeModel)                                2778497    1936405  0.7      0.0091      
Wide Exchange(treeModel)                                  277044     206821   0.75     0.0002      
wilsonBalding(treeModel)                                  277574     355560   1.28     0.0002      
scale(treeModel.rootHeight)                       0.262   277733     72405    0.26     0.2391      
uniform(nodeHeights(treeModel))                           2777648    2650767  0.95     0.1207      
scale(constant.popSize)                           0.474   277023     2697     0.01     0.2375      
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;  &lt;/div&gt;
&lt;/blockquote&gt;

&lt;p&gt;The operators on the substitution model (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kappa&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;frequencies&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;alpha&lt;/code&gt;) are amongst the most computationally expensive, taking on average 5.9 milliseconds per operation. 
Although they are not selected very often relative to the others (only about 3% of the time), they contribute over 15% of the total runtime.
On the other hand, the 7 operators that alter the tree generally have a low cost (about 1 millisecond per operation on average) but make up 85% of the total runtime because they are picked 95% of the time. 
The population size parameter is very cheap so comprises a tiny fraction of runtime even though it is called quite a lot.&lt;/p&gt;
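&lt;p&gt;These percentages can be recomputed directly from the Count and Time columns of Table 1:&lt;/p&gt;

```python
# Recomputing the summary figures quoted above from Table 1
# (operator name: (count, total time in ms)).
table = {
    "scale(kappa)":                    (92847,   548576),
    "frequencies":                     (92528,   547586),
    "scale(alpha)":                    (92835,   548560),
    "scale(nodeHeights(treeModel))":   (277983,  1651622),
    "subtreeSlide(treeModel)":         (2778288, 2415749),
    "Narrow Exchange(treeModel)":      (2778497, 1936405),
    "Wide Exchange(treeModel)":        (277044,  206821),
    "wilsonBalding(treeModel)":        (277574,  355560),
    "scale(treeModel.rootHeight)":     (277733,  72405),
    "uniform(nodeHeights(treeModel))": (2777648, 2650767),
    "scale(constant.popSize)":         (277023,  2697),
}
subst = ["scale(kappa)", "frequencies", "scale(alpha)"]
total_count = sum(c for c, t in table.values())
total_time = sum(t for c, t in table.values())
subst_count = sum(table[k][0] for k in subst)
subst_time = sum(table[k][1] for k in subst)
print(f"substitution ops: {100 * subst_count / total_count:.1f}% of moves, "
      f"{100 * subst_time / total_time:.1f}% of runtime")
```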

&lt;p&gt;By default the operators on each of the substitution model parameters have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight&lt;/code&gt; of 1, the sum of all tree operators has a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight&lt;/code&gt; of 102 and the population size operator has a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight&lt;/code&gt; of 3 (see the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;operators&lt;/code&gt; panel in BEAUti for the weights for each operator).&lt;/p&gt;

&lt;h3 id=&quot;mixing&quot;&gt;Mixing&lt;/h3&gt;

&lt;p&gt;A further complication is that different choices of operators, priors, etc., can affect the efficiency of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;mixing&lt;/code&gt; of the MCMC (how fast it converges and explores the parameter space). 
Better mixing is reflected in a higher ESS, perhaps even at the cost of more computation per step; what matters is that the gain in ESS is proportionally higher than the computational cost.&lt;/p&gt;

&lt;p&gt;If we load the resulting log file into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Tracer&lt;/code&gt; we can calculate the ESS for these parameters (with a 10% burnin removed):&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Table 2&lt;/strong&gt;&lt;/p&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;Parameter&lt;/th&gt;
        &lt;th style=&quot;text-align: right&quot;&gt;mean value&lt;/th&gt;
        &lt;th style=&quot;text-align: right&quot;&gt;ESS&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;kappa&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;27.18&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1503&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;frequencies1&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;0.390&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1954&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;frequencies2&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;0.305&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;2590&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;frequencies3&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;0.082&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;2927&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;frequencies4&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;0.223&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;2286&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;alpha&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;0.235&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;4355&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;constant.popSize&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1.997&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;8617&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeModel.rootHeight&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;0.506&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1790&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeLength&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;8.146&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1620&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeLikelihood&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;-1.93E5&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;3944&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;You can see that all of the ESSs are quite high. The two parameters that relate to the tree, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;treeModel.rootHeight&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;treeLength&lt;/code&gt; (the sum of all the branch lengths - not technically a parameter but a metric), show ESSs of &amp;gt;1000.
These values are not necessarily indicative of how well the tree is mixing overall, so we can also look at the ‘ESS’ for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;likelihood&lt;/code&gt; (the likelihood of the data given the tree). 
This is a probability density, not a parameter, but looking at how (un)correlated its values are gives another indication of how well the tree has been mixing.&lt;/p&gt;

&lt;p&gt;The ESSs for the substitution model parameters are high, suggesting that we could afford to down-weight their operators to reduce their contribution to the total runtime.&lt;/p&gt;

&lt;h3 id=&quot;optimising-efficiency&quot;&gt;Optimising efficiency&lt;/h3&gt;

&lt;p&gt;To measure the overall efficiency of BEAST – i.e., the number of independent samples being generated per unit time (or per kWh of electricity) – it is probably best to consider ESS/hour for the parameters of interest.&lt;/p&gt;
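&lt;p&gt;The calculation itself is trivial. The sketch below is our own illustration: the total runtime of about 4.08 hours is inferred from the tables rather than stated in the text, but it reproduces the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kappa&lt;/code&gt; row of the table that follows:&lt;/p&gt;

```python
def ess_per_hour(ess, runtime_hours):
    """Independent samples generated per hour of compute:
    the efficiency measure used throughout this post."""
    return ess / runtime_hours

# Hypothetical check, assuming a total runtime of about 4.08 hours.
print(round(ess_per_hour(1503, 4.08)))  # 368
```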

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Table 3&lt;/strong&gt;&lt;/p&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;Parameter&lt;/th&gt;
        &lt;th style=&quot;text-align: right&quot;&gt;ESS&lt;/th&gt;
        &lt;th style=&quot;text-align: right&quot;&gt;ESS/hour&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;kappa&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1503&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;368&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;constant.popSize&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;8617&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;2109&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeModel.rootHeight&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1790&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;438&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeLength&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1620&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;396&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeLikelihood&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;3944&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;965&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;Focusing on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kappa&lt;/code&gt; as representative of the substitution model, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rootHeight&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;treeLength&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;treeLikelihood&lt;/code&gt; to represent the tree, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;constant.popSize&lt;/code&gt; the coalescent model, we can calculate the ESS/hour for the above run.&lt;/p&gt;

&lt;p&gt;If we reduce the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight&lt;/code&gt; of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kappa&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;alpha&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;frequencies&lt;/code&gt; operators by a factor of 10 (this can be done in BEAUti’s operator table or by editing the XML), the total runtime goes down to &lt;strong&gt;3.73 hours&lt;/strong&gt; – about a 10% saving.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Which is nice.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The ESSs for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kappa&lt;/code&gt; (and the other parameters with down-weighted operators) predictably go down but are still reasonable:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Table 4&lt;/strong&gt;&lt;/p&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;Parameter&lt;/th&gt;
        &lt;th style=&quot;text-align: right&quot;&gt;ESS&lt;/th&gt;
        &lt;th style=&quot;text-align: right&quot;&gt;ESS/hour&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;kappa&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;515&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;138&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;constant.popSize&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;9001&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;2414&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeModel.rootHeight&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1090&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;292&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeLength&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;922&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;247&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeLikelihood&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;2793&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;749&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;Note that the ESSs for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;treeModel.rootHeight&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;treeLength&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;treeLikelihood&lt;/code&gt; have also gone down (though not to as great a degree as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kappa&lt;/code&gt;) and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;constant.popSize&lt;/code&gt; has actually gone up in ESS (to the maximum, where every sample is effectively independent). 
So by down-weighting the substitution model operators we have reduced the ESS/hour across the board (with the exception of the coalescent prior).
It is still possible that the tree topology is mixing better but we aren’t measuring that directly.&lt;/p&gt;

&lt;p&gt;We could look at reducing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;weight&lt;/code&gt; of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;constant.popSize&lt;/code&gt; operator by a factor of 3 (returning the substitution model operators back to their original weights). 
The total run time goes up to &lt;strong&gt;4.16 hours&lt;/strong&gt; because we are doing fewer cheap moves and more expensive ones – but the ESS/hour for all the other parameters goes up:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Table 5&lt;/strong&gt;&lt;/p&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;Parameter&lt;/th&gt;
        &lt;th style=&quot;text-align: right&quot;&gt;ESS&lt;/th&gt;
        &lt;th style=&quot;text-align: right&quot;&gt;ESS/hour&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;kappa&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1708&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;410&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;constant.popSize&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;6992&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1679&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeModel.rootHeight&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1751&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;421&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeLength&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1812&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;435&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeLikelihood&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;4219&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1013&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/blockquote&gt;
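&lt;p&gt;The runtime effect here is just a weighted average: operators are picked with probability proportional to their weight, so the expected cost of one MCMC step is the weight-weighted mean of the per-operator costs. A minimal sketch with invented relative costs (the real per-operator costs are not reported in this post):&lt;/p&gt;

```python
def expected_cost(weights, costs):
    """Expected computational cost of one MCMC step when each operator
    is picked with probability proportional to its weight."""
    total_weight = sum(weights.values())
    return sum(weights[op] * costs[op] for op in weights) / total_weight

# Invented relative costs: a population-size move touches only the
# coalescent prior (cheap), a tree move recomputes the likelihood (dear).
costs = {"popSize": 1.0, "treeMove": 100.0}

before = expected_cost({"popSize": 30, "treeMove": 30}, costs)
after = expected_cost({"popSize": 10, "treeMove": 30}, costs)
# Down-weighting the cheap operator raises the expected cost per step,
# which is why the total runtime went up.
```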

&lt;h3 id=&quot;operator-acceptance-rates&quot;&gt;Operator acceptance rates&lt;/h3&gt;

&lt;p&gt;One other thing to note here is the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Pr(accept)&lt;/code&gt; column in the operator analysis, &lt;strong&gt;Table 1&lt;/strong&gt;, above. 
This records how often a proposed operation is actually accepted according to the Metropolis-Hastings algorithm. 
A rule of thumb is that a move should be accepted about 23% of the time to be optimally efficient (this is an analytical result for certain continuous moves but we assume it also approximately applies for tree moves). 
Operators are generally ‘tuned’ to achieve this ratio by adjusting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; of the move (how big a change is made to the parameter – big moves will be accepted less often than small ones). 
Some moves (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Narrow Exchange&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Wide Exchange&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;WilsonBalding&lt;/code&gt;) are not tunable and you can see they have a very small acceptance probability. 
This means they are inefficient at exploring tree-space yet consume considerable computational time. 
On the other hand, they may be important for initial convergence, when large moves are favoured.&lt;/p&gt;

&lt;p&gt;We can try reweighting these operators down by a factor of 10 and see the effect.&lt;/p&gt;

&lt;p&gt;Firstly the total runtime is &lt;strong&gt;4.33 hours&lt;/strong&gt; – more than 6% slower than our original run. 
However, if we look at the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ESS&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ESS/hour&lt;/code&gt; values:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;&lt;strong&gt;Table 6&lt;/strong&gt;&lt;/p&gt;

  &lt;table&gt;
    &lt;thead&gt;
      &lt;tr&gt;
        &lt;th&gt;Parameter&lt;/th&gt;
        &lt;th style=&quot;text-align: right&quot;&gt;ESS&lt;/th&gt;
        &lt;th style=&quot;text-align: right&quot;&gt;ESS/hour&lt;/th&gt;
      &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
      &lt;tr&gt;
        &lt;td&gt;kappa&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;2455&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;567&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;constant.popSize&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;7138&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1650&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeModel.rootHeight&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;2586&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;598&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeLength&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;2719&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;628&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
        &lt;td&gt;treeLikelihood&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;4873&lt;/td&gt;
        &lt;td style=&quot;text-align: right&quot;&gt;1126&lt;/td&gt;
      &lt;/tr&gt;
    &lt;/tbody&gt;
  &lt;/table&gt;
&lt;/blockquote&gt;

&lt;p&gt;We are generally doing much better than before, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ESS/hour&lt;/code&gt; up over the previous runs (the only loser is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;constant.popSize&lt;/code&gt;, but it is still higher than all the others).&lt;/p&gt;

&lt;h3 id=&quot;concluding-remarks&quot;&gt;Concluding remarks&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;Don’t use time/sample as a comparative measure of performance for different data or sampling regimes.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;A better measure of BEAST performance than the average time per million steps is the number of effectively independent samples generated per unit time (i.e., ESS/hour). 
In the example above, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;treeLength&lt;/code&gt; measure goes from 396 independent values per hour to 628, an increase of nearly 60%.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Choosing operator weights to achieve better performance (as ESS/hour) is a difficult balancing act and may need multiple runs and examination of operator analyses and ESSs. 
It is usually better to be conservative about these and to worry more about getting statistically correct results than about saving a few hours of runtime.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;Because of the stochastic nature of the algorithm, BEAST can vary from run to run both in total runtime (because of variability in which operators are picked and their computational cost) and in the ESS of parameters. 
The run time will also depend on what else the computer is doing at the same time (these results were done on a many core machine with nothing else of significance running).&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;The optimal weights for operators will also vary considerably by data set meaning it is difficult to come up with reliable rules.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;We are currently working on improving the operators and weights to achieve a reliable increase in statistical performance. More on this soon …&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;alert alert-info&quot; role=&quot;alert&quot;&gt;&lt;i class=&quot;fa fa-info-circle&quot;&gt;&lt;/i&gt; &lt;b&gt;Note:&lt;/b&gt; The operator weights that BEAUti generates by default are intended to be robust (we want to try to ensure convergence) and may not be optimal in all circumstances. Adjustment of these might achieve significant improvements in ESS/hour but caution should be exercised and the results examined closely to ensure that convergence has been achieved. As always we strongly recommend that at least 2 replicate runs are performed and the results compared.&lt;/div&gt;

</description>
            <pubDate>Sat, 17 Nov 2018 00:00:00 +0000</pubDate>
            <link>http://github.com/beast-dev/measuring-beast-performance.html</link>
            <guid isPermaLink="true">http://github.com/beast-dev/measuring-beast-performance.html</guid>
            
            <category>article</category>
            
            
        </item>
        
        <item>
            <title>BEAST v1.10.4 released</title>
            <description>&lt;h3 id=&quot;we-are-pleased-to-announce-the-release-of-beast-v1104&quot;&gt;We are pleased to announce the release of BEAST v1.10.4&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BEAST v1.10.4&lt;/code&gt; fixes a bug when trying to specify a burnin on the command line version of LogCombiner. It also introduces two new command line options specific to BEAGLE v3.1 (-beagle_threading_off and -beagle_thread_count).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;installing&quot;&gt;Download BEAST v1.10.4 binaries for Mac, Windows and UNIX/Linux&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>Wed, 14 Nov 2018 00:00:00 +0000</pubDate>
            <link>http://github.com/beast-dev/2018-11-14_BEAST_v1.10.4_released.html</link>
            <guid isPermaLink="true">http://github.com/beast-dev/2018-11-14_BEAST_v1.10.4_released.html</guid>
            
            <category>news</category>
            
            
        </item>
        
        <item>
            <title>BEAST v1.10.3 released</title>
            <description>&lt;h3 id=&quot;we-are-pleased-to-announce-the-release-of-beast-v1103&quot;&gt;We are pleased to announce the release of BEAST v1.10.3&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BEAST v1.10.3&lt;/code&gt; fixes an important bug where performance was degraded when using BEAGLE 3 on CPUs (compared with using BEAGLE 2 or using GPUs).&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;installing&quot;&gt;Download BEAST v1.10.3 binaries for Mac, Windows and UNIX/Linux&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>Sun, 28 Oct 2018 00:00:00 +0000</pubDate>
            <link>http://github.com/beast-dev/2018-10-28_BEAST_v1.10.3_released.html</link>
            <guid isPermaLink="true">http://github.com/beast-dev/2018-10-28_BEAST_v1.10.3_released.html</guid>
            
            <category>news</category>
            
            
        </item>
        
        <item>
            <title>BEAST v1.10.0 released</title>
            <description>&lt;h3 id=&quot;we-are-pleased-to-announce-the-release-of-beast-v110&quot;&gt;We are pleased to announce the release of BEAST v1.10&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BEAST v1.10.0&lt;/code&gt; is a major new version with many new features which focus on flexibility of model specification, 
integration of different data sources, and increasing the speed and efficiency of sampling.&lt;/p&gt;

&lt;p&gt;The new version coincides with the publication of a paper describing many of the new features:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Suchard MA, Lemey P, Baele G, Ayres DL, Drummond AJ &amp;amp; Rambaut A (2018) Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10 &lt;em&gt;Virus Evolution&lt;/em&gt; &lt;strong&gt;4&lt;/strong&gt;, vey016. &lt;a href=&quot;https://doi.org/10.1093/ve/vey016&quot;&gt;DOI:10.1093/ve/vey016&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href=&quot;installing&quot;&gt;Download BEAST v1.10.0 binaries for Mac, Windows and UNIX/Linux&lt;/a&gt;&lt;/p&gt;

</description>
            <pubDate>Sun, 10 Jun 2018 00:00:00 +0000</pubDate>
            <link>http://github.com/beast-dev/2018-06-10_BEAST_v1.10.0_released.html</link>
            <guid isPermaLink="true">http://github.com/beast-dev/2018-06-10_BEAST_v1.10.0_released.html</guid>
            
            <category>news</category>
            
            
        </item>
        
        <item>
            <title>BEAST v1.8.4 released</title>
            <description>&lt;p&gt;BEAST v1.8.4 has been released:&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;installing.html&quot;&gt;Download BEAST v1.8.4 binaries for Mac, Windows and UNIX/Linux&lt;/a&gt;&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;Version 1.8.4 released 17th June 2016
New Features:

    New structured list of citations printed to screen before running.
    Option (&apos;-citation_file&apos;) to write citation list to file.
    Option in BEAUti Priors panel to set parameters to &apos;Fixed Value&apos;

Bug Fixes:

    Issue 808: Set autoOptimize to false in the randomWalkOperator on 
               Pagel&apos;s lambda
    Issue 806: SRD06 in BEAUTi selecting incorrect options.
    Issue 799: Relative rate parameters for partitions were not being 
               created. All partitions within a clock model have a 
               relative rate if their substitution models are unlinked.
    Issue 798: Calculating pairwise distances was slow for big data sets -
               removed this (but initial values no longer suggested based
               on data).
    Issue 797: Removed &apos;meanRate&apos; from Priors tab in BEAUti.
    Issue 794: Running with empty command line causes error.
    Issue 792: Check to see that the same likelihood isn&apos;t included multiple
               times into the density.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

</description>
            <pubDate>Fri, 17 Jun 2016 00:00:00 +0000</pubDate>
            <link>http://github.com/beast-dev/2016-06-17_BEAST_v1.8.4_released.html</link>
            <guid isPermaLink="true">http://github.com/beast-dev/2016-06-17_BEAST_v1.8.4_released.html</guid>
            
            <category>news</category>
            
            
        </item>
        
    </channel>
</rss>
