Bayesian Phylogeographic Clustering


Given a section of the DNA sequences of several individuals of the same species, along with their geographical locations, one question biologists ask is how to split the data into clusters according to geographical distribution so that the results are consistent with the genetic history. Assuming that mutations occur as a Markov process, and considering appropriate models for the demographic distribution, the question reduces to a clustering problem subject to the constraints imposed by the genetic history.

We use a migration model defined on the coalescent tree and project it onto the haplotype tree in order to define phylogeographic clusters. Combining this with an underlying model for the geographical distribution of populations and an appropriate evolutionary model makes inference on the joint posterior distribution possible, avoiding stepwise conditioning. Inferences on the parameters are drawn using Markov chain Monte Carlo samplers.
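As a generic illustration of what a Markov chain Monte Carlo sampler does, here is a minimal random-walk Metropolis sketch in Python. It is not the sampler used in this work: the target density (a normal with mean 2 and standard deviation 1), the function names, the step size and the seed are all assumptions chosen to keep the example small.

```python
import math
import random


def log_target(theta: float) -> float:
    """Unnormalised log density of a toy target (assumed): Normal(mean=2, sd=1)."""
    return -0.5 * (theta - 2.0) ** 2


def metropolis(n_iter: int, step: float, start: float, seed: int) -> list[float]:
    """Random-walk Metropolis: propose theta + Normal(0, step), accept or reject."""
    rng = random.Random(seed)
    theta = start
    current = log_target(theta)
    draws = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(theta)).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        # The chain records the current state whether or not the move was accepted.
        draws.append(theta)
    return draws


draws = metropolis(n_iter=20000, step=1.0, start=0.0, seed=1)
kept = draws[2000:]  # discard an initial burn-in period
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 2))
```

In a real analysis the scalar `theta` would be replaced by the full set of unknowns (tree, cluster labels, model parameters), and the proposal would have to respect the constraints imposed by the genetic history.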

I have been working on a number of different datasets, and the output for one of them is shown here. The data come from beetles on the island of La Palma (in the Canary Islands), for which the geographical location and DNA sequence of each individual were recorded. Below is a cladogram, i.e. a network in which nodes represent sequences and two nodes are connected when the two sequences are one mutation apart, both in terms of their sequences and in terms of their history. The distinction matters because two sequences may differ by a single mutation even though the mutational steps took a different route to reach that result.

Coloured cladogram

The colours represent the different clusters, which correspond to the geographical clusters shown below in the contour plot of each cluster's geographical distribution. The size of each node is proportional to the number of observed individuals with that sequence, and black nodes represent unobserved sequences. In the contour plot, the large dot on the right represents the most likely ancestral location, indicating that the island was colonised from the east, which agrees with historical evidence.

Contour plot of the populations
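The sequence-level half of the cladogram's adjacency rule (two sequences being one mutation apart) can be sketched in a few lines of Python. This is only an illustration: the haplotype labels and sequences are hypothetical toy data, not the La Palma beetles, and the sketch deliberately ignores the second condition, that the two sequences must also be one step apart in their history.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of sites at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def one_step_edges(haplotypes: dict[str, str]) -> list[tuple[str, str]]:
    """Pairs of haplotype labels whose sequences differ at exactly one site."""
    return [
        (u, v)
        for u, v in combinations(sorted(haplotypes), 2)
        if hamming(haplotypes[u], haplotypes[v]) == 1
    ]


# Hypothetical toy haplotypes, for illustration only.
toy = {"h1": "ACGT", "h2": "ACGA", "h3": "ACTA", "h4": "GCTA"}
edges = one_step_edges(toy)
print(edges)  # [('h1', 'h2'), ('h2', 'h3'), ('h3', 'h4')]
```

A pair listed here is only a candidate edge. Whether it appears in the cladogram also depends on the inferred mutational history, which this sketch does not model.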

Software

Please refer to my software page for an R package implementation of our method.

Work in Progress

Papers and presentations