Install and Load Required Packages

You only need to install the packages once. Skip these lines if they’re already installed.

l1ou package: You will have to download l1ou from GitHub. To do so: 1. install the R package "devtools"; 2. run library(devtools); 3. run install_github("khabbazian/l1ou"). For more tips on installing l1ou (especially for Windows users), check: https://github.com/khabbazian/l1ou

#install.packages("devtools")
#library(devtools)
#install_github("khabbazian/l1ou")

The same steps apply to installing “bayou”, which can also be downloaded from GitHub.

#install.packages("devtools")
#library(devtools)
#install_github("uyedaj/bayou")
#install.packages("phytools")
#install.packages("geiger")
#install.packages("ggplot2")
#install.packages("OUwie")
library(phytools)
library(geiger)
library(ggplot2)
library(OUwie)
library(l1ou)
library(bayou)

Tutorial Overview

In this tutorial, we will:

  1. Explore OU methods that use regimes defined a priori
  2. Explore OU methods that do not require regimes defined a priori

1. OU methods using pre-defined regimes

To run OUwie, we need to define hypotheses about the ecological or environmental regimes under which we expect trait evolution to follow different trajectories. To do this, we’ll use stochastic character mapping. As my friend Gustavo Burin puts it: “It’s impossible to reconstruct the true evolutionary history of a trait across a clade unless you have a time machine”. That being the case, what we can do is reconstruct multiple plausible histories based on an assumed model, and evaluate the probabilities associated with each of them.

We’ll use Stochastic Character Mapping (introduced briefly in the previous tutorial) implemented in the R package phytools. As before, we’ll run this method using the phylogeny and habitat data from the vipers dataset. Make sure your R working directory is set to the folder containing the files.

#Load and take a look at the phylogeny
vipers.tree <- read.tree("vipers_tree.tre")
plot(vipers.tree, cex = 0.1)
#Load body size and habitat data for the viper dataset:
bs.hab.vipers <- read.csv2("vipers_habitat_bodysize_mod.csv", row.names = 1)
head(bs.hab.vipers)

Now, let’s perform Stochastic Character Reconstructions:

set.seed(123)

hab.vipers.vec <- setNames(bs.hab.vipers$habitat, bs.hab.vipers$species)

simmap.vipers <- make.simmap(vipers.tree, hab.vipers.vec, model = "ARD", Q = "empirical", nsim = 10) #This will take ~2min to run.

Remember: Why this setup?
- model = "ARD": "All Rates Different"; allows all transitions between habitat states to be freely estimated.
- Q = "empirical": uses the transition matrix inferred from the observed species data to guide the mapping.
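If you'd like to sanity-check the transition model itself, the fitted rate matrix is stored in each simmap object. This is a quick aside, not part of the original workflow; check str(simmap.vipers[[1]]) if these elements differ in your phytools version.

simmap.vipers[[1]]$Q #The transition rate matrix used for the mappings
simmap.vipers[[1]]$logL #Log-likelihood of the fitted transition model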

Let’s take a look at the 10 simmaps generated. We will use the same colors as in Alencar et al. (2017).

colors <- c("#ffa500", "#a52a2a", "#0000ff", "#008b00")

par(mfrow = c(4, 3))

for(i in 1:10){
  plotSimmap(simmap.vipers[[i]], colors = setNames(colors, 1:4), type = "fan", fsize = 0.01, lwd = 2, hold = FALSE)
}

Each color represents a different habitat category. Unlike what I showed in class, here we have four habitat categories instead of two (we performed analyses using both schemes in the paper):

Orange (#ffa500): terrestrial forested and open microhabitats (1)
Brown (#a52a2a): terrestrial open microhabitats (2)
Blue (#0000ff): terrestrial forested microhabitats (3)
Green (#008b00): arboreal microhabitats (4)

Analyzing each map separately can be useful for spotting differences among reconstructions, but it’s hard to mentally combine all 10 maps. So let’s create an object that summarizes the information contained in each map, which will allow us to visualize the probabilities of each state at each node of the phylogeny.

simmap.summ <- summary(simmap.vipers)
simmap.summ #contains useful summary information. Take a look!

par(mfrow = c(1, 1))

plotSimmap(simmap.vipers[[1]], colors = setNames(colors, 1:4), type = "fan", fsize = 0.01, lwd = 3, hold = FALSE)
nodelabels(node = 1:simmap.vipers[[1]]$Nnode + Ntip(simmap.vipers[[1]]), pie = simmap.summ$ace, cex = 0.6, piecol = colors)
legend("bottomright", legend = c("Open + Closed", "Open", "Closed", "Arboreal"), pch = 19, col = colors, text.col = "black")

Q: Based on this plot, what is the most likely state for the ancestor of all viperids?

In addition to evaluating the possible evolutionary histories of habitat in vipers, we can use OUwie to test whether body size evolved similarly across habitats. Specifically, we can assess whether body size changes more rapidly in certain habitats, or whether, for example, arboreal environments impose distinct selective pressures: patterns that would be reflected in different models or parameter estimates.

OUwie (Beaulieu et al. 2012)

The OUwie package allows us to fit multiple models of continuous trait evolution (such as body size) across different habitats. These models fall into two main classes: Brownian Motion (BM) and Ornstein-Uhlenbeck (OU). The method was developed within a model selection framework, so to determine which model best fits the data, we need to test several options and compare their AIC values. The models currently implemented are:

  1. BM1: Brownian Motion with a single evolutionary rate (σ²)
  2. BMS: Brownian Motion with different phenotypic evolutionary rates (σ²) for each state
  3. OU1: Ornstein-Uhlenbeck with a single optimum (θ) shared by all species
  4. OUM: OU with different optima (θ) for each state, but the same α and σ²
  5. OUMV: OU with different optima (θ) and σ² for each state, but the same α
  6. OUMA: OU with different optima (θ) and α for each state, but the same σ²
  7. OUMVA: OU with different optima (θ), σ², and α for each state
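For reference, all of the OU variants above are elaborations of the same underlying process, in which a trait X is pulled toward an optimum θ with strength α while diffusing at rate σ²:

dX(t) = α[θ - X(t)] dt + σ dW(t)

where W(t) is standard Brownian Motion. The models differ only in which of α, σ², and θ are allowed to vary across states; with α = 0 the pull disappears and the process reduces to BM.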

The more complex the models, the longer they can take to fit. So, for now, we’ll focus on simpler ones.

Remember: we generated 10 simmaps in the previous step. Ideally, we would fit all the models of interest across all 10 simmaps. The code below shows how to fit each model to a single simmap; running them takes ~5min. Due to time constraints, I’ve already run the full analysis, so you can jump straight to examining the results if you want to.

ouwie.res.bm1 <- OUwie(simmap.vipers[[1]], bs.hab.vipers, model = "BM1", simmap.tree = TRUE, diagn = TRUE)
ouwie.res.bms <- OUwie(simmap.vipers[[1]], bs.hab.vipers, model = "BMS", simmap.tree = TRUE, diagn = TRUE)
ouwie.res.ou1 <- OUwie(simmap.vipers[[1]], bs.hab.vipers, model = "OU1", simmap.tree = TRUE, diagn = TRUE)
ouwie.res.oum <- OUwie(simmap.vipers[[1]], bs.hab.vipers, model = "OUM", simmap.tree = TRUE, diagn = TRUE)
ouwie.res.oumv <- OUwie(simmap.vipers[[1]], bs.hab.vipers, model = "OUMV", simmap.tree = TRUE, diagn = TRUE)

Let’s take a look at the results one by one to get a sense of what each model is estimating:

ouwie.res.bm1 #Estimates a single rate (σ²) across all habitats
ouwie.res.bms #Estimates different rates (σ²) for each habitat
ouwie.res.ou1 #Estimates a single optimum (θ), strength of selection (α), and rate (σ²)
ouwie.res.oum #Estimates different optima (θ) for each habitat, but a single α and σ²
ouwie.res.oumv #Estimates different optima (θ) and rates (σ²) for each habitat, but a single α

Note: I did not include the OUMA and OUMVA models because they take significantly longer to run and often return errors depending on the simmap used. In many cases, the data may not contain enough information for the model to converge on a reasonable solution. But feel free to try them out if you’re curious!
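If you do want to try them, the calls follow exactly the same pattern as above (uncomment to run; expect long run times and possible convergence warnings):

#ouwie.res.ouma <- OUwie(simmap.vipers[[1]], bs.hab.vipers, model = "OUMA", simmap.tree = TRUE, diagn = TRUE)
#ouwie.res.oumva <- OUwie(simmap.vipers[[1]], bs.hab.vipers, model = "OUMVA", simmap.tree = TRUE, diagn = TRUE)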

Let’s see which model provides the best fit among those we compared:

aicc.values <- c(
  BM1  = ouwie.res.bm1$AICc,
  BMS  = ouwie.res.bms$AICc,
  OU1  = ouwie.res.ou1$AICc,
  OUM  = ouwie.res.oum$AICc,
  OUMV = ouwie.res.oumv$AICc
)

aicc.values
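Raw AICc values can be hard to read on their own. A common next step (plain R, no OUwie functions involved) is to compute ΔAICc and Akaike weights, which express the relative support for each model:

delta.aicc <- aicc.values - min(aicc.values) #Difference from the best (lowest-AICc) model
aicc.weights <- exp(-0.5 * delta.aicc) / sum(exp(-0.5 * delta.aicc)) #Relative model support; sums to 1
round(aicc.weights, 3)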

Let’s now use the results from Alencar et al. (2017). In that study, the best-fitting model was OUMVA, applied across 449 stochastic reconstructions of habitat transitions. Here, we’ll extract only the rate (σ²) estimates for further analysis.

#This is a summary object we created to help organize the OUwie results, which can become overwhelming (especially when working with multiple simmaps and phylogenies). Take a moment to explore the object. The abbreviations FO, O, F, and A correspond to the habitat categories 1, 2, 3, and 4 that we've been working with.

load("ouwie_vipers_filtered.Rdata")
head(vipersresults.filtered)

#Let's extract all the sigma (σ²) values from the best-fitting model (OUMVA) across the 449 simmaps.

sigma.table <- data.frame(
  sigma = as.numeric(unlist(lapply(vipersresults.filtered, function(x) x[c("FO_sigma", "O_sigma", "F_sigma", "A_sigma"), 7]))),
  habitat = factor(rep(c("Mixed", "Open", "Forest", "Arboreal"), 449), levels = c("Mixed", "Open", "Forest", "Arboreal"))
) 

#Now let's visualize how phenotypic rates (sigma) vary across habitats.

ggplot(data = sigma.table, mapping = aes(x = habitat, y = sigma)) +
  geom_jitter(aes(colour = habitat)) +
  scale_colour_manual(values = colors) +
  theme(legend.position = "none") +
  ylim(0, 0.006) +
  labs(x = "Habitat", y = "Sigma - OUMVA")
#By setting ylim, we're excluding some outliers from the plot.
#If you're curious, try plotting again without + ylim(0, 0.006) to see the full range of values.

Note: It’s worth mentioning two useful tools in the OUwie package: OUwie.boot and OUwie.sim. These functions help assess model adequacy and explore uncertainty in parameter estimates.

OUwie.sim allows you to simulate traits under a given model and parameters. This is useful for understanding what kind of data you might expect under the “best” model (and comparing it to your empirical trait).

OUwie.boot performs parametric bootstrapping to assess the reliability of parameter estimates. Internally, it uses OUwie.sim to simulate trait data under a specified model and parameters. It then refits the model to each simulated dataset, generating a distribution of parameter estimates. By comparing these simulated estimates to those obtained from the empirical data, you can evaluate whether your observed parameter values are likely to be recovered under the fitted model.

Together, these tools provide a way to go beyond model comparison and ask whether your best-fitting model actually provides a good explanation for the data. We won’t use them in this tutorial due to time constraints, but they are highly recommended.
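For reference, here is a minimal sketch of what a parametric bootstrap of the OUMV fit might look like, feeding the point estimates from ouwie.res.oumv back into OUwie.boot. Treat the argument names as assumptions and double-check them against ?OUwie.boot for your OUwie version before running:

#boot.oumv <- OUwie.boot(simmap.vipers[[1]], bs.hab.vipers, model = "OUMV",
#                        nboot = 100, simmap.tree = TRUE,
#                        alpha = ouwie.res.oumv$solution["alpha", ],
#                        sigma.sq = ouwie.res.oumv$solution["sigma.sq", ],
#                        theta = ouwie.res.oumv$theta[, 1],
#                        theta0 = ouwie.res.oumv$theta[1, 1])
#apply(boot.oumv, 2, quantile, probs = c(0.025, 0.975)) #Approximate 95% bootstrap intervals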


2. OU methods that do not require regimes defined a priori

l1ou (Khabbazian et al. 2016)

l1ou provides a framework for identifying shifts in (macro)adaptive regimes without requiring predefined hypotheses, as we did earlier with OUwie. Instead, it uses a data-driven approach to detect changes in the expected trait optima directly from the data. The method fits OU models under the assumption that adaptive peaks can shift across the phylogeny, allowing us to model a changing (macro)adaptive landscape over time and among lineages. It searches for the most likely configuration of shifts and supports model selection using different criteria, such as AIC and BIC. This approach allows us to explore how many shifts best explain observed trait patterns, which lineages share similar optima, and where convergence may have occurred along the tree.

Let’s use l1ou to explore the macroevolutionary adaptive landscape for standard length in anguilliform fishes (eels). Remember that l1ou only shifts theta (the optima) and fixes a single alpha and sigma for the entire tree. We will explore l1ou (and, further down, bayou) using a phylogeny from Rabosky et al. (2018).

#Load and plot the phylogeny
anguilliformes.tree <- read.tree("anguilliformes_tree.tre")
plot(anguilliformes.tree, cex = 0.1)
axisPhylo()
#Load the trait data
load("anguilliformes_length.Rdata") 
anguilliformes.sl

#Formatting trait data (needs to be in vector format)
anguilliformes.sl.vec <- anguilliformes.sl$Standard_length
names(anguilliformes.sl.vec) <- anguilliformes.sl$tree_name

#We will use "adjust_data" to "adjust" the tree and traits to meet the requirements of l1ou
anguilli.format <- adjust_data(anguilliformes.tree, anguilliformes.sl.vec)

#Take a look at the output
anguilli.format$tree
anguilli.format$Y

Time to run l1ou. First, let’s estimate the best shift configuration and explore the output. I recommend checking the help page for ?estimate_shift_configuration to explore the function’s arguments and what they do.

anguilli.model <- estimate_shift_configuration(anguilli.format$tree, anguilli.format$Y, rescale = FALSE, root.model = "OUrandomRoot", criterion = "AICc", alpha.upper = 5, nCores = 5)
#this takes ~1min

anguilli.model$shift.values #Shows the estimated *change* in optima (θ) associated with each shift. In other words, this is the difference between the "old" and "new" optima when a shift occurs. It reflects the magnitude and direction of the shift, but NOT the absolute trait optimum value.

anguilli.model$nShifts 
#Number of shifts in optima detected

anguilli.model$optima 
#Optima estimated for each evolutionary regime

#Take a look at the best shift configuration recovered by l1ou
plot(anguilli.model, cex = 0.2) #the values represent the magnitude and direction of the shift

Now that we have estimated the best shift configuration, we can test whether any of the inferred trait optima are statistically indistinguishable and therefore represent the same adaptive regime (i.e., convergence).

anguilli.convergent <- estimate_convergent_regimes(anguilli.model, criterion = c("AICc"), method = c("backward"), nCores = 5)
#This step takes around 5 minutes to run. To save time, you can skip the line above and load the pre-computed results instead: just uncomment the line below.

#load("anguilli_convergent.Rdata")
anguilli.convergent #This shows which nodes in the phylogeny belong to the same regime, meaning those lineages have converged in body length.

Now, let’s assess how strongly supported the detected regime shifts are by running bootstrapped replicates of the trait data on the tree using the l1ou_bootstrap_support function. This is a useful tool for evaluating how robust your inferred shift configuration is. It helps identify which regime shifts are well supported and which ones might be more uncertain. Higher support means greater confidence that a shift is real.

In this example, I ran 10 replicates, but ideally you’d want to run many more. However, this process can be time-consuming: 10 replicates took about 30 minutes on my machine. You don’t need to run the code yourself now; just load the results from the pre-computed run and take a look at the output.

#boot.anguilli.model <- l1ou_bootstrap_support(anguilli.model, nItrs = 10, nCores = 5) 

load(file = "boot_anguilli_model.Rdata")

boot.anguilli.model$detection.rate #You can readily see that several shifts were recovered in fewer than 10% of bootstrap replicates

sum(boot.anguilli.model$detection.rate > 0.5) #Only 10 shifts were recovered in more than 50% of the bootstraps

boot.anguilli.model$all.shifts #Here you can check the shifts detected in each bootstrap run
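To see exactly which edges carry the well-supported shifts, you can ask for their indices directly:

which(boot.anguilli.model$detection.rate > 0.5) #Edge indices of shifts recovered in >50% of replicates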

Now let’s use the example code provided in ?l1ou_bootstrap_support to plot the bootstrap support for each shift previously detected by l1ou.

nEdges <- Nedge(anguilliformes.tree)
e.w <- rep(1,nEdges) 
e.w[anguilli.model$shift.configuration] <- 3
e.l <- round(boot.anguilli.model$detection.rate*100, digits=1)
e.l <- ifelse(e.l>50, paste0(e.l,"%"), NA)
plot(anguilli.model, edge.label=e.l, edge.ann.cex=0.7, edge.label.ann=TRUE, cex=0.2, label.offset=0.02, edge.width=e.w)

Q: Try changing the “support limit” to 10%. What can you conclude?
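Hint: only the labeling threshold in the snippet above needs to change, e.g.:

e.l <- round(boot.anguilli.model$detection.rate*100, digits=1)
e.l <- ifelse(e.l>10, paste0(e.l,"%"), NA) #Now labeling shifts recovered in more than 10% of replicates
plot(anguilli.model, edge.label=e.l, edge.ann.cex=0.7, edge.label.ann=TRUE, cex=0.2, label.offset=0.02, edge.width=e.w)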

bayou (Uyeda and Harmon 2014)

Let’s begin by simulating a multi-optimum Ornstein-Uhlenbeck process on a phylogeny to get a feel for how these models capture adaptation on phylogenies.

First let’s simulate a tree and rescale it to 100 million years. The second step is optional, but it helps ensure all our trees and parameters are on a common scale, one that will probably be typical of the trees you’ll analyze. We will also reorder the tree into “postorder” format. bayou will automatically reorder the tree to postorder, but starting with a reordered tree makes it easier to match the branch numbers used by bayou to the tree.

tree <- sim.bdtree(b = 1, d = 0, stop = "taxa", n = 50, seed = 1)
tree$edge.length <- tree$edge.length/max(branching.times(tree))*100
tree <- reorder(tree, "postorder")

plot(tree, cex = 0.5)

Now let’s simulate some data and use bayou to estimate adaptive shifts on phylogenies. Let’s first define a set of parameter values to use to simulate our data.

set.seed(1)
truepars <- list(alpha=0.1, sig2=0.1, 
                 k=4, ntheta=5, theta=rnorm(5, 0, 4))
trueshiftlocations <- list(sb = c(94, 71, 45, 12), loc = c(6.9, 2.2, 0.8, 3.9)) #I encourage you to use identifyBranches instead.
truepars <- c(truepars, trueshiftlocations)
truepars$t2 <- 2:5

plotBayoupars(truepars, tree, cex = 0.5)
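As the comment above suggests, instead of hard-coding shift locations you can pick them interactively with bayou's identifyBranches, which plots the tree and lets you click on branches. A sketch (commented out because it requires an interactive session; if I recall the interface correctly it returns a list with elements sb and loc, but see ?identifyBranches):

#trueshiftlocations <- identifyBranches(tree, 4) #Click 4 branches on the plotted tree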

Now using the function dataSim, we can simulate trait data.

dat <- dataSim(truepars, tree, model="OU")$dat

To add realism, let’s add some measurement error to the data. This is a good reminder to always account for measurement error in your analyses. OU models especially are affected by it, because OU models have the effect of “erasing” evolutionary history as alpha increases. If you don’t account for measurement error, that error will be absorbed into the evolutionary process model: you can make a Brownian Motion model look very OU-like if there is a lot of measurement error.

MEvar <- 0.1
dat <- dat + rnorm(length(dat), 0, sqrt(MEvar))

We can now define our prior for our model. The prior function is going to take our parameters and output the prior probability of our parameter values. It represents our initial degree of belief in what values the parameters will take.

priorOU <- make.prior(tree,
                      dists = list(dalpha = "dhalfcauchy", dsig2 = "dhalfcauchy",
                                   dk = "cdpois", dtheta = "dnorm"),
                      param = list(dalpha = list(scale = 0.1),
                                   dsig2 = list(scale = 0.1),
                                   dk = list(lambda = 10, kmax = 50),
                                   dsb = list(bmax = 1, prob = 1),
                                   dtheta = list(mean = mean(dat), sd = 1.5 * sd(dat))))
#*Obs*: If you get `Error in plot.new() : figure margins too large`, just increase the size of the plot window and it should be fine.

make.prior tries to be reasonable and not make you type everything out, but do not be comfortable with defaults. One trick to make sure your prior functions are reasonable is to simulate a bunch of values and take the quantiles of the distribution. We are using a half-Cauchy distribution for alpha and sig2, which is a good weakly informative prior for scale parameters.
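For example, because a half-Cauchy is simply the absolute value of a Cauchy, you can check what the dalpha prior above implies using base R alone:

alpha.draws <- abs(rcauchy(10000, location = 0, scale = 0.1)) #Draws from a half-Cauchy(scale = 0.1)
quantile(alpha.draws, c(0.5, 0.9, 0.99)) #Do these quantiles look biologically plausible?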

To run our MCMC, we have to initiate the MCMC chain with some starting values. It’s good to run multiple chains from multiple different starting values. Let’s simulate some values from the prior distribution and make sure our prior function works.

startpars <- priorSim(priorOU, tree, plot=TRUE)$pars[[1]]
startpars
priorOU(startpars)

We’re now going to take what we have and put it into the function bayou.makeMCMC. This function does not immediately initiate the MCMC; instead, it creates an object that we can use to manage our MCMC analysis. When bayou runs an MCMC, it writes the output to a set of files that need to be put somewhere. This ensures that memory doesn’t fill up with increasingly long chains. By default, new.dir=TRUE, which means the output is saved to your R temporary directory, but you can specify a particular directory if you wish (using the argument file.dir=).

#Uncomment the lines below to run the analysis yourself, or skip ahead and load the pre-computed results provided below.

#set.seed(1)

#mcmcOU <- bayou.makeMCMC(tree, dat, SE=MEvar, prior=priorOU, file.dir = "mcmcOU_bayou", plot.freq=NULL) #Set up the MCMC

#mcmcOU$run(10000) #Run the MCMC

The full MCMC results are written to a set of files. We can load them back in to R as follows.

load("mcmcOU_saved.RData")
chainOU <- mcmcOU$load()
chainOU

Let’s take a look at the results. We can set a “burnin” parameter that tells the package coda to discard the first bit of the chain.

chainOU <- set.burnin(chainOU, 0.3)
summary(chainOU)
plot(chainOU, auto.layout=FALSE)

Our traces will probably look bad; 10,000 generations isn’t long enough to reach convergence. Also, note the small effective sample sizes in our summary (the NAs for the “all theta” row are expected: theta isn’t a single parameter but a variable number of optima that come in and out of existence throughout the chain).
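If you want a number for a specific parameter, you can compute its effective sample size directly with coda (a dependency of bayou). A sketch, assuming the chain stores lnL and alpha in a form that unlist() turns into numeric vectors:

library(coda)
effectiveSize(mcmc(unlist(chainOU$lnL))) #ESS for the log-likelihood trace
effectiveSize(mcmc(unlist(chainOU$alpha))) #ESS for alpha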

Let’s visualize what we have so far. First, we will plot the truth, then 3 alternative ways of visualizing our chain.

par(mfrow=c(2,2))

plotBayoupars(truepars, tree, main = "True parameters")

plotSimmap.mcmc(chainOU, burnin = 0.3, pp.cutoff = 0.3) #This plot shows the location of shifts and regimes, with circle sizes representing their associated posterior probabilities.

plotBranchHeatMap(tree, chainOU, "theta", burnin = 0.3, pal = cm.colors) #Colors here show the theta values

phenogram.density(tree, dat, burnin = 0.3, chainOU, pp.cutoff = 0.3) #This plot shows how the trait evolved through time according to the model. You can see that one lineage reached a much higher optimum (theta) compared to the rest of the tree.

Q: How well is bayou doing in picking up the simulated shifts?

bayou is a very flexible method that allows for alternative parameterizations, including “Quantitative Genetics” and “Allometric” models. I recommend checking out the bayou GitHub tutorial to explore these options in more detail.

Now let’s try running bayou with some empirical data. We’ll apply it to the viper + body size dataset. This is the same one used for the OUwie analyses, but this time we’ll exclude the habitat data, since bayou does not require regimes to be defined a priori.

#Load the tree
vipers.tree <- read.tree("vipers_tree.tre")

#Load body size
bs.hab.vipers <- read.csv2("vipers_habitat_bodysize_mod.csv", row.names = 1)
head(bs.hab.vipers)

#Create a named vector with body size data and exclude the habitat data we don't need
bs.vipers <- bs.hab.vipers$body.size
names(bs.vipers) <- bs.hab.vipers$species

Now let’s work on the priors:

#I'm setting kmax to half the number of branches in the phylogeny (length(vipers.tree$edge.length) / 2), and the Poisson mean (lambda) to 0.2 * kmax. I'm also using the empirical body size values to generate a prior for theta.

priorOU <- make.prior(vipers.tree,
                      dists = list(dalpha = "dhalfcauchy", dsig2 = "dhalfcauchy",
                                   dk = "cdpois", dtheta = "dnorm"),
                      param = list(dalpha = list(scale = 0.1),
                                   dsig2 = list(scale = 0.1),
                                   dk = list(lambda = 49, kmax = 247),
                                   dsb = list(bmax = 1, prob = 1),
                                   dtheta = list(mean = mean(bs.vipers), sd = 1.5 * sd(bs.vipers))))
dev.off()

Remember, we need to initialize the MCMC with starting values, so let’s go ahead and do that:

set.seed(123)

startpars <- priorSim(priorOU, vipers.tree, plot=TRUE, cex = 0.1)$pars[[1]]
priorOU(startpars)

We are now ready to run bayou:

#mcmcOU_vipers <- bayou.makeMCMC(vipers.tree, bs.vipers, prior=priorOU, startpar=startpars, file.dir = "mcmcOU_vipers", plot.freq=NULL) #Set up the MCMC

#mcmcOU_vipers$run(30000) #This takes around 2 minutes to run. In practice, you'd want to run this for much longer to ensure proper convergence.

Now let’s remove the first 30% of the MCMC as burn-in and take a look at the results. Let’s first see if the parameters converged (spoiler: they didn’t):

load("mcmcOU_vipers_saved.RData")
chain.bs.vipers <- set.burnin(mcmcOU_vipers$load(), 0.3)

summary(chain.bs.vipers)
plot(chain.bs.vipers, auto.layout=FALSE)
#Yikes...check out those effective sample sizes and trace plots. Definitely not nearly enough sampling to trust the estimates yet.

Now we’ll visualize the shifts that appear in at least 30% of the posterior samples. These are shown as different colors with larger circles. Try increasing the pp.cutoff, say to 0.9, to see how the visualization changes.

plotSimmap.mcmc(chain.bs.vipers, burnin = 0.3, pp.cutoff = 0.3, cex = 0.1)
plotBranchHeatMap(vipers.tree, chain.bs.vipers, "theta", burnin = 0.3, cex = 0.1, pal = cm.colors)
phenogram.density(vipers.tree, bs.vipers, burnin = 0.3, chain.bs.vipers, pp.cutoff = 0.3) 

Obs. Look at those big guys, those are the bushmasters! The longest vipers out there (though not necessarily the heaviest).