Earlier in this course, we learned how to use MCMCglmmm
to estimate Additive Genetic Variance-Covariance matrices (G-matrices or G) from related individuals. Let’s do a comparison of estimating the G-matrix from pedigrees and the R-matrix from phylogenies.
#Load some familiar packages
library(MCMCglmm)
## Loading required package: Matrix
## Loading required package: coda
## Loading required package: ape
library(mvtnorm)
#Let's simulate our data. As simulation deities, these are the TRUE VALUES of G and E
Gmatrix <- matrix(c(3.5, 1.1,
1.1, 2), byrow=TRUE, ncol=2, nrow=2)
Ematrix <- matrix(c(1, -0.2,
-0.2, 1), byrow=TRUE, ncol=2, nrow=2)
#Let's grab Patrick Carter's Toy pedigree and make it smaller to speed up computation
Toy4ped <- readRDS("./Toy4ped.rds")
pedigree <-prunePed(Toy4ped,keep=1501:2500, make.base=TRUE)
There are built in tools to simulate breeding values from a pedigree and given the G and E matrices. Let’s use them.
set.seed(1) #To get consistent results
simdat <- rbv(pedigree, Gmatrix, nodes="ALL") + rmvnorm(nrow(pedigree),c(0,0), Ematrix) #Simulates breeding values from a pedigree
simdat <- data.frame(trait1=simdat[,1], trait2=simdat[,2], animal=rownames(simdat)) #Arrange and name it how MCMCglmm wants
plot(simdat[,1], simdat[,2], xlab="Trait 1", ylab="Trait 2", pch=21, col="gray20", bg="gray20")
Let’s rerun a very simple quantitative genetic analysis:
#I took these from the MCMCglmm tutorial, should be similar to what Patrick had you do yesterday.
prior <- list(R=list(V=diag(2)*(0.002/1.002),nu=1.002),
G=list(G1=list(V=diag(2)*(0.002/1.002),nu=1.002)))
mmQG <- MCMCglmm(cbind(trait1, trait2)~ trait-1, random=~us(trait):animal, rcov=~us(trait):units, family = c("gaussian", "gaussian"), prior=prior, data=simdat, pedigree = pedigree)
##
## MCMC iteration = 0
##
## MCMC iteration = 1000
##
## MCMC iteration = 2000
##
## MCMC iteration = 3000
##
## MCMC iteration = 4000
##
## MCMC iteration = 5000
##
## MCMC iteration = 6000
##
## MCMC iteration = 7000
##
## MCMC iteration = 8000
##
## MCMC iteration = 9000
##
## MCMC iteration = 10000
##
## MCMC iteration = 11000
##
## MCMC iteration = 12000
##
## MCMC iteration = 13000
#We're gonna run it for FAR TOO SHORT A TIME cause we don't care if our estimates are really rough...in the interests of speed
Let’s pull out our estimated G and E from the our model fit using the median of the posteriors.
G <- matrix(c(median(mmQG$VCV[,1]), median(mmQG$VCV[,2]),
median(mmQG$VCV[,3]), median(mmQG$VCV[,4])), byrow=TRUE, nrow=2, ncol=2)
E <- matrix(c(median(mmQG$VCV[,5]), median(mmQG$VCV[,6]),
median(mmQG$VCV[,7]), median(mmQG$VCV[,8])), byrow=TRUE, nrow=2, ncol=2)
cat("True values for G-matrix \n", Gmatrix, "\n Estimated values for G-matrix \n", G, "\n\n")
## True values for G-matrix
## 3.5 1.1 1.1 2
## Estimated values for G-matrix
## 3.317423 1.366492 1.366492 2.304761
cat("True values for E-matrix \n", Ematrix, "\n Estimated values for E-matrix \n", E)
## True values for E-matrix
## 1 -0.2 -0.2 1
## Estimated values for E-matrix
## 1.366273 -0.3820973 -0.3820973 1.087083
Not half bad!! Now we used a fancy simulator from the package. I want to demystify it for you. We don’t need that to simulate data (although it is more efficient than what I’m going to have you do now). To get data like our data, we simply draw from a multivariate normal distribution. All normal distributions just need a mean and variance. All of our traits for all individuals have the same mean of 0 for both traits. The complete variance-covariance matrix FOR ALL OBSERVATIONS is equal to kronecker(G, A) + kronecker(E,I)
where A
is the relatedness matrix, and I
is the identity matrix. In other words, mvrnorm(1, c(0,0), kronecker(G,A) + kronecker(E,I))
. The “Kronecker product” multiplies each entry of one matrix (e.g. G) by another matrix (e.g. A), basically making a matrix of submatrices. This will be much slower than rbv.
set.seed(1)
A <- solve(inverseA(pedigree)$Ainv)
I <- diag(nrow(simdat))
.simdat2 <- MASS::mvrnorm(1, rep(0, dim(G)[1]*nrow(A)), kronecker(G, A) + kronecker(E, I))
Does the distribution look similar to before?
simdat2 <- data.frame(trait1=.simdat2[1:nrow(A)], trait2=.simdat2[(nrow(A)+1):(2*nrow(A))], animal=pedigree$animal)
plot(simdat2[,1:2], xlab="Trait 1", ylab="Trait 2", pch=21, col="gray20", bg="gray20", main="Direct from mvNormal Distribution", xlim=c(-7,7), ylim=c(-7,7))
plot(simdat[,1:2], xlab="Trait 1", ylab="Trait 2", pch=21, col="gray20", bg="gray20", main="From rbv in MCMCglmm", xlim=c(-7,7), ylim=c(-7,7))
Let’s see if the model fit gets a similar answer.
prior <- list(R=list(V=diag(2)*(0.002/1.002),nu=1.002),
G=list(G1=list(V=diag(2)*(0.002/1.002),nu=1.002)))
mmQG2 <- MCMCglmm(cbind(trait1, trait2)~ trait-1, random=~us(trait):animal, rcov=~us(trait):units, family = c("gaussian", "gaussian"), prior=prior, data=simdat2, pedigree = pedigree)
##
## MCMC iteration = 0
##
## MCMC iteration = 1000
##
## MCMC iteration = 2000
##
## MCMC iteration = 3000
##
## MCMC iteration = 4000
##
## MCMC iteration = 5000
##
## MCMC iteration = 6000
##
## MCMC iteration = 7000
##
## MCMC iteration = 8000
##
## MCMC iteration = 9000
##
## MCMC iteration = 10000
##
## MCMC iteration = 11000
##
## MCMC iteration = 12000
##
## MCMC iteration = 13000
G2 <- matrix(c(median(mmQG2$VCV[,1]), median(mmQG2$VCV[,2]),
median(mmQG2$VCV[,3]), median(mmQG2$VCV[,4])), byrow=TRUE, nrow=2, ncol=2)
E2 <- matrix(c(median(mmQG2$VCV[,5]), median(mmQG2$VCV[,6]),
median(mmQG2$VCV[,7]), median(mmQG2$VCV[,8])), byrow=TRUE, nrow=2, ncol=2)
cat("True values for G-matrix \n", Gmatrix, "\n Estimated values for G-matrix \n", G2, "\n\n")
## True values for G-matrix
## 3.5 1.1 1.1 2
## Estimated values for G-matrix
## 3.201018 1.296374 1.296374 2.322227
cat("True values for E-matrix \n", Ematrix, "\n Estimated values for E-matrix \n", E2)
## True values for E-matrix
## 1 -0.2 -0.2 1
## Estimated values for E-matrix
## 1.610994 -0.4109324 -0.4109324 1.16305
Again, not bad!
#II. Now let’s move to a phylogenetic model. We’re going to do the same thing as before. This time, our traits are measured as species means. We can assume a model of genetic drift, and a tree in millions of years.
library(geiger)
library(phytools)
## Loading required package: maps
Ne <- 10000000 #set the effective population size
Rmatrix <- Gmatrix/Ne*1000000 #multiply by a million, so we can have our tree in millions of years
Lmatrix <- Ematrix/10
tree <- sim.bdtree(b=1, n=200, stop="taxa")
tree$edge.length <- tree$edge.length/max(branching.times(tree)) * 50 #Makes tree 50 million years old
We can use the R package geiger to simulate our trait data under Brownian Motion. To analogize to the the QG model, we’re going to add some random noise on top of that by simulating from the E matrix. In other words, our phylogenetic heritabilities will be less than 1. (This is analogous to what is done in so-called lambda models, which transform the tree. Lambda is similar in interpretation as the phylogenetic heritability we will get here).
phydat <- sim.char(tree, par=Rmatrix, nsim=1, root=0, model="BM")[,,1] + MASS::mvrnorm(n=length(tree$tip.label), c(0,0), Lmatrix)
plot(phydat, xlab="Trait 1", ylab="Trait 2", pch=21, col="gray20", bg="gray20", main="From sim.char in Geiger", xlim=c(-10,10), ylim=c(-10,10))
Convert to MCMCglmm format. Pass our tree into the function inverseA
to get the A matrix to feed into MCMCglmm.
phydat <- data.frame(trait1=phydat[,1], trait2=phydat[,2], animal=tree$tip.label)
rownames(phydat) <- tree$tip.label
inv.phylo <- inverseA(tree, scale=FALSE) #Take the inverse of the tree's variance-covariance matrix
Besides feeding the function our inverse of the A matrix from the phylogeny instead of the pedigree, this function call should be identical to the previous ones!
prior <- list(R=list(V=diag(2)*(0.002/1.002),nu=1.002),
G=list(G1=list(V=diag(2)*(0.002/1.002),nu=1.002)))
mmPCM <- MCMCglmm(cbind(trait1, trait2)~1, random=~us(trait):animal, rcov=~us(trait):units, family = c("gaussian", "gaussian"), prior=prior, data=phydat, ginverse=list(animal=inv.phylo$Ainv), scale=FALSE)
##
## MCMC iteration = 0
##
## MCMC iteration = 1000
##
## MCMC iteration = 2000
##
## MCMC iteration = 3000
##
## MCMC iteration = 4000
##
## MCMC iteration = 5000
##
## MCMC iteration = 6000
##
## MCMC iteration = 7000
##
## MCMC iteration = 8000
##
## MCMC iteration = 9000
##
## MCMC iteration = 10000
##
## MCMC iteration = 11000
##
## MCMC iteration = 12000
##
## MCMC iteration = 13000
Let’s compare estimates to the true values.
R <- matrix(c(median(mmPCM$VCV[,1]), median(mmPCM$VCV[,2]),
median(mmPCM$VCV[,3]), median(mmPCM$VCV[,4])), byrow=TRUE, nrow=2, ncol=2)
Lambda <- matrix(c(median(mmPCM$VCV[,5]), median(mmPCM$VCV[,6]),
median(mmPCM$VCV[,7]), median(mmPCM$VCV[,8])), byrow=TRUE, nrow=2, ncol=2)
cat("True values for R-matrix \n", Rmatrix, "\n Estimated values for R-matrix \n", R, "\n\n")
## True values for R-matrix
## 0.35 0.11 0.11 0.2
## Estimated values for R-matrix
## 0.3185308 0.0826571 0.0826571 0.2078679
cat("True values for L-matrix \n", Lmatrix, "\n Estimated values for L-matrix \n", Lambda)
## True values for L-matrix
## 0.1 -0.02 -0.02 0.1
## Estimated values for L-matrix
## 0.02489792 0.03835006 0.03835006 0.22726
Not terrible! Now let’s do our demystification of what distribution we’re pulling from again. Let’s make the phylogenetic variance-covariance matrix.
V <- vcv.phylo(tree)
V[1:10,1:10]
## s1 s2 s3 s4 s5 s6 s7 s8
## s1 50.00000 48.31730 28.99285 28.99285 28.99285 28.99285 28.99285 28.99285
## s2 48.31730 50.00000 28.99285 28.99285 28.99285 28.99285 28.99285 28.99285
## s3 28.99285 28.99285 50.00000 46.02603 46.02603 40.30517 40.30517 34.91043
## s4 28.99285 28.99285 46.02603 50.00000 46.26448 40.30517 40.30517 34.91043
## s5 28.99285 28.99285 46.02603 46.26448 50.00000 40.30517 40.30517 34.91043
## s6 28.99285 28.99285 40.30517 40.30517 40.30517 50.00000 42.10492 34.91043
## s7 28.99285 28.99285 40.30517 40.30517 40.30517 42.10492 50.00000 34.91043
## s8 28.99285 28.99285 34.91043 34.91043 34.91043 34.91043 34.91043 50.00000
## s9 28.99285 28.99285 34.91043 34.91043 34.91043 34.91043 34.91043 42.47080
## s10 28.99285 28.99285 34.91043 34.91043 34.91043 34.91043 34.91043 42.47080
## s9 s10
## s1 28.99285 28.99285
## s2 28.99285 28.99285
## s3 34.91043 34.91043
## s4 34.91043 34.91043
## s5 34.91043 34.91043
## s6 34.91043 34.91043
## s7 34.91043 34.91043
## s8 42.47080 42.47080
## s9 50.00000 47.35977
## s10 47.35977 50.00000
Now, we create a new dataset doing the same thing we did for the individuals on the pedigree, but use V instead of A.
set.seed(2)
ntip <- length(tree$tip.label)
I <- diag(ntip)
.phydat2 <- MASS::mvrnorm(n=1, mu=rep(0,2*ntip), kronecker(Rmatrix, V) + kronecker(Lmatrix, I))
Does it give us the same answer?
phydat2 <- data.frame(trait1 = .phydat2[1:ntip], trait2 = .phydat2[(ntip+1):(2*ntip)], animal=tree$tip.label)
mmPCM2 <- MCMCglmm(cbind(trait1, trait2)~1, random=~us(trait):animal, rcov=~us(trait):units, family = c("gaussian", "gaussian"), prior=prior, data=phydat2, ginverse=list(animal=inv.phylo$Ainv), scale=FALSE)
##
## MCMC iteration = 0
##
## MCMC iteration = 1000
##
## MCMC iteration = 2000
##
## MCMC iteration = 3000
##
## MCMC iteration = 4000
##
## MCMC iteration = 5000
##
## MCMC iteration = 6000
##
## MCMC iteration = 7000
##
## MCMC iteration = 8000
##
## MCMC iteration = 9000
##
## MCMC iteration = 10000
##
## MCMC iteration = 11000
##
## MCMC iteration = 12000
##
## MCMC iteration = 13000
R2 <- matrix(c(median(mmPCM2$VCV[,1]), median(mmPCM2$VCV[,2]),
median(mmPCM2$VCV[,3]), median(mmPCM2$VCV[,4])), byrow=TRUE, nrow=2, ncol=2)
Lambda2 <- matrix(c(median(mmPCM2$VCV[,5]), median(mmPCM2$VCV[,6]),
median(mmPCM2$VCV[,7]), median(mmPCM2$VCV[,8])), byrow=TRUE, nrow=2, ncol=2)
cat("True values for R-matrix \n", Rmatrix, "\n Estimated values for R-matrix \n", R2, "\n\n")
## True values for R-matrix
## 0.35 0.11 0.11 0.2
## Estimated values for R-matrix
## 0.3666159 0.1119979 0.1119979 0.2718917
cat("True values for L-matrix \n", Lmatrix, "\n Estimated values for L-matrix \n", Lambda2)
## True values for L-matrix
## 0.1 -0.02 -0.02 0.1
## Estimated values for L-matrix
## 0.03554768 0.01345103 0.01345103 0.0112972
Squint and it looks great.