Applying GLMMs to wild metabarcoding data

This is a tutorial illustrating the methods presented in Sweeny et al. 2023 (https://journals.asm.org/doi/10.1128/msystems.00040-23)

We provide an outline of applying generalised linear mixed-effects models to relative abundance tables of metabarcoding data as a means to account for complex covariates that are present in wildlife microbiota datasets. The below from the manuscript illustrates the model formulation and output.

For computing efficiency, an anonymised subset of one of the pilot datasets used in the companion manuscript is used in this tutorial and can be downloaded from the ‘Data’ folder on github (https://github.com/arsweeny/microbiome-glmm/tree/main/Data)

0. Set-up

To run this tutorial you will require the following packages:

library(tidyverse) # data manipulation and ggplot 
library(janitor) # data cleaning 
library(magrittr)  # data manipulation 
library(reshape2) # data manipulation 
library(phyloseq) # output format of sequence processing 
library(microbiome) # useful exploration and visualisation functions 
library(MCMCglmm) # implementation of GLMMs 
library(ALDEx2) # CLR transformations 
library(plotrix) # std error functions 
library(pals) # palettes 
library(MetBrewer) # palettes 
library(ggregplot) # package my friend made with lots of handy funcions - https://github.com/gfalbery/ggregplot

Plot admin

Set-up some palettes for later use

MicroColours <- met.brewer("Renoir",12, "discrete")
MicroColoursRamp <- colorRampPalette(MicroColours)
MicroColoursExpand <- MicroColoursRamp(n=20)
MicroColoursCat <- list(
  Age = MicroColours[c(3,4)],
  Sex= MicroColours[c(1,5)],
  Season=MicroColours[c(2,8)]
) #%>% unlist()


AmyTheme <- theme_bw(base_size=14) + 
  theme(
    axis.title = element_text(colour="black"),
    axis.title.x = element_text(angle=45, hjust=1) 
  )

Mods <- list() # Model list for later 

Reading in data objects

We will use the phyloseq object representing the feature abundance table, metadata, and taxonomy table for exploration of the data and analysis presented in the first section of the results (alpha and beta diversity).

We will use the csv object representing the phyloseq objected ‘melted’ into one dataframe for the mixed modelling approach outlined in Section 3.1

path <- "Data/"
df <- read_csv(paste0(path, "GLMMMTutorialData.csv")) # csv file 
ps <- readRDS(paste0(path,"PhyloseqGLMM.rds")) # phyloseq object 

Explore the dataset

Summarise phyloseq object

This data was previously filtered to includes ASVs with a total abundance of 250 reads, so we have no singletons and a total of 794 ASVs and 30 samples we will proceed with.

ps 
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 794 taxa and 30 samples ]
## sample_data() Sample Data:       [ 30 samples by 3 sample variables ]
## tax_table()   Taxonomy Table:    [ 794 taxa by 6 taxonomic ranks ]
summarize_phyloseq(ps) 
## [[1]]
## [1] "1] Min. number of reads = 20375"
## 
## [[2]]
## [1] "2] Max. number of reads = 46860"
## 
## [[3]]
## [1] "3] Total number of reads = 1093264"
## 
## [[4]]
## [1] "4] Average number of reads = 36442.1333333333"
## 
## [[5]]
## [1] "5] Median number of reads = 36980.5"
## 
## [[6]]
## [1] "7] Sparsity = 0.606591099916037"
## 
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
## 
## [[8]]
## [1] "8] Number of singletons = 0"
## 
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)0"
## 
## [[10]]
## [1] "10] Number of sample variables are: 3"
## 
## [[11]]
## [1] "age_class" "sex"       "side"

A look at the metadata

Sample variables include a mix of age classes (15 lamb and 15 adult), males and females, and individuals from the east and west side of the study system. Samples are row names.

sample_data(ps) %>% as.data.frame() %>%  head
##     age_class    sex side
## S57     Adult Female East
## S61      Lamb Female West
## S48      Lamb   Male West
## S42     Adult Female East
## S50      Lamb Female East
## S34      Lamb Female West

Relative abundance and alpha and beta diversity analysis

Community composition summary

By far community dominated by Firmicutes and Bacteroidetes

psPhy <- tax_glom(ps, taxrank = "phylum") # aggregates to the phylum level 
psMelt <- psmelt(psPhy) # melts into a combined df for plots 

SampleDepth <- psMelt %>% 
  group_by(Sample) %>% 
  summarise(Depth=sum(Abundance)) %>% as.data.frame() # summarise reads per sample 
RelAbundance <- full_join(psMelt, SampleDepth) %>% 
  group_by(Sample, phylum) %>% 
  summarise(RelAbundance=Abundance/Depth) %>% as.data.frame() # summarise proportion of each phyla for each sample 

require(plotrix)
RelAbundanceMean <- RelAbundance %>% 
  dplyr::select(-Sample) %>%  group_by(phylum) %>% 
  summarise_each(funs(mean, sd, std.error)) %>% 
  arrange(desc(mean)) # summarise proportion of phyla across all samples 

RelAbundanceMean
## # A tibble: 16 × 4
##    phylum                 mean       sd std.error
##    <chr>                 <dbl>    <dbl>     <dbl>
##  1 Firmicutes         0.471    0.0940    0.0172  
##  2 Bacteroidetes      0.380    0.0993    0.0181  
##  3 Proteobacteria     0.0546   0.112     0.0204  
##  4 Fibrobacteres      0.0320   0.0234    0.00427 
##  5 Verrucomicrobia    0.0175   0.0366    0.00667 
##  6 Cyanobacteria      0.0153   0.0116    0.00213 
##  7 Spirochaetes       0.00853  0.00665   0.00121 
##  8 Epsilonbacteraeota 0.00639  0.0122    0.00223 
##  9 Tenericutes        0.00464  0.00292   0.000533
## 10 Lentisphaerae      0.00241  0.00207   0.000378
## 11 Euryarchaeota      0.00220  0.00334   0.000610
## 12 Kiritimatiellaeota 0.00173  0.00205   0.000375
## 13 Planctomycetes     0.00147  0.00120   0.000219
## 14 Actinobacteria     0.00139  0.00276   0.000504
## 15 Fusobacteria       0.000475 0.00260   0.000475
## 16 Elusimicrobia      0.000396 0.000949  0.000173
ggplot(psMelt, aes(x = Sample, y = Abundance, fill = phylum)) + 
  geom_bar(stat = "identity", position = "fill", colour="transparent") + 
  scale_fill_manual(values=MicroColoursExpand) + 
  facet_grid(~age_class, scales="free_x") + 
  theme_bw(base_size = 14) + # labs(subtitle = "B") + 
  guides(fill=guide_legend(ncol=2)) +
  theme(axis.text.x = element_text(angle=45, hjust=1, size=6),
        strip.background = element_rect(fill="white"))

1. Alpha & Beta Diversity Community Analysis

Alpha diversity

Significant differences in diversity between age classes

# alpha metrics 
ps.even <- evenness(ps, index = "all")

ps.meta <- meta(ps)
ps.meta$simpson <- ps.even$simpson 
hist(ps.meta$simpson) 

# test for normality 
shapiro.test(ps.meta$simpson) # p=0.3525 - data not violating normality 
## 
##  Shapiro-Wilk normality test
## 
## data:  ps.meta$simpson
## W = 0.96222, p-value = 0.3525
qqnorm(ps.meta$simpson) # looks okay 

# compare between host groups 
SimpsonTestAge<- t.test(ps.meta$simpson[ps.meta$age_class=="Lamb"], ps.meta$simpson[ps.meta$age_class=="Adult"])

SimpsonTestAge # p = 0.0197 
## 
##  Welch Two Sample t-test
## 
## data:  ps.meta$simpson[ps.meta$age_class == "Lamb"] and ps.meta$simpson[ps.meta$age_class == "Adult"]
## t = -2.4932, df = 24.811, p-value = 0.0197
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.14879071 -0.01414539
## sample estimates:
## mean of x mean of y 
## 0.1917007 0.2731688

Visualise alpha diversity differences between age classes

require(ggpubr)
alpha.ps <- ps.meta %>% 
  ggviolin(., x="age_class", y="simpson", 
                add = "boxplot", fill = "age_class", 
                palette = c(MicroColoursCat$Sex), 
                alpha=0.6, add.params = list(alpha=0.6), 
                legend = "top") +
  stat_compare_means(method = "t.test", 
                     aes(label=..p.signif..), 
                     label.x=1.5, label.y=0.4,
                     size=7)

alpha.ps

Beta diversiy

ps.rel <- microbiome::transform(ps, "compositional")
bx.ord_pcoa_bray <- ordinate(ps.rel, "PCoA", "bray")

metadf.bx <- data.frame(sample_data(ps.rel))
bray_ps.bxn <- phyloseq::distance(physeq = ps.rel, method = "bray")

set.seed(999)
# Adonis test
library(vegan)
adonis.test <- adonis(bray_ps.bxn ~ age_class, data = metadf.bx)
adonis.test<-clean_names(adonis.test$aov.tab)
PERMANOVA results
df sums_of_sqs mean_sqs f_model r2 pr_f
age_class 1 0.7568348 0.7568348 4.145245 0.1289536 0.001
Residuals 28 5.1122134 0.1825791 NA 0.8710464 NA
Total 29 5.8690482 NA NA 1.0000000 NA

Visualise composition differences between host groups

beta.ps <- plot_ordination(ps.rel, 
                             bx.ord_pcoa_bray, 
                             color="age_class"
                             #label = "sample_id", 
) + 
  geom_point(aes(shape = age_class), size= 4) + 
  scale_colour_manual(values=MicroColoursCat$Sex)+
  theme_bw() +
  theme(plot.title = element_text(hjust = 0, size = 12),
        legend.position = "top")+
  stat_ellipse()+
  guides(shape="none")


beta.ps 

2. Mixed Model Analysis

Mixed models here will all be carried out in MCMCglmm, but your favourite package/ program should work just fine with some adaptation as well. Feel free to contact me about syntax switches between algorithms.

2A. Poisson GLMMs

  • These models produce the main results of the manuscript and use the raw read count per ASV per sample as the response with Poisson error families
  • Below we also present alternative approaches for computing speed and adjustment of questions asked by the model

Model set-up

  • read abundance as response
  • fixed effects should be relevant host factors that may affect sample variation in read abundance - in this instance samples were collected from adults and lambs to examine effects of age on the gut microbiome
mf <- 20 # multiplier for model iterations and chains 
resp<-"abundance" 
fixed<- c("age_class") 

mcmcF<- as.formula(paste(resp, "~", paste(fixed)))
  
Prior3 <- list(R = list(V = diag(1), nu = 0.002),
               G = list(G1 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100),
                        G2 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100),
                        G3 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100)))

Run model

# to run the model 
Mods[[1]]<- MCMCglmm(fixed= mcmcF, 
                          random= ~  sample + asv + age_class:asv, 
                          prior=Prior3,
                          data=df,
                          verbose=T, pr=T, pl=T, # will save posteriors for random effect levels & model predictions for each iteration 
                          family = "poisson",
                          nitt = 13000*mf,
                          thin = 10*mf,burnin=3000*mf)

Notes: * verbose = T will print progress * pr = T saves posterior distributions for all random effect levels (see section 4) * pl = T saves latent variables (model predicted variables for every data point & iteration, useful for testing for excess zeroes) * Note this did take my laptop ~ 90 minutes to run, so not a fast model!

Poisson results

View proportion variance associated with each random term

PropVarPois <- MCMCRep(Mods[[1]], scale="link") %>% # function from ggregplot to calculate proportion variance for each random effect 
  as.data.frame() %>% 
  mutate_at(c("Mode", "lHPD", "uHPD"), as.numeric)

PropVarPois
##       Component       Mode        lHPD       uHPD
## 1        sample 0.01203397 0.007792082 0.02416657
## 2           asv 0.17223235 0.146959862 0.21102106
## 3 age_class:asv 0.21607760 0.193830355 0.24683450
## 4         units 0.53964912 0.518421149 0.55962454

Variance components will not add to 1 here because there is some variance attributable to the poisson process. We’ll calculate that and visualise the variance breakdown for this model.

PropVarPois %>% pull(Mode) %>% sum -> TotalVar 
1- TotalVar -> PoissonVar 

PropVarPois[5, ] <- c("poisson", PoissonVar, NA, NA) 

PropVarPois %<>% 
  mutate_at(c("Mode", "lHPD", "uHPD"), as.numeric)

PropVarPois%>% 
  mutate(Component=fct_relevel(Component, "age_class:asv", "asv", "sample","poisson", "units")) %>% 
  ggplot(aes(x=1, y=as.numeric(Mode), fill=Component)) + 
  geom_bar(stat="identity", colour="transparent") + 
  scale_x_continuous(labels=NULL) + 
  scale_fill_manual(values=MicroColours[c(2,4,6,8,10)]) +
  theme_bw(base_size = 14) + 
  #theme(legend.position = "top") + 
  theme(axis.ticks.x = element_blank(), 
        legend.text = element_text(size=11), 
        legend.title = element_text(size=12)) + 
  labs(x='Poisson model', y='Proportion Variance')

3. Differential abunance analysis

  • We can use model outputs to explore which specific taxa are driving differential abundances between groups of interests (e.g. age), which is commonly of great interest in microbiome studies
  • This can be estimated in the example of age by using each ASV-by- season level of the random effect of asv:season and comparing posterior distributions for each ASV across factor levels
  • For example, the mean of the posterior distribution for ASV1:adult – that for ASV1:lamb can be interpreted as the differential abundance of ASV1 between spring and summer

Rank taxa effects

Let’s pull out the posterior distributions relevant to our host age groups of interest

Solutions<-Mods[[1]]$Sol[, 1:dim(Mods[[1]]$Z)[2]]

SolutionsDF<- Solutions %>% as.data.frame()

SolKeepAge <- SolutionsDF %>% dplyr::select(contains("age_class:asv")) %>% colnames()

SolutionsAge<- SolutionsDF %>% dplyr::select(SolKeepAge) 

SolutionsAdult <- SolutionsAge %>% 
  gather(key="level", value="posterior") %>% as.data.frame() %>% 
  separate(level, into=c("term", "age_class", "asv"), sep="[.]")

Here’s a function I wrote that you can apply to posterior distributions for each ASV in both age groups to calculate differences between groups (see Sections 2 & 3 in manuscript)

RanefPostComp<- function(df, comps, taxa, var){
  require(tidyverse)
  require(MCMCglmm)
  not_all_na <- function(x) any(!is.na(x))
  df$group <- df[, var]
  df$taxa <- df[, taxa]
  df<- df %>% filter(term!="(Intercept)") %>% 
    mutate(taxa=as.factor(taxa))
  LevelComps<-list()
  Levels=levels(df[, "taxa"])
  for(x in 1:length(Levels)){
    post1 <- df %>% subset(taxa==Levels[x] & group== comps[1]) %>% pull(posterior)
    post2 <- df %>% subset(taxa==Levels[x] & group== comps[2]) %>% pull(posterior)
    if(length(post2>0 ) & length(post1>0)){
      postComp <- post2-post1 %>% as.matrix() 
    }
    else{
      postComp <- rep(NA, 1000)
    }
    LevelComps[[x]] <-postComp
  }
  LevelCompsFin<- bind_cols(LevelComps) %>% as.matrix()
  colnames(LevelCompsFin) <- paste(Levels, comps[[2]], sep=".")
  LevelCompsFin %>% as.data.frame() %>% select_if(not_all_na) -> df 
  
  df1<-apply(df,2, function(a) mean(a)[1])
  df1l<-apply(df,2,function(a)  HPDinterval(as.mcmc(a))[1])
  df1u<-apply(df,2,function(a)  HPDinterval(as.mcmc(a))[2])
  df1v<-apply(df,2,function(a)  var(a)[1])
  
  df2 <- cbind(df1, df1l, df1u, df1v)
  colnames(df2) <- c("mean", "lower", "upper", "var")
  df2 <- df2 %>% as.data.frame() %>% 
    rownames_to_column(var="term") %>% 
    mutate(taxa=sapply(str_split(term, "[[.]]"), '[', 1),
           effect=var, 
           nonZeroCI= ifelse(lower<0&upper<0, "Y", 
                       ifelse(lower>0&upper>0, "Y", "N"))) 
  colnames(df2)[which(names(df2) == "taxa")] <- paste(taxa)
  return(df2)
  
}

Let’s apply this function to the age-related posterior distributions from the model. This will take a little while to run, especially for datasets larger than this test data! (Or just read them in below)

ASVByAge <- RanefPostComp(SolutionsAdult, 
                          comps=c("Lamb", "Adult"), 
                          taxa="asv", 
                          var="age_class") 

rm(var) # to avoid conflict later with another function 

Visualise taxa effects

Let’s prep the taxonomy rank effects for visualisation with some extra information

Taxo <- df %>% filter(!duplicated(asv)) %>% 
  dplyr::select(asv, phylum, fam) %>% as.data.frame() %>% 
  rename(family=fam) # MCMCglmm get's a little cranky if a variable in the df is called family so was renamed previously 

Prev <- df %>% group_by(asv) %>% 
  summarise(prevASV=length(abundance[abundance>0])/ 30) %>%  # n samples 
  arrange(-prevASV)

TaxoDf <- left_join(Prev, Taxo) 

TaxAge <- merge(ASVByAge, TaxoDf, by="asv", all.x=TRUE)

Identify taxa with the strongest effects (positive & negative shifts)

TaxAge %>% group_by(phylum) %>% na.omit() %>% 
  summarise(nASV=length(asv)) %>% filter(nASV>1) %>% 
  pull(phylum) %>% as.character() -> keepPhy # retain Phylum with > 1 asv for visualisation 

TopAgeASV<- TaxAge %>% 
  filter(nonZeroCI=="Y") # pull out taxa effects with CI not spanning zero 

Top50 <- TaxAge %>% arrange(-mean) %>% 
  filter(phylum %in% keepPhy) %>% 
  dplyr::select(asv, phylum, lower, upper, mean) %>% 
  #filter(lower>0) %>%  #make sure errors don't span zero 
  top_n(50) %>% 
  mutate(group="real")

Bottom50 <- TaxAge %>% arrange(-mean) %>% 
  filter(phylum %in% keepPhy) %>% 
  dplyr::select(asv, phylum, lower, upper, mean) %>%  
  #filter(upper<0.1) %>% 
  top_n(-50) %>% 
  mutate(group="real")

Create some filler categories for the axis breaks for creating a panel with highest ranked taxa (see Fig 2B & 2D)

DummiesAge<-paste0("dummy", rep(1:25))
GroupAge<-rep("dummy", 25)

DummyAge<- cbind(DummiesAge, GroupAge) %>% as.data.frame()
colnames(DummyAge)<- c("asv", "group")

select50<- bind_rows(Top50, DummyAge, Bottom50) 
select_x_age<- select50 %>% pull(asv)
MicroColoursPhy <- MicroColours
PlotPal <- data.frame(keepPhy, MicroColoursPhy[c(1,3,4:10,12)])
colnames(PlotPal) <- c("phylum", "colour")

AgeTopBottom<-
  select50 %>% filter(!is.na(phylum)) %>% 
  ggplot(aes(x=reorder(asv, mean), y=mean, colour=phylum))+
  geom_point(position=position_dodge(w=0.4), size=1.5)+
  geom_errorbar(position=position_dodge(w=0.5),
                aes(ymin=lower,ymax=upper),size=0.3,width=0.2)+
  geom_segment(aes(y=0, yend=0, x=120, xend=70), colour="black", size=0.7)+
  geom_segment(aes(y=0, yend=0, x=70, xend=50), linetype="dashed", colour="black", size=0.7)+
  geom_segment(aes(y=0, yend=0, x=50, xend=0), colour="black", size=0.7)+
  #geom_hline(aes(yintercept=0, linetype=group))+
  theme_classic(base_size = 14)+labs(x=NULL)+
  scale_colour_manual(values=PlotPal %>% 
                        filter(phylum %in% select50$phylum) %>% 
                        mutate(colour=as.character(colour)) %>% 
                        pull(colour), 
                      breaks=unique(as.character(select50$phylum)) %>% 
                        sort()) + 
  scale_x_discrete(limits=rev(select_x_age))+
  #geom_vline(aes(xintercept=50.5, colour="grey"), size=0.2) + 
  coord_flip()+
  theme(strip.text.x = element_text(size = 0.8), 
        strip.background =element_rect(fill="white"),
        axis.text.y = element_blank(), 
        axis.ticks.y=element_blank())+
  labs(subtitle = "B")

TaxAge  %>% na.omit() %>% subset(phylum %in% keepPhy) -> SinaDFAge

AgeSina<-
  ggplot() +
  geom_violin(data=SinaDFAge,
              aes(y=mean, x=phylum, fill=phylum), 
              colour="transparent", alpha=0.2) + 
  ggforce::geom_sina(data=SinaDFAge,aes(y=mean, x=phylum, size=1/var, colour=phylum), alpha=0.3) + 
  scale_size(range=c(0.15, 3), name="Inverse Variance") +
  scale_colour_manual(values=PlotPal %>% 
                        filter(phylum %in% SinaDFAge$phylum) %>% 
                        mutate(colour=as.character(colour)) %>% 
                        pull(colour), 
                      breaks=unique(as.character(SinaDFAge$phylum)) %>% 
                        sort()) + 
  scale_fill_manual(values=PlotPal %>% 
                        filter(phylum %in% SinaDFAge$phylum) %>% 
                        mutate(colour=as.character(colour)) %>% 
                        pull(colour), 
                      breaks=unique(as.character(SinaDFAge$phylum)) %>% 
                        sort()) + 
  geom_hline(yintercept=0, linetype="dashed")+
  theme_bw(base_size = 14) + 
  #coord_flip() +theme_ipsum(base_size = 10) + 
  theme(axis.text.x = element_text(angle=45, hjust=1), 
        axis.title.x = element_blank(),
        legend.position = "top")+ #+facet_wrap(~phylum) +
  guides(colour=FALSE, fill=FALSE)

AgeSinaRect <- AgeSina+
  geom_rect(data=Top50, 
            aes(xmin=0.75, xmax=10.25, ymin=min(mean-0.15), ymax=max(mean+0.25)),
            fill="transparent", color="grey60",size=0.5) +
  geom_rect(data=Bottom50,
            aes(xmin=0.75, xmax=10.25, ymin=min(mean-0.30), ymax=max(mean+0.25)),
            fill="transparent", color="grey60",size=0.5)+
  labs(subtitle = "A")

We can now visualise a plot comparable (but with fewer ASVs) to Figure 2 in the manuscript, where the highest ranked positive and negative shifts are highlighted by the rectangle in panel A & panel B forest plot.

As in the manuscript, although there is a lot of variation within phlya, the majority of highly positive shifts (increases into adulthood) are found in the Bacteroidetes, and the majority of the highly negative shifts (decreases into adulthood) are found in the Firmicutes.

require(patchwork)

AgeSinaRect + AgeTopBottom

4. Model Extension Examples

4A. Gaussian GLMMs

CLR Transformation Background

  • The CLR takes the form of the log-ratio of the count for a given taxon in a given sample and the geometric mean of the counts of all taxa in that sample

  • This transformation has several desirable properties:

  • As a log-ratio it is approximately Gaussian in distribution, allowing the use of a Gaussian linear mixed model

  • Secondly, the mean of the CLR of all taxa in any sample will be zero, allowing for simplification of our random effects structure (i.e. removal of random effects not dealing with taxa)

  • CLR is increasingly used for microbiome analyses for it’s explicit acknowledgement of read abundances as compositional (Gloor et al 2017)

  • Below we use the package ‘ADLEx2’ (Fernandes et al 2013) to calculate a centered log-ratio for each ASV in each sample.

  • In addition to CLR values, ALDEx2 estimates technical variation for each ASV from a probability distribution using Monte-Carlo instances drawn from the Drichlet distribution, which maintains the compositional nature of the data.

  • This variance for each asv:sample level can be incorporated as measurement error in the MCMCglmm syntax, comparable to the ‘asv:sample’ term in the Poisson models (3A).

  • Further details of CLR models can be found in ESM section 2.

CLR data prep

Put data in correct (matrix format)

df_t <- df  %>% 
  dplyr::select(asv:abundance) %>% 
  spread(sample, abundance) # spread data to wide format 

df_m <-df_t %>% 
  column_to_rownames("asv") %>% 
  as.matrix() # turn into a matrix 

Calculate CLR values and associated error

The below will create a transformed dataframe, or can read in the CLR format data in the next code chunk

n_sample<-length(unique(df$sample)) #30 
x<-aldex.clr(df_m,as.character(seq(1,n_sample)), mc.samples=128, denom="all", verbose=F) # run ALDEx2

y<-melt(getMonteCarloInstances (x)) # extract the data
colnames(y)<-c("asv","mcmc_instance","clr","sample") # tidy for df 

z<-aggregate(y$clr,by=list(y$asv,y$sample),FUN=mean) # take mean across monte-carlo samples
meas_var<-aggregate(y$clr,by=list(y$asv,y$sample),FUN=var) # calculate variance across monte-carlo samples
z$sd<-meas_var$x # add variances to data frame
colnames(z)<-c("asv","sample","clr","meas_var") # correct dimensions 
mean(z$clr) #  will be very close to zero

meta<- df %>% 
  dplyr::select(sample, asv, phylum, abundance, id, age_class, sex) 

CLR <- merge(z, meta, by=c("asv", "sample"), all.x=TRUE) #looks good 
CLR <- read_csv("Data/GLMMMTutorialDatasetCLR.csv")

Visualising CLR: What’s going on in the above code

Each ASV per sample ends up with a CLR value taking into account the abundance of all other ASVs in the sample and a distribution of possible values, from which error around the CLR estimate is obtained

asvs<- paste0("ASV", seq(1:50)) %>% sample(., size=5)

meanCLR <- z %>% mutate(meanCLR=clr) %>% 
  dplyr::select(asv, meanCLR, sample) %>%
  filter(asv %in% asvs) %>% 
  mutate(sample_asv=paste(asv, sample, sep="-")) 

y_t<-y %>% 
  mutate(sample_asv=paste(asv, sample, sep="-")) %>% 
  filter(asv %in% asvs) 

ridges <- merge(y_t %>% dplyr::select(-c("asv", "sample")), meanCLR, by=c("sample_asv"), 
                 all.x=TRUE, all.y=FALSE) %>% 
  arrange(meanCLR, mcmc_instance) %>% 
  mutate(order=row_number())

require(ggridges)
clr_posteriors <- ridges %>% 
  ggplot(aes(x=clr, y=reorder(sample_asv,-meanCLR), group=sample_asv)) +
  scale_x_continuous(limits=c(-15,15))+
  geom_density_ridges(scale=5, size=0.25)+
  theme_minimal() + 
  geom_vline(xintercept = 0, linetype="dashed")+
  theme(axis.text.y = element_blank()) + labs(y="asv", x="clr")

clr_posteriors

CLR model set-up

mf <- 20 # multiplier for model iterations and chains 

Prior2 <- list(R = list(V = diag(1), nu = 0.002),
               G = list(G1 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100),
                        G2 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100)))

CLR run model

Mods[[2]] <-MCMCglmm(clr~ 1,
                  random=~ age_class:asv + asv, # note sample does not need to be included here 
                  mev=CLR$meas_var, # measurement error variance 
                  data=CLR,
                  prior=Prior2,
                  verbose=T, 
                  pr=F, pl=F, 
                  nitt= 13000*mf, 
                  thin=10*mf, burnin=3000*mf) 

Notes: * this also takes a fair bit of time (~40 min for me) but will be more efficient for more complex random effect structures * pr & pl are set to F for the purposes of this tutorial, but if you want the posteriors/ predictions saved, set to T

CLR results

PropVarCLR <- MCMCRep(Mods[[2]]) %>% 
  as.data.frame() %>% 
  mutate_at(c("Mode", "lHPD", "uHPD"), as.numeric) 

PropVarCLR[3,1] <- "mev"

We can compare these variance components to the those of the Poisson model and see how well they match.

We can see below that the variation in community composition associated with age class is about 20% in each model.

PropVarAll <- bind_rows(PropVarPois %>% 
                          mutate(Model="Poisson"), 
                        PropVarCLR %>% 
                          mutate(Model="Gaussian")) 

PropVarAll%>% 
  mutate(Component=fct_relevel(Component, "age_class:asv", "asv", "sample","poisson", "mev", "units")) %>% 
  ggplot(aes(x=Model, y=as.numeric(Mode), fill=Component)) + 
  geom_bar(stat="identity", colour="transparent") + 
  scale_fill_manual(values=MicroColours[c(2,4,6,8,10,12)]) +
  theme_bw(base_size = 14) + 
  #theme(legend.position = "top") + 
  theme_bw(base_size = 14) + 
  labs(x='Model', y='Proportion Variance')

4B. Hierarchical Taxonomic Effects

  • Poisson and CLR models introduced above ask how the asv community varies between age groups, and identify specific asvs which contribute to these differences and may warrant further investigation.
  • Differential abundance results in this manuscript (Figure 2) & tutorial (above) indicate that there is a high degree of variation for effects of asvs, even within the same phylum when visualised grouped
  • We therefore were further interested in from which taxonomic level most variation arises for observed effects.
  • Here we show an alternative to the Poisson model above which considers additional taxonomic levels (ASV, family, phylum) as an example of one approach.
  • We note that this nested taxonomic effects approach is not the only way, and that fitting a phylogenetic tree in the random effects structure offers an additional option which explictly accounts for phylogenetic distance.

Nested taxonomy model set-up

mf <- 20 # multiplier for model iterations and chains 

Prior9 <- list(R = list(V = diag(1), nu = 0.002),
               G = list(G1 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100),
                        G2 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100),
                        G3 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100),
                        G4 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100),
                        G5 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100),
                        G6 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100),
                        G7 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100), 
                        G8 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100),
                        G9 = list(V = diag(1), nu = 0.002, alpha.mu = rep(0,1), alpha.V = diag(1)*100))) 

Nested taxonomy run model

Mods[[3]]<- MCMCglmm(fixed= mcmcF, 
                          random= ~ sample + asv + age_class:asv +  
                            phylum + phylum:sample + age_class:phylum + 
                            fam + fam:sample + age_class:fam, 
                          prior=Prior9,
                          data=df,
                          verbose=T, pr=F, pl=F, # will save posteriors for random effect levels & model predictions for each iteration 
                          family = "poisson",
                          nitt = 13000*mf,
                          thin = 10*mf,burnin=3000*mf) 

Taxonomy results

View proportion variance breakdown with added taxonomic levels

PropVarTaxa <- MCMCRep(Mods[[3]], scale="link") %>% # function from ggregplot to calculate proportion variance for each random effect 
  as.data.frame() %>% 
  mutate_at(c("Mode", "lHPD", "uHPD"), as.numeric)

PropVarTaxa
##           Component         Mode         lHPD       uHPD
## 1            sample 0.0269881669 1.445599e-02 0.05075494
## 2               asv 0.1506133772 1.273715e-01 0.18110125
## 3     age_class:asv 0.1566505516 1.303556e-01 0.18020695
## 4            phylum 0.0002892060 1.428179e-07 0.04702793
## 5     phylum:sample 0.0154113850 6.350466e-03 0.02726147
## 6  age_class:phylum 0.0002328383 6.401795e-07 0.03563875
## 7               fam 0.0003373888 6.033218e-09 0.09344710
## 8        fam:sample 0.0371861831 2.851100e-02 0.05019781
## 9     age_class:fam 0.0339000125 1.513097e-02 0.10228162
## 10            units 0.4514965291 4.091250e-01 0.48510608

Variance components will not add to 1 here because there is some variance attributable to the poisson process. We’ll calculate that and visualise the variance breakdown for this model.

PropVarTaxa %>% pull(Mode) %>% sum -> TotalVar 
1- TotalVar -> TaxasonVar 

PropVarTaxa[11, ] <- c("poisson", TaxasonVar, NA, NA) 

PropVarTaxa %<>% 
  mutate_at(c("Mode", "lHPD", "uHPD"), as.numeric)

PropVarTaxa  %>% 
  ggplot(aes(x=1, y=as.numeric(Mode), fill=Component)) + 
  geom_bar(stat="identity", colour="transparent") + 
  scale_x_continuous(labels=NULL) + 
  scale_fill_manual(values=MicroColours) +
  theme_bw(base_size = 14) + 
  #theme(legend.position = "top") + 
  theme(axis.ticks.x = element_blank(), 
        legend.text = element_text(size=11), 
        legend.title = element_text(size=12)) + 
  labs(x='Nested taxonomy model', y='Proportion Variance')

Most variance is coming from the asv level (age_class:asv), followed by the family (age_class:fam) and finally the phylum (age_class:phylum) level. You can visualise the output of this model following code in part 3 of this tutorial to create a plot comparable to Figure S4 in this manuscript.